Paper announcement: Large-scale product retrieval evaluation using multimodal LLMs

We are excited to share our latest research paper, "Retrieve, Annotate, Evaluate, Repeat: Leveraging Multimodal LLMs for Large-Scale Product Retrieval Evaluation". The paper introduces a new approach to evaluating product retrieval at scale with Multimodal Large Language Models (MLLMs). Evaluated on 20,000 examples, our method shows that MLLMs can help automate the relevance assessment of retrieved products, reaching accuracy comparable to that of human annotators and making evaluation scalable for high-traffic e-commerce platforms.

In summary, our contributions are as follows:

We introduce a multimodal LLM-based evaluation framework for large-scale product retrieval systems. This framework utilizes LLMs (i) to generate context-specific annotation guidelines and (ii) to conduct relevance assessments.
We evaluate the performance of our framework against human annotations on real-world production search queries in a multilingual setting and analyse the different types of errors that humans and LLMs tend to make.
We demonstrate the cost-effectiveness and efficiency of our approach for conducting large-scale evaluations. We also compare the performance of different types of LLMs for relevance assessment, including GPT-4o, GPT-4 Turbo and GPT-3.5 Turbo.
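
To make the comparison against human annotations more concrete, here is a minimal sketch of how LLM relevance labels could be scored against human labels for the same query-product pairs. The label values, the sample data, and the choice of raw agreement plus Cohen's kappa are assumptions for illustration only; they are not necessarily the metrics or data used in the paper.

```python
from collections import Counter


def agreement_rate(human, model):
    """Fraction of query-product pairs where both annotators give the same label."""
    if len(human) != len(model) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohens_kappa(human, model):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n = len(human)
    observed = agreement_rate(human, model)
    human_counts = Counter(human)
    model_counts = Counter(model)
    # Agreement expected if both annotators labelled independently,
    # each following their own label distribution.
    labels = set(human) | set(model)
    expected = sum(human_counts[l] * model_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for six query-product pairs (not data from the paper).
human_labels = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"]
llm_labels = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant", "relevant"]

print("agreement:", round(agreement_rate(human_labels, llm_labels), 3))
print("kappa:", round(cohens_kappa(human_labels, llm_labels), 3))
```

Raw agreement is easy to read, while kappa discounts the agreement that two annotators would reach by chance, which matters when one label dominates the data.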

By leveraging Multimodal LLMs (MLLMs) that analyze both text and images, our framework enables a high level of semantic accuracy in evaluating query-product relevance at scale, ...