How DoorDash Uses LLMs to Evaluate Search Result Pages

At DoorDash, delivering relevant, high-quality search results is essential to ensuring that customers find what they’re looking for quickly and effortlessly. Traditionally, evaluating search relevance relied on human annotations, which posed challenges in scale, latency, consistency, and cost. To solve this, we built AutoEval, a human-in-the-loop system for automated search quality evaluation powered by large language models (LLMs). By combining LLMs with our whole-page relevance (WPR) metric, AutoEval enables scalable, accurate, and near-real-time assessment of search results.
AutoEval has accelerated iteration cycles, improved consistency, and achieved strong alignment with human judgments, even outperforming crowd annotators in key categories. Beyond these efficiency gains, the system frees up expert raters to focus on guideline development, edge cases, and calibration.

Figure 1: Search page on the DoorDash consumer application.
Before we dive into the details of AutoEval and WPR, it’s helpful to understand the limitations of traditional human-driven relevance annotation. For years, DoorDash and many other companies relied on human labelers to evaluate the quality of search results query by query. While effective in small batches, this approach cannot scale with the growing complexity and size of modern search systems. The challenges include:
- Scalability constraints: It isn’t feasible to manually assess millions of query-document pairs, especially as search evolves daily.