Search Quality Assurance with AI as a Judge
In 2024, Zalando research published a paper on LLM-as-a-judge for search quality assurance at scale. The framework lets scientists and developers evaluate the semantic relevance of search results for given search queries at large scale, with multi-language support. This capability has strong potential to help the Search engineering team quickly identify and fix search issues, as we will walk through in this post.
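To make the idea concrete, here is a minimal sketch of what judging relevance per query could look like. It is not the framework from the paper: `SearchResult`, `relevance_rate_per_query`, `flag_low_quality`, and the 0.5 threshold are hypothetical names and values chosen for illustration, and `keyword_overlap_judge` is a toy stand-in for the place where a real setup would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SearchResult:
    """One (query, returned product) pair to be judged. Hypothetical schema."""
    query: str
    language: str
    product_title: str


# Hypothetical judge interface: True means "this product is relevant to the query".
Judge = Callable[[SearchResult], bool]


def relevance_rate_per_query(results: List[SearchResult], judge: Judge) -> Dict[str, float]:
    """Return, for each query, the fraction of its results the judge marks relevant."""
    relevant: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for result in results:
        total[result.query] = total.get(result.query, 0) + 1
        relevant[result.query] = relevant.get(result.query, 0) + int(judge(result))
    return {query: relevant[query] / total[query] for query in total}


def flag_low_quality(rates: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return queries whose relevance rate is strictly below the threshold."""
    return sorted(query for query, rate in rates.items() if rate < threshold)


def keyword_overlap_judge(result: SearchResult) -> bool:
    """Toy stand-in for an LLM call: relevant if query and title share a word."""
    query_tokens = set(result.query.lower().split())
    title_tokens = set(result.product_title.lower().split())
    return bool(query_tokens & title_tokens)


if __name__ == "__main__":
    sample = [
        SearchResult("red dress", "en", "Red summer dress"),
        SearchResult("red dress", "en", "Blue jeans"),
        SearchResult("sneakers", "en", "Leather belt"),
    ]
    rates = relevance_rate_per_query(sample, keyword_overlap_judge)
    print(rates)
    print(flag_low_quality(rates))
```

Keeping the judge behind a plain callable means the aggregation logic stays the same whether the verdict comes from a keyword heuristic, a human annotation file, or an LLM prompt.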
Real-world use case: Launching a new country
In 2025, Zalando expanded its fashion store business into three new countries: Luxembourg, Portugal and Greece. Ensuring that these markets have a good search experience is critical to the success of the launch, but how can we do that without any prior search data from real users?
Before we used LLM-as-a-judge, search quality assurance relied heavily on human experts and the following manual process:
- Because the new markets are not live yet, we do not know which search queries will work well there. We therefore have to draw sample search queries from the existing markets, translate them if the new market operates in a different language, and test the search system manually. Human experts then annotate error cases and identify cases where search returns poor-quality results.
- Root cause diagnosis in both scenarios (errors / poor results) is also performed by the same experts.
Not only is this process not scalable, but it is also reactive by nature, meaning that issues are only identified after features are launched and user...