How we optimized Dash's relevance judge with DSPy

Dropbox Dash brings your files, messages, and team’s knowledge together in one place, so you can ask questions and get useful answers that are actually grounded in your company’s context. Under the hood, that experience relies heavily on one deceptively simple capability: reliably judging which results are relevant to a query at scale. Relevance judges are used across multiple pipelines like ranking, training data generation, and offline evaluation. Without systematic optimization, they can become a primary source of regressions, cost blowups, and loss of trust as models change.

Making a relevance judge work in production is harder than it looks. A prototype might lean on a state-of-the-art model, but real systems have latency and cost budgets, which usually means migrating to smaller or cheaper models. The catch is that prompts often don’t transfer cleanly across models. We ran into this while scaling our LLM-as-a-judge work: manual prompt tuning got us to a functioning judge, but quality plateaued early and every model swap—or even a small prompt edit—risked regressions in unexpected cases. 

To address prompt brittleness and scale up relevance label generation for the long tail of candidates, we brought in DSPy. DSPy is an open-source framework for systematically optimizing prompts against a measurable objective, turning a manual, fragile process into a repeatable optimization loop. In this article, we’ll show how we defined that objective, used DSPy to adapt our judge across models, and made the judge both cheaper and more reliable in production.
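
To make the idea of "optimizing prompts against a measurable objective" concrete, here is a minimal plain-Python sketch of such a loop. It does not use DSPy's API and is not the method described in this article: the 1 to 5 label scale, the mean-absolute-error metric, the names `mean_abs_error`, `pick_best_prompt`, and `toy_make_judge`, and the example data are all hypothetical, and the toy judge stands in for a real LLM call.

```python
from typing import Callable, List, Tuple

# Hypothetical labeled example: (query, document, human relevance label on an assumed 1-5 scale).
LabeledExample = Tuple[str, str, int]
Judge = Callable[[str, str], int]


def mean_abs_error(judge: Judge, examples: List[LabeledExample]) -> float:
    """Average gap between judge scores and human labels (lower is better)."""
    return sum(abs(judge(q, d) - label) for q, d, label in examples) / len(examples)


def pick_best_prompt(
    prompts: List[str],
    make_judge: Callable[[str], Judge],
    examples: List[LabeledExample],
) -> Tuple[float, str]:
    """Score every candidate prompt on the same examples and return (error, prompt) for the best."""
    scored = [(mean_abs_error(make_judge(p), examples), p) for p in prompts]
    return min(scored)


def toy_make_judge(prompt: str) -> Judge:
    """Stand-in for an LLM-backed judge; a real system would call a model with `prompt`."""
    def judge(query: str, doc: str) -> int:
        if "term overlap" not in prompt:
            return 3  # an uninformative prompt yields a constant middle score
        hits = sum(term in doc.lower() for term in query.lower().split())
        return min(5, 1 + 2 * hits)
    return judge


EXAMPLES: List[LabeledExample] = [
    ("quarterly roadmap", "Q3 quarterly roadmap for the platform team", 5),
    ("quarterly roadmap", "Lunch menu for Friday", 1),
    ("expense policy", "Updated travel expense guidelines", 3),
]
PROMPTS = [
    "Rate relevance from 1 to 5.",
    "Rate relevance from 1 to 5 based on term overlap with the query.",
]

best_error, best_prompt = pick_best_prompt(PROMPTS, toy_make_judge, EXAMPLES)
print(best_error, best_prompt)
```

Because every candidate is scored against the same labeled set, a model swap or prompt edit can be re-evaluated by rerunning the loop rather than by spot-checking outputs by hand.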
