How we used DSPy to turn AI evaluations into better responses in Dash chat

The AI features in Dropbox bring together company knowledge from documents, messages, meetings, and other sources, so users can ask questions in one place and get answers from the Dash chat agent. We measure agent quality, meaning how well the chat agent helps users accomplish their goals, with a suite of LLM-as-judge evaluations. These evaluations show how well the agent is performing and where it can improve. Rather than judging only the final response, they inspect the full trajectory the agent takes to satisfy a user’s goal: how it interprets intent, gathers context, uses tools, handles ambiguity, grounds its answer, and completes the task.

We built agent evaluations as the foundation for improving the chat agent. They power the judges that score chat outcomes, given the context available to the agent, on dimensions including relevance, reasoning quality, evidence use, robustness, task completion, and alignment with user asks. Once we had that foundation, we used DSPy to turn evaluation into improvement. DSPy is an open-source framework for optimizing AI systems using evaluation feedback.

We applied DSPy and its optimization algorithms in two stages. First, we used it to improve the judges themselves, calibrating them against a small set of human-labeled examples so their scores better matched human judgment. Then, we used those improved judges to optimize the chat agent’s system prompt. This created a feedback loop: human labels improved the judges, the judges produced scalable evaluation signals, and those signals improved the agent. As a result, users saw ...
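The two-stage loop described above can be illustrated with a small, library-free sketch. It does not use DSPy or reproduce the actual Dash pipeline: plain best-of-N selection stands in for DSPy's optimizers, and every name here (`LabeledExample`, `agreement_rate`, `select_best`, the stub judges and agents, the 1-to-5 score scale) is a hypothetical chosen for illustration. Stage 1 picks the judge that agrees most often with human labels; stage 2 reuses that judge as the scoring signal for choosing between agent variants.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class LabeledExample:
    trajectory: str   # text of an agent trajectory (hypothetical format)
    human_score: int  # human rating on an assumed 1-5 scale


Judge = Callable[[str], int]


def agreement_rate(judge: Judge, examples: Sequence[LabeledExample],
                   tolerance: int = 0) -> float:
    """Fraction of examples where the judge is within `tolerance` of the human label."""
    if not examples:
        raise ValueError("need at least one labeled example")
    hits = sum(abs(judge(ex.trajectory) - ex.human_score) <= tolerance
               for ex in examples)
    return hits / len(examples)


def select_best(candidates: Dict[str, object],
                score: Callable[[object], float]) -> str:
    """Return the name of the highest-scoring candidate (simple best-of-N search)."""
    return max(candidates, key=lambda name: score(candidates[name]))


# Stub judges standing in for LLM-as-judge prompts (assumptions, not real judges).
def lenient_judge(trajectory: str) -> int:
    return 5


def keyword_judge(trajectory: str) -> int:
    return 5 if "cited" in trajectory else 2


LABELED = [
    LabeledExample("answer cited two source docs", 5),
    LabeledExample("answer with no evidence", 2),
    LabeledExample("cited meeting notes, completed task", 5),
]

# Stage 1: calibrate the judge against human labels.
JUDGES: Dict[str, Judge] = {"lenient": lenient_judge, "keyword": keyword_judge}
best_judge_name = select_best(JUDGES, lambda j: agreement_rate(j, LABELED))
judge = JUDGES[best_judge_name]


# Stub agents standing in for the chat agent under different system prompts.
def terse_agent(question: str) -> str:
    return f"answer to {question!r} with no evidence"


def grounded_agent(question: str) -> str:
    return f"answer to {question!r}, cited source docs"


QUESTIONS = ["Where is the Q3 plan?", "Who owns the launch?"]


def mean_judge_score(agent: Callable[[str], str]) -> float:
    return sum(judge(agent(q)) for q in QUESTIONS) / len(QUESTIONS)


# Stage 2: use the calibrated judge as the signal for picking the agent variant.
AGENTS = {"terse": terse_agent, "grounded": grounded_agent}
best_agent_name = select_best(AGENTS, mean_judge_score)
```

The same `select_best` helper serves both stages because only the scoring function changes: agreement with human labels in stage 1, mean judge score in stage 2. In a real setup an optimizer would propose and refine candidates rather than choose from a fixed dictionary.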