How we build evals for Deep Agents

TLDR: The best agent evals directly measure an agent behavior we care about. Here’s how we source data, create metrics, and run well-scoped, targeted experiments over time to make agents more accurate and reliable.

Evals shape agent behavior

We’ve been curating evaluations to measure and improve Deep Agents. Deep Agents is an open-source, model-agnostic agent harness that powers products like Fleet and Open SWE. Evals define and shape agent behavior, which is why it’s so important to design them thoughtfully.

Every eval is a vector that shifts the behavior of your agentic system. For example, if an eval for efficient file reading fails, you’ll likely tweak the system prompt or the read_file tool description to nudge behavior until it passes. Every eval you keep applies pressure on the overall system over time.
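
To make the file-reading example concrete, here is a minimal sketch of what such a targeted eval could look like. It is not the Deep Agents implementation: the ToolCall record, the file_path argument name, and the "at most one read per file" rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """Hypothetical record of one tool call taken from an agent trace."""
    name: str
    args: dict = field(default_factory=dict)


def efficient_file_reading(trace: list[ToolCall], max_reads_per_file: int = 1) -> bool:
    """Pass if no single file is read more than max_reads_per_file times.

    Assumes read_file takes its target under a "file_path" key (an assumption,
    not a documented signature).
    """
    reads: dict[str, int] = {}
    for call in trace:
        if call.name == "read_file":
            path = call.args.get("file_path", "")
            reads[path] = reads.get(path, 0) + 1
    return all(count <= max_reads_per_file for count in reads.values())


# A trace that reads a.py once passes; re-reading the same file fails.
good = [ToolCall("read_file", {"file_path": "a.py"}),
        ToolCall("edit_file", {"file_path": "a.py"})]
bad = good + [ToolCall("read_file", {"file_path": "a.py"})]
print(efficient_file_reading(good), efficient_file_reading(bad))
```

A check this narrow measures one behavior only, so when it fails it points directly at what to adjust, such as the system prompt or the read_file tool description.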

It is crucial to be thoughtful when adding evals. It can be tempting to blindly add hundreds (or thousands) of tests. This leads to an illusion of “improving your agent” by scoring well on an eval suite that may not accurately reflect behaviors you care about in production.

More evals ≠ better agents. Instead, build targeted evals that reflect desired behaviors in production.

When building Deep Agents, we catalog the behaviors that matter in production, such as retrieving content across multiple files in the filesystem or accurately composing 5+ tool calls in sequence. Rather than using benchmark tasks in aggregate, we take the following approach to eval curation:
