How we build evals for Deep Agents

TLDR: The best agent evals directly measure an agent behavior we care about. Here’s how we source data, create metrics, and run well-scoped, targeted experiments over time to make agents more accurate and reliable.

Evals shape agent behavior

We’ve been curating evaluations to measure and improve Deep Agents. Deep Agents is an open-source, model-agnostic agent harness that powers products like Fleet and Open SWE. Evals define and shape agent behavior, which is why it’s so important to design them thoughtfully.

Every eval is a vector that shifts the behavior of your agentic system. For example, if an eval for efficient file reading fails, you’ll likely tweak the system prompt or the read_file tool description to nudge behavior until it passes. Every eval you keep applies pressure on the overall system over time.
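
To make the file-reading example concrete, here is a minimal sketch of what a targeted eval for that one behavior could look like. Everything in it is an assumption for illustration: the ToolCall record, the score_file_reading function, the "path" argument name, and the max_reads budget are hypothetical and are not part of the Deep Agents API. Only the read_file tool name comes from the text above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation recorded from an agent run (hypothetical shape)."""
    name: str
    args: dict


def score_file_reading(trajectory: list[ToolCall], max_reads: int) -> dict:
    """Score a single run on one behavior: reading files efficiently.

    A run passes if it stays within the read budget and never reads
    the same path twice. Both criteria are illustrative assumptions.
    """
    # Collect the path of every read_file call, in order.
    reads = [c.args.get("path", "") for c in trajectory if c.name == "read_file"]
    # Paths read more than once indicate redundant work.
    repeated = sorted(p for p, n in Counter(reads).items() if n > 1)
    return {
        "read_calls": len(reads),
        "repeated_paths": repeated,
        "passed": len(reads) <= max_reads and not repeated,
    }
```

Because the metric looks at one behavior only, a failure points at a narrow set of fixes, such as the system prompt or the read_file tool description, rather than at the agent as a whole.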

It is crucial to be thoughtful when adding evals. It can be tempting to blindly add hundreds (or thousands) of tests. This leads to an illusion of “improving your agent” by scoring well on an eval suite that may not accurately reflect behaviors you care about in production.

More evals ≠ better agents. Instead, build targeted evals that reflect desired behaviors in production.

When building Deep Agents, we catalog the behaviors that matter in production, such as retrieving content across multiple files in the filesystem or accurately composing 5+ tool calls in sequence. Rather than using benchmark tasks in aggregate, we take the following approach to eval curation:
