Better Harness: A Recipe for Harness Hill-Climbing with Evals

TL;DR: We can build better agents by building better harnesses. But to autonomously build a “better” harness, we need a strong learning signal to “hill-climb” on. We share how we use evals as that signal, plus design decisions that help our agent generalize instead of overfit. Better-Harness is a prototype system for iteratively sourcing and improving your harness with evals.

Evals are training data for Agents

In classical machine learning, training data guides the model’s learning process. Each training example contributes a gradient that updates the model’s weights toward “correctness.” We have a similar learning loop for agents.

model + training data + gradient descent → better model

harness + evals + harness engineering → better agent

Evals encode the behavior we want our agent to exhibit in production. They’re the “training data” for harness engineering. Each eval case contributes a signal such as “did the agent take the right action?” or “did it produce the right outcome?” That signal guides the next proposed edit to the harness.
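
To make the loop concrete, here is a minimal sketch of hill-climbing a harness against evals. Everything in it is an assumption for illustration, not the Better-Harness implementation: the harness is modeled as a hypothetical dictionary of feature flags, each eval case is a simple check standing in for a real agent run, and candidate edits are supplied as a fixed list rather than proposed by a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in: a harness is modeled as a set of feature flags.
Harness = Dict[str, bool]


@dataclass(frozen=True)
class EvalCase:
    name: str
    # Returns True if the agent behaved correctly under this harness.
    # In a real system this would run the agent; here it is a toy check.
    check: Callable[[Harness], bool]


def score(harness: Harness, evals: List[EvalCase]) -> float:
    """Fraction of eval cases passed. This is the learning signal."""
    return sum(case.check(harness) for case in evals) / len(evals)


def hill_climb(
    harness: Harness,
    candidate_edits: List[Harness],
    evals: List[EvalCase],
) -> Tuple[Harness, float]:
    """Apply each candidate edit in turn and keep it only if the score improves."""
    best, best_score = dict(harness), score(harness, evals)
    for edit in candidate_edits:
        candidate = {**best, **edit}
        candidate_score = score(candidate, evals)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy eval cases (assumed names) encoding the behavior we want.
evals = [
    EvalCase("reads_file_before_editing", lambda h: h.get("read_before_edit", False)),
    EvalCase("runs_tests_after_change", lambda h: h.get("run_tests", False)),
    EvalCase("stays_concise", lambda h: not h.get("verbose_prompt", False)),
]

baseline = {"read_before_edit": False, "run_tests": False, "verbose_prompt": False}
edits = [{"verbose_prompt": True}, {"read_before_edit": True}, {"run_tests": True}]

best, best_score = hill_climb(baseline, edits, evals)
print(best, best_score)
```

In this toy run the first edit lowers the score and is rejected, while the other two each raise it and are kept. A real loop would also score accepted edits on held-out eval cases so that the harness generalizes instead of overfitting to the cases it climbed on.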

The same rigor and care we put into data quality and curation for model training should also go into eval design. We discuss the importance of data quality in a previous post on how we build evals for Deep Agents.

There is some great recent work that formalizes the steps to optimize harnesses, including Meta-Harness from Stanford and Auto-Harness from DeepMind. We also previously shared a Harness Improvement Loop to hill-climb Terminal Bench 2.0 by just tweaking the harness ...