Demystifying evals for AI agents

Introduction

Good evaluations help teams ship AI agents more confidently. Without them, it’s easy to get stuck in reactive loops—catching issues only in production, where fixing one failure creates others. Evals make problems and behavioral changes visible before they affect users, and their value compounds over the lifecycle of an agent.

As we described in Building effective agents, agents operate over many turns: calling tools, modifying state, and adapting based on intermediate results. These same capabilities that make AI agents useful—autonomy, intelligence, and flexibility—also make them harder to evaluate.

Through our internal work and with customers at the frontier of agent development, we’ve learned how to design more rigorous and useful evals for agents. Here's what's worked across a range of agent architectures and use cases in real-world deployment.

The structure of an evaluation

An evaluation (“eval”) is a test for an AI system: give an AI an input, then apply grading logic to its output to measure success. In this post, we focus on automated evals that can be run during development without real users.

Single-turn evaluations are straightforward: a prompt, a response, and grading logic. For earlier LLMs, single-turn, non-agentic evals were the main evaluation method. As AI capabilities have advanced, multi-turn evaluations have become increasingly common.
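
To make the single-turn shape concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a prescribed harness: `EvalCase`, `run_eval`, and `toy_model` are hypothetical names, and the model is a stub standing in for a real model call. Each case pairs a prompt with grading logic, and the eval reports the fraction of responses that pass.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One single-turn test: a prompt plus the grading logic for its response."""
    prompt: str
    grade: Callable[[str], bool]


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Send each prompt to the model and return the fraction of passing responses."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for case in cases if case.grade(model(case.prompt)))
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; one answer is deliberately wrong.
    answers = {"What is 2 + 2?": "4", "Capital of France?": "Lyon"}
    return answers.get(prompt, "")


cases = [
    EvalCase("What is 2 + 2?", lambda r: r.strip() == "4"),
    EvalCase("Capital of France?", lambda r: "paris" in r.lower()),
]

score = run_eval(toy_model, cases)
print(f"pass rate: {score:.2f}")
```

Keeping the grader as a plain function per case means exact-match checks, substring checks, or calls to a model-based grader can all share the same loop. A multi-turn eval would need to grade more than a single response, which this sketch does not attempt.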
