AutoAgent：首个用于自优化智能体的开源库

today we're releasing AutoAgent, an open source library for autonomously improving an agent on any domain.

今天我们发布 AutoAgent，一个用于在任何领域自主改进 agent 的开源库。

AutoAgent hit both the #1 on SpreadsheetBench (96.5%) and the #1 GPT-5 score on TerminalBench (55.1%) after optimizing for 24+ hours

AutoAgent 在优化 24+ 小时后，同时在 SpreadsheetBench (96.5%) 和 TerminalBench (55.1%) 的 GPT-5 分数上达到了 #1

every other entry on those leaderboards was hand-engineered. ours wasn't.

那些排行榜上的其他所有条目都是手工设计的。我们的是自主的。

agents have been bottlenecked by harness engineering, yet we're still doing primitive grid search: tweak, eval, read error traces, repeat

代理一直被 harness 工程所瓶颈，然而我们仍在进行原始的网格搜索：调整、评估、阅读错误追踪、重复

this is the first concrete evidence that an agent can autonomously beat manual harness tuning on production benchmarks.

这是第一个具体证据，证明一个代理可以在生产基准测试上自主击败手动 harness 调优。

code is available here

代码可在此处获取

here's what it does

它做了什么

point AutoAgent at a task domain with evals. a meta-agent experiments on a task agent's harness: tweaking prompts, adding tools, refining orchestration until performance climbs.

将 AutoAgent 指向一个带有 evals 的任务领域。meta-agent 在 task agent 的 harness 上实验：调整提示、添加工具、优化编排，直到性能提升。

the setup is minimal by design:

设计上是极简的：

the task agent starts with just a bash tool
任务 agent 仅从一个 bash 工具开始
program.md gives the meta-agent its research direction
program.md 为 meta-agent 提供其研究方向
agent.py is the task agent
agent.py 是任务代理
a Harbor adapter connects to your benchmark
Harbor 适配器连接到你的基准测试

the meta-agent then spins up 1000s of parallel sandboxes to improve the task agent. 24 hours later it has domain-specific tooling, verification loops, and orchestration logic. all discovered autonomously

然后元代理会启动数千个并行沙箱来改进任务代理。24 小时后，它就拥有了领域特定工具、验证循环和编排逻辑。全部自主发现

the loop:

循环：

1. edit the agent's harness 2. run it on tasks 3. measure performance 4. read failure traces 5. keep improvements, revert failures 6. repeat

1. 编辑智能体的测试框架 2. 在任务上运行它 3. 测量性能 4. 阅读失败痕迹 5. 保留改进，撤销失败 6. 重复

why this works: seeing like an agent

为什么奏效：像代理一样看待

we discovered agents are better at understanding agents tha...