RL Environments and the Hierarchy of Agentic Capabilities
2025 has been the year of agents, with AI moving out of the chat box and into the real world. But are we really close to having generally intelligent agents, or are they still a decade away? The trillion-dollar question: how much economically useful work can these agents actually do?
To answer that question, model training and evaluation have shifted from rating individual responses to assessing multi-step tasks with tool use. For those involved in testing and post-training, 2025 is the year of RL environments: virtual worlds where models can act, experiment, and learn through realistic multi-step tasks.
We "hired" nine AI models to perform 150 tasks in one of our RL environments. These were the results:

Even GPT-5 and Claude Sonnet 4.5 failed over 40% of agentic tasks in one of our RL environments.
Two things are obvious:
- GPT-5 and Claude Sonnet 4.5 are in a league of their own.
- But even GPT-5 and Claude fail over 40% of tasks.
The raw scores tell us who’s winning, but not why, or how we can push forward. To understand what these results reveal about real-world agents, we need to look at how a realistic RL environment is built, or, more accurately, grown.
Growing an RL environment
Every RL environment needs three things:
- A coherent world model: the big-picture structure that defines the setting.
- A set of entities: the objects within the world and their relationships.
- A tool system: the interface for agents to interact with the entities.
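To make the three ingredients concrete, here is a minimal sketch of how they might fit together. It is not our actual environment code: the class names (`World`, `Entity`, `ToolSystem`), the two tools, and the toy online-shop setting are all hypothetical, chosen only to illustrate the structure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Entity:
    """An object in the world, with links to related entities (hypothetical schema)."""
    entity_id: str
    kind: str
    attributes: Dict[str, str] = field(default_factory=dict)
    related_ids: List[str] = field(default_factory=list)


@dataclass
class World:
    """The big-picture setting: a description plus the entities that live in it."""
    description: str
    entities: Dict[str, Entity] = field(default_factory=dict)

    def add(self, entity: Entity) -> None:
        self.entities[entity.entity_id] = entity


class ToolSystem:
    """The interface through which an agent reads or changes entities."""

    def __init__(self, world: World) -> None:
        self.world = world
        self._tools: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, fn: Callable[..., object]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs: str) -> object:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        # Every tool receives the world as its first argument.
        return self._tools[name](self.world, **kwargs)


def lookup_entity(world: World, entity_id: str) -> Dict[str, str]:
    """Read-only tool: return an entity's kind and attributes."""
    entity = world.entities[entity_id]
    return {"kind": entity.kind, **entity.attributes}


def set_attribute(world: World, entity_id: str, key: str, value: str) -> str:
    """Mutating tool: change one attribute on an entity."""
    world.entities[entity_id].attributes[key] = value
    return "ok"


def build_demo_environment() -> ToolSystem:
    """Assemble a toy world (an assumed online-shop setting) and its tools."""
    world = World(description="A toy online shop (hypothetical setting)")
    world.add(Entity("cust-1", "customer", {"name": "Ada"}, related_ids=["order-1"]))
    world.add(Entity("order-1", "order", {"status": "pending"}, related_ids=["cust-1"]))
    tools = ToolSystem(world)
    tools.register("lookup_entity", lookup_entity)
    tools.register("set_attribute", set_attribute)
    return tools


if __name__ == "__main__":
    tools = build_demo_environment()
    tools.call("set_attribute", entity_id="order-1", key="status", value="shipped")
    print(tools.call("lookup_entity", entity_id="order-1"))
```

In this sketch the agent never touches `World` directly; it only issues tool calls. That separation is one plausible way to keep every action observable, which matters when a multi-step task has to be scored afterwards.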
To train models to become competent virtual ...