Deep Agents: The Harness Behind Claude Code, Codex, Manus, and OpenClaw

The LangChain team's agents moved from 52.8% to 66.5% on Terminal Bench 2.0, jumping from outside the Top 30 into the Top 5, by changing only the harness, not the model.

In a moment, I’ll walk you through the biggest lessons and hard-won best practices for building agent harnesses, drawing both from my own experience and from the work of frontier teams like Anthropic, OpenAI, and LangChain.

But first, a story that explains why this matters.

A year ago, I was building an agent for a client that needed to optimize live marketing campaigns over long-running execution windows.

This is what the overall solution looked like.

The task initially sounded straightforward: ingest campaign performance data, generate recommendations, apply budget and targeting adjustments, monitor outcomes, and keep iterating until the campaign hit its efficiency goals.

I had a strong model, clean tools, and a workflow that looked solid.

Just ship it, right?

It worked beautifully right up until reality showed up.

The job didn’t finish in one neat session.

It ran for hours.

Sometimes it had to wait on delayed…