Deep Agents: The Harness Behind Claude Code, Codex, Manus, and OpenClaw
The LangChain team’s agent went from 52.8% to 66.5% on Terminal Bench 2.0, a jump from outside the Top 30 into the Top 5, achieved purely by changing the harness, not the model.
In a moment, I’ll walk you through the biggest lessons and hard-won best practices for building agent harnesses, drawing both from my own experience and from the work of frontier teams like Anthropic, OpenAI, and LangChain.
But first, a story that explains why this matters.
A year ago, I was building an agent for a client; it needed to optimize live marketing campaigns over long-running execution windows.
This is what the overall solution looked like.

The task initially sounded straightforward: ingest campaign performance data, generate recommendations, apply budget and targeting adjustments, monitor outcomes, and keep iterating until the campaign hit its efficiency goals.
I had a strong model, clean tools, and a workflow that looked solid.
Just ship it, right?
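That naive single-session workflow can be sketched as a simple loop. Everything here is a hypothetical stand-in, not the client's actual code: the function names, the `efficiency` metric, and the stub behaviors are all illustrative.

```python
# Hypothetical sketch of the naive one-shot loop: ingest -> recommend ->
# adjust -> re-check, until the efficiency goal is hit or rounds run out.
# All names and stubs are illustrative, not the real system.

def ingest_performance(campaign):
    # Stub: in reality this would call the ad platform's reporting API.
    return {"efficiency": campaign["efficiency"]}

def generate_recommendations(metrics):
    # Stub: in reality a model would propose budget/targeting changes.
    return {"budget_delta": 0.1}

def apply_adjustments(campaign, recs):
    # Stub: pretend each adjustment nudges efficiency upward.
    campaign["efficiency"] += recs["budget_delta"]

def optimize_campaign(campaign, goal, max_rounds=10):
    """Iterate until the campaign hits its efficiency goal (or give up)."""
    for _ in range(max_rounds):
        metrics = ingest_performance(campaign)
        if metrics["efficiency"] >= goal:
            return metrics
        recs = generate_recommendations(metrics)
        apply_adjustments(campaign, recs)
    return ingest_performance(campaign)

result = optimize_campaign({"efficiency": 0.5}, goal=0.8)
```

The sketch looks solid on paper for exactly the reason the real workflow did: every assumption baked into it (the job finishes in one session, every tool call returns promptly) is about to be broken.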
It worked beautifully right up until reality showed up.
The job didn’t finish in one neat session.
It ran for hours.
- Sometimes it had to wait on delayed…