Deep Agents: The Harness Behind Claude Code, Codex, Manus, and OpenClaw
The LangChain team’s agent went from 52.8% to 66.5% on Terminal Bench 2.0, a jump from outside the Top 30 into the Top 5, achieved purely by changing the harness, not the model.
In a moment, I’ll walk you through the biggest lessons and hard-won best practices for building agent harnesses, drawing both from my own experience and from the work of frontier teams like Anthropic, OpenAI, and LangChain.
But first, a story that explains why this matters.
A year ago, I was building an agent for a client; it needed to optimize live marketing campaigns over long-running execution windows.
This is what the overall solution looked like.

The task initially sounded straightforward: ingest campaign performance data, generate recommendations, apply budget and targeting adjustments, monitor outcomes, and keep iterating until the campaign hit its efficiency goals.
I had a strong model, clean tools, and a workflow that looked solid.
Just ship it, right?
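That naive single-session workflow can be sketched as a simple loop. Everything here is a hypothetical stand-in, not the client's actual code: the function names, the `efficiency` metric, and the stub behaviors are all illustrative.

```python
# Hypothetical sketch of the naive one-shot loop: ingest -> recommend ->
# adjust -> re-check, until the efficiency goal is hit or rounds run out.
# All names and stubs are illustrative, not the real system.

def ingest_performance(campaign):
    # Stub: in reality this would call the ad platform's reporting API.
    return {"efficiency": campaign["efficiency"]}

def generate_recommendations(metrics):
    # Stub: in reality a model would propose budget/targeting changes.
    return {"budget_delta": 0.1}

def apply_adjustments(campaign, recs):
    # Stub: pretend each adjustment nudges efficiency upward.
    campaign["efficiency"] += recs["budget_delta"]

def optimize_campaign(campaign, goal, max_rounds=10):
    """Iterate until the campaign hits its efficiency goal (or give up)."""
    for _ in range(max_rounds):
        metrics = ingest_performance(campaign)
        if metrics["efficiency"] >= goal:
            return metrics
        recs = generate_recommendations(metrics)
        apply_adjustments(campaign, recs)
    return ingest_performance(campaign)

result = optimize_campaign({"efficiency": 0.5}, goal=0.8)
```

The sketch looks solid on paper for exactly the reason the real workflow did: every assumption baked into it (the job finishes in one session, every tool call returns promptly) is about to be broken.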
It worked beautifully right up until reality showed up.
The job didn’t finish in one neat session.
It ran for hours.
- Sometimes it had to wait on delayed…