Harness design for long-running application development

Written by Prithvi Rajasekaran, a member of our Labs team.

Over the past several months I’ve been working on two interconnected problems: getting Claude to produce high-quality frontend designs, and getting it to build complete applications without human intervention. This work originated with earlier efforts on our frontend design skill and long-running coding agent harness, where my colleagues and I were able to improve Claude’s performance well above baseline through prompt engineering and harness design—but both eventually hit ceilings.

To break through, I sought out novel AI engineering approaches that held across two quite different domains, one defined by subjective taste, the other by verifiable correctness and usability. Taking inspiration from Generative Adversarial Networks (GANs), I designed a multi-agent structure with a generator and evaluator agent. Building an evaluator that graded outputs reliably—and with taste—meant first developing a set of criteria that could turn subjective judgments like “is this design good?” into concrete, gradable terms.
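
To make "concrete, gradable terms" tangible, here is a minimal sketch of how a rubric might be represented and combined into a single score. It is an illustration under stated assumptions, not the rubric used in this work: the `Criterion` type, the three example criteria, their weights, and the 0 to 10 scale are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One gradable question that stands in for part of a subjective judgment."""
    name: str
    question: str
    weight: float


# Hypothetical criteria and weights, chosen only for illustration.
RUBRIC = [
    Criterion("hierarchy", "Is there a clear visual hierarchy?", 2.0),
    Criterion("consistency", "Are spacing, color, and type used consistently?", 1.0),
    Criterion("originality", "Does the design avoid generic template defaults?", 1.0),
]


def weighted_score(criteria, scores):
    """Combine per-criterion scores (assumed 0 to 10) into a weighted mean."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    missing = [c.name for c in criteria if c.name not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return sum(c.weight * scores[c.name] for c in criteria) / total_weight
```

In a setup like this, the evaluator would answer each question separately and return one score per criterion, so a low overall grade can be traced back to the specific question that caused it.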

I then applied these techniques to long-running autonomous coding, carrying over two lessons from our earlier harness work: decomposing the build into tractable chunks, and using structured artifacts to hand off context between sessions. The final result was a three-agent architecture—planner, generator, and evaluator—that produced rich full-stack applications over multi-hour autonomous coding sessions.
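
The control flow of a planner, generator, and evaluator working together could be sketched roughly as follows. This is a hedged outline, not the harness described here: `run_harness`, the callable signatures, the `threshold` and `max_rounds` parameters, and the JSON shape of the handoff artifact are all assumptions made for illustration. In a real system each callable would wrap a model session.

```python
import json


def run_harness(brief, planner, generator, evaluator, threshold=7.0, max_rounds=3):
    """Plan the build as chunks, then generate and grade each chunk in turn.

    planner(brief) -> list of chunk descriptions
    generator(chunk, context, feedback) -> output for that chunk
    evaluator(chunk, output) -> (score, feedback)
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    chunks = planner(brief)
    # Structured artifact that carries context between sessions (assumed shape).
    handoff = {"brief": brief, "completed": []}

    for chunk in chunks:
        feedback = ""
        for round_no in range(1, max_rounds + 1):
            # Each generator call sees only the serialized artifact, not prior chat history.
            context = json.dumps(handoff)
            output = generator(chunk, context, feedback)
            score, feedback = evaluator(chunk, output)
            if score >= threshold:
                break
        # Record the result even if the threshold was never reached.
        handoff["completed"].append(
            {"chunk": chunk, "output": output, "score": score, "rounds": round_no}
        )
    return handoff
```

Passing the three roles in as plain callables keeps the loop testable with stubs, and serializing the artifact on every call makes the handoff boundary between sessions explicit.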