Accelerating Responses API agentic workflows with WebSockets

When you ask Codex to fix a bug, it scans through your codebase for relevant files, reads them to build context, makes edits, and runs tests to verify the fix worked. Under the hood, that means dozens of back-and-forth Responses API requests: determine the model’s next action, run a tool on your computer, send the tool output back to the API, and repeat.
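The loop above can be sketched in miniature. The following is a toy simulation, not Codex's actual implementation: `fake_model` stands in for a Responses API call, `run_tool` stands in for local tool execution, and every iteration corresponds to one synchronous API round trip.

```python
# Toy simulation of the agent loop described above. All names and the
# hard-coded tool sequence are illustrative, not real Codex internals.

def fake_model(history):
    """Stand-in for one Responses API call: returns the model's next action."""
    steps = ["rg", "sed", "apply_patch", "pytest"]  # search, read, edit, test
    if len(history) < len(steps):
        return {"type": "tool_call", "tool": steps[len(history)]}
    return {"type": "message", "text": "The bug has been fixed."}

def run_tool(name):
    """Stand-in for running a tool on the user's machine."""
    return f"<output of {name}>"

def agent_loop():
    history, round_trips = [], 0
    while True:
        action = fake_model(history)          # one full API round trip
        round_trips += 1
        if action["type"] == "message":       # model has finished the task
            return action["text"], round_trips
        history.append(run_tool(action["tool"]))  # tool output feeds the next request

final, n = agent_loop()
print(final, n)  # every tool call cost a complete request/response cycle
```

Even this four-tool toy pays the per-request API overhead five times; a real bug fix can involve dozens of such round trips.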


All of these requests can add up to minutes that users spend waiting for Codex to complete complex tasks. From a latency perspective, the Codex agent loop spends most of its time in three main stages: working in the API services (to validate and process requests), model inference, and client-side time (running tools and building model context). Inference is the stage where the model runs on GPUs to generate new tokens. In the past, running LLM inference on GPUs was the slowest part of the agentic loop, so API service overhead was easy to hide. As inference gets faster, the cumulative API overhead from an agentic rollout becomes much more noticeable.


In this post, we'll explain how we made agent loops using the API 40% faster end-to-end, letting users experience the jump in inference speed from 65 to nearly 1,000 tokens per second. We approached this through caching, eliminating unnecessary network hops, improving our safety stack to quickly flag issues, and—most importantly—building a way to create a persistent connection to the Responses API, instead of having to make a series of synchronous API calls.
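A back-of-the-envelope model makes the arithmetic concrete. All numbers below are made-up placeholders, not measured values: with per-request HTTP, fixed connection-setup and validation overhead is paid on every round trip, while a persistent connection pays it roughly once.

```python
# Illustrative latency model (placeholder numbers, not measurements):
# per-request overhead is paid once per round trip over plain HTTP,
# but roughly once total over a persistent connection.

def total_latency(round_trips, inference_ms, tool_ms,
                  per_request_overhead_ms, persistent=False):
    overhead = per_request_overhead_ms * (1 if persistent else round_trips)
    return round_trips * (inference_ms + tool_ms) + overhead

http = total_latency(50, inference_ms=300, tool_ms=100,
                     per_request_overhead_ms=200)
ws = total_latency(50, inference_ms=300, tool_ms=100,
                   per_request_overhead_ms=200, persistent=True)
print(http, ws)  # → 30000 20200
```

As inference time per round trip shrinks, the fixed overhead term dominates, which is why a persistent connection matters more now than it did when GPUs were the bottleneck.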


[Figure: "A Codex agent loop in practice" — an iterative flow between Codex and the Responses API, with tool calls (rg, sed, apply_patch, pytest) and results exchanged until the final message: "The bug has been fixed."]

...

