How OpenAI Delivers Low-Latency Voice AI at Scale
May 4, 2026
By Yi Zhang and William McDonald, Members of Technical Staff
Voice AI only feels natural if conversation moves at the speed of speech. When the network gets in the way, people hear it immediately as awkward pauses, clipped interruptions, or delayed barge-in. That matters for ChatGPT voice, for developers building with the Realtime API, for agents working in interactive workflows, and for models that need to process audio while a user is still talking.
At OpenAI’s scale, that translates into three concrete requirements:
- Global reach for more than 900 million weekly active users
- Fast connection setup so a user can start speaking as soon as a session begins
- Low and stable media round-trip time, with low jitter and packet loss, so turn-taking feels crisp
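The jitter requirement above can be made concrete. WebRTC media stacks commonly track the interarrival jitter estimate defined in RFC 3550 §6.4.1: a running, exponentially smoothed estimate of how much packet transit time varies. The sketch below is illustrative of that standard formula, not of OpenAI's implementation; class and method names are our own.

```python
class JitterEstimator:
    """RFC 3550 interarrival jitter: a smoothed estimate of the variance
    in packet transit time, measured in RTP timestamp units."""

    def __init__(self):
        self.jitter = 0.0
        self.prev_transit = None

    def on_packet(self, rtp_timestamp: int, arrival_timestamp: int) -> float:
        # Transit time = arrival time (expressed in RTP clock units) minus
        # the sender's RTP timestamp. The unknown absolute clock offset
        # between sender and receiver cancels out in the difference below.
        transit = arrival_timestamp - rtp_timestamp
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)
            # Exponential smoothing with gain 1/16, per RFC 3550 section 6.4.1.
            self.jitter += (d - self.jitter) / 16.0
        self.prev_transit = transit
        return self.jitter
```

With Opus over a 48 kHz RTP clock, a jitter value of 480 units corresponds to 10 ms; keeping this number low and stable is what makes barge-in and turn-taking feel immediate.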
Our team at OpenAI, responsible for real-time AI interactions, recently rearchitected OpenAI's WebRTC stack to address three constraints that had started to collide at scale: one-port-per-session media termination does not map well onto OpenAI's infrastructure, stateful ICE (Interactive Connectivity Establishment) and DTLS (Datagram Transport Layer Security) sessions need stable ownership, and global routing has to keep first-hop latency low. In this post, we walk through the split relay-plus-transceiver architecture we built, which preserves standard WebRTC behavior for clients while changing how packets are routed inside OpenAI's infrastructure.
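To make the first constraint concrete: the standard way a WebRTC server multiplexes many sessions onto one UDP port is to key the first inbound STUN binding request on its USERNAME attribute, which carries the session's ICE ufrag, and then pin that client's 5-tuple to the owning backend. The post does not describe OpenAI's internals, so the following is only a minimal sketch of the general technique under assumed names (`Relay`, `transceiver` identifiers are illustrative).

```python
import struct

STUN_MAGIC_COOKIE = 0x2112A442
ATTR_USERNAME = 0x0006


def is_stun(packet: bytes) -> bool:
    # STUN messages start with two zero bits and carry the magic cookie
    # at bytes 4..8 (RFC 5389); DTLS and RTP packets fail this check.
    return (len(packet) >= 20
            and packet[0] < 0x40
            and struct.unpack_from("!I", packet, 4)[0] == STUN_MAGIC_COOKIE)


def stun_username(packet: bytes):
    """Return the USERNAME attribute of a STUN message, or None."""
    msg_len = struct.unpack_from("!H", packet, 2)[0]
    offset, end = 20, 20 + msg_len
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack_from("!HH", packet, offset)
        if attr_type == ATTR_USERNAME:
            return packet[offset + 4: offset + 4 + attr_len].decode()
        # Attributes are padded to a 4-byte boundary.
        offset += 4 + ((attr_len + 3) & ~3)
    return None


class Relay:
    """Demultiplexes many ICE sessions arriving on one shared UDP port.

    The first STUN binding request from a client carries
    'local_ufrag:remote_ufrag' in USERNAME; we pin the client's address
    to whichever transceiver owns that local ufrag. Later DTLS/RTP
    packets from the same address reuse the pinned mapping.
    """

    def __init__(self, sessions):
        self.sessions = sessions   # local ufrag -> transceiver id
        self.by_addr = {}          # client address -> transceiver id

    def route(self, packet: bytes, addr):
        if is_stun(packet):
            username = stun_username(packet)
            if username:
                local_ufrag = username.split(":", 1)[0]
                self.by_addr[addr] = self.sessions.get(local_ufrag)
        return self.by_addr.get(addr)
```

A design like this is what removes the one-port-per-session constraint: the relay needs only per-address routing state, while the stateful ICE and DTLS handshakes stay pinned to a single stable owner.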