Client-Side Load Balancing at a Million Requests per Second

Our busiest API ran its high-volume internal traffic through the cluster's shared edge ingress load balancer. For years, we could never be sure whether a latency spike came from our own code or from the shared edge router we were reusing for internal traffic.
In a previous post, we described how we built Zalando's Product Read API (PRAPI), which serves millions of requests per second with single-digit-millisecond latency across 25 European markets. Every product page, search result, and checkout depends on it. Even a brief degradation has a measurable impact on sales, so its performance and availability requirements are high. The low latency is achieved through consistent-hash routing: Skipper, the cluster's edge load balancer, routes requests for the same product ID to the same pod(s), which lets the underlying application make good use of its pod-local caches. In short, the routing infrastructure for this API matters.
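To make the idea of consistent-hash routing concrete, here is a minimal sketch of a hash ring with virtual nodes. It is an illustration only, not Skipper's actual algorithm: the `HashRing` class, the `vnodes` parameter, the SHA-256 hash, and the pod and product names are all assumptions made for this example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative, hypothetical)."""

    def __init__(self, pods, vnodes=100):
        # Each pod is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{pod}#{i}"), pod) for pod in pods for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pod_for(self, product_id: str) -> str:
        """Return the pod owning the first ring point at or after the key's hash."""
        idx = bisect.bisect_left(self._points, _hash(product_id))
        # Wrap around to the first point when the key hashes past the last one.
        return self._ring[idx % len(self._ring)][1]


pods = ["pod-a", "pod-b", "pod-c"]  # hypothetical pod names
ring = HashRing(pods)

# The same product ID always lands on the same pod, so its cache entry is reused.
assert ring.pod_for("sku-123") == ring.pod_for("sku-123")

# Removing a pod only remaps the keys that pod owned; all other keys stay put.
smaller = HashRing(["pod-a", "pod-b"])
keys = [f"sku-{n}" for n in range(1000)]
moved = [
    k for k in keys
    if ring.pod_for(k) != "pod-c" and ring.pod_for(k) != smaller.pod_for(k)
]
assert moved == []
```

The property that matters for pod-local caches is the second one: when the set of pods changes, most product IDs keep their pod, so most cached entries remain useful.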
At launch, Skipper handled both edge routing and the internal traffic between our batching and single-get components. I had always intended client-side load balancing (CSLB) to replace the latter, and I had hoped it would be a fast-follow. But Skipper was fast, adding only a couple of hundred microseconds to each request, and the team was understandably reluctant to introduce significant change to a working system. Over the years, as incidents accumulated where the root cause was never quite clear (Skipper, or PRAPI?), the structural problem became harder to ignore. A single batch-of-100 request passed through Skipper once per item on its way to the single-get component, so PRAPI had a 100x exposure to Skipper. When Skipper sneezed, PRAPI got the flu.
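A rough way to see why that 100x exposure hurts is to ask how often a batch is affected when any single hop is slow. The sketch below assumes each of the 100 internal calls is slowed independently with the same probability; both that independence and the example probabilities are assumptions for illustration, not measurements from PRAPI.

```python
def p_batch_affected(p_single: float, fan_out: int) -> float:
    """Probability that at least one of `fan_out` independent hops is slow."""
    return 1.0 - (1.0 - p_single) ** fan_out


# Hypothetical per-hop slow rates, chosen only to show the amplification.
for p in (0.0001, 0.001, 0.01):
    print(f"per-hop slow rate {p:.2%} -> batch-of-100 affected {p_batch_affected(p, 100):.2%}")
```

Under these assumptions, a hop that is slow 1% of the time affects roughly 63% of batch-of-100 requests, which is why a small hiccup in the shared router shows up as a large one in the batch API.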