Scaling Beyond Limits: Harnessing Route Server for a Stable Cluster

At Zalando, we faced a critical challenge: our ingress controller was threatening to overload our Kubernetes cluster. We needed a solution that could handle the increasing traffic and scale efficiently. This is the story of how we implemented a Route Server to manage control plane traffic more effectively and ensure a stable cluster.

Skipper: Our Ingress Controller

We use Skipper, our HTTP reverse proxy for service composition, to implement both the control plane and the data plane of Kubernetes Ingress and RouteGroups. Creating an Ingress or RouteGroup results in three things: an AWS load balancer (LB) with TLS termination that targets Skipper, set up via kube-ingress-aws-controller; HTTP routes in Skipper; and a DNS name pointing to the LB, set up via external-dns.
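
The fan-out from one Ingress or RouteGroup to three artifacts can be sketched roughly as below. This is an illustrative model only, not how Skipper or the controllers are implemented: `IngressRule`, `derived_artifacts`, the host names, and the backend address are all hypothetical, and the route string is a simplified rendering in Skipper's eskip notation (`Host` and `PathSubtree` predicates) rather than the exact route the ingress data client would generate.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical, minimal stand-in for one Ingress or RouteGroup rule."""
    name: str
    host: str
    path: str
    backend_url: str


def derived_artifacts(rule: IngressRule, lb_dns_name: str) -> dict[str, str]:
    """Describe the three things that exist after the rule is created."""
    # eskip route IDs may only contain letters, digits and underscores.
    route_id = re.sub(r"[^A-Za-z0-9_]", "_", f"{rule.name}_{rule.host}")
    # The Host predicate takes a regular expression, so escape the dots.
    host_regex = re.escape(rule.host)
    route = (
        f'{route_id}: Host(/^{host_regex}$/) && PathSubtree("{rule.path}") '
        f'-> "{rule.backend_url}";'
    )
    return {
        # Set up via kube-ingress-aws-controller.
        "load_balancer": f"{lb_dns_name}: TLS termination, targets Skipper",
        # Served by Skipper's data plane.
        "skipper_route": route,
        # Set up via external-dns.
        "dns_record": f"{rule.host} -> {lb_dns_name}",
    }


rule = IngressRule(
    name="checkout",
    host="shop.example.org",
    path="/checkout",
    backend_url="http://10.0.0.1:8080",
)
artifacts = derived_artifacts(rule, lb_dns_name="lb-1234.example.invalid")
for kind, description in artifacts.items():
    print(f"{kind}: {description}")
```

The point of the sketch is that a single user-facing object touches three separate systems, each owned by a different controller.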

Ingress Stack

To put the deployment in context, this is the scale we operate at:

  • 15,000 Ingresses and 5,000 RouteGroups.
  • Traffic of up to 2,000,000 requests per second.
  • 80-90% of our traffic is authenticated service-to-service calls, with daily numbers between 500,000 and 1,000,000 rps in total across our service fleet.
  • 200 Kubernetes clusters.

The Challenge

Scaling Pain Points

Skipper instances were fetching Ingresses and RouteGroups from the Kubernetes API, which worked well initially. But the rapid growth in Skipper instances, reaching approximately 180 per cluster, began to overwhelm our etcd infrastructure.
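
A back-of-the-envelope calculation shows why this pattern stops scaling. The sketch below assumes every Skipper instance polls the API server independently for each resource kind; the 3-second poll interval is an illustrative assumption rather than a figure from our setup, and `list_requests_per_second` is a hypothetical helper.

```python
def list_requests_per_second(
    instances: int, resource_kinds: int, poll_interval_s: float
) -> float:
    """Steady-state list requests per second hitting the API server."""
    return instances * resource_kinds / poll_interval_s


# 180 instances per cluster, each fetching Ingresses and RouteGroups.
per_cluster = list_requests_per_second(
    instances=180,
    resource_kinds=2,
    poll_interval_s=3.0,  # assumed value, for illustration only
)
print(f"{per_cluster:.0f} list requests/s per cluster")
```

Under these assumptions the request rate grows linearly with the number of Skipper instances, and each request returns the full set of objects, so both the instance count and the object count add to the load on the API server and, behind it, on etcd.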

This overlo...
