Vertical CPU Scaling: Reduce Cost of Capacity and Increase Reliability

This blog post describes the implementation of an automated vertical CPU scaling system that allocates the ideal number of cores to every storage workload running at Uber. The framework is used today to right-size more than 500,000 Docker containers, and since its inception it has reduced allocations by a net total of more than 120,000 cores, leading to multi-million dollar annual savings in infrastructure spending.

At Uber, we run all storage workloads such as Docstore, Schemaless, M3, MySQL®, Cassandra®, Elasticsearch®, etcd®, Clickhouse®, and Grail in a containerized environment. In total, we run more than 1,000,000 storage containers on close to 75,000 hosts with more than 2.5 million CPU cores. To reduce the risk of noisy neighbors, every workload is allocated an isolated set of CPU cores [ref], and hosts are not overprovisioned. We run a multi-region replicated setup where traffic can be drained from an entire region as part of incident response.
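To make the allocation model concrete, here is a minimal sketch of exclusive core assignment on a host that refuses to overprovision. It is an illustration only, not Uber's implementation: the `Host` class, its `allocate` method, and the lowest-index-first selection policy are all hypothetical names and choices made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical model of a host whose cores are handed out exclusively."""

    total_cores: int
    # Maps a workload name to the set of core IDs dedicated to it.
    assignments: dict[str, set[int]] = field(default_factory=dict)

    def free_cores(self) -> list[int]:
        """Return the IDs of cores not yet dedicated to any workload, in order."""
        used = set().union(*self.assignments.values())
        return [core for core in range(self.total_cores) if core not in used]

    def allocate(self, workload: str, cores: int) -> set[int]:
        """Dedicate `cores` cores to `workload`, never sharing or overcommitting."""
        if workload in self.assignments:
            raise ValueError(f"{workload} already has cores on this host")
        free = self.free_cores()
        if cores > len(free):
            # Assumed policy: reject the request rather than overprovision.
            raise RuntimeError("not enough free cores on this host")
        chosen = set(free[:cores])
        self.assignments[workload] = chosen
        return chosen


host = Host(total_cores=8)
mysql_cores = host.allocate("mysql", 3)
cassandra_cores = host.allocate("cassandra", 4)
```

With an 8-core host, the two calls above leave a single free core, so a later request for two more cores raises `RuntimeError` instead of sharing cores between workloads. Under this model, every core assigned beyond what a workload needs is capacity no other workload can use, which is why the per-container core count matters.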

Figure 1: Key metrics for Uber's stateful management platform.

One major challenge is assigning the right number of CPU cores to every container. Until recently, the core count for each container was determined manually by the engineers responsible for each storage technology. The advantage of this approach is that domain experts monitor their own technologies and make the right decisions for them. The disadvantage is that humans need to do this work, and it often becomes a reactive scaling strategy in which settings are changed only once they cause cost or reliability issues, instead of a proactive approach where containers are ver...
