Efficient and Reliable Compute Cluster Management at Scale

Uber runs on a containerized microservice architecture. As the business has grown, our need for compute resources has grown significantly with it, so improving the efficiency of those resources is now an important goal. Broadly speaking, efficiency work in compute cluster management means scheduling more workloads on the same number of machines. It rests on the observation that the average CPU utilization of a typical cluster is far lower than the CPU that has been allocated to it. Our approach is to overcommit CPU resources without compromising the reliability of the platform, which we achieve by maintaining a safe headroom at all times. A complementary approach, which we also use, is to reduce the allocations of overprovisioned services. The benefit of overcommitment is that it frees up machines to run non-critical, preemptible workloads without purchasing extra machines.
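To make the idea of overcommitting with a safety headroom concrete, here is a minimal sketch in Python. It is not Uber's implementation: the `Host` type, the `overcommit_ratio` and `headroom_fraction` knobs, and the numbers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Host:
    cores: float      # physical CPU cores on the machine
    allocated: float  # cores already promised to containers
    peak_used: float  # observed peak CPU usage, in cores


def schedulable_cores(host: Host, overcommit_ratio: float, headroom_fraction: float) -> float:
    """Cores that may still be allocated on this host (hypothetical policy).

    Two limits apply, and the tighter one wins:
      * total allocations may not exceed cores * overcommit_ratio;
      * observed usage must leave headroom_fraction of the cores free,
        conservatively assuming new workloads use their full allocation.
    """
    allocation_room = host.cores * overcommit_ratio - host.allocated
    usage_room = host.cores * (1.0 - headroom_fraction) - host.peak_used
    return max(0.0, min(allocation_room, usage_room))


# A fully allocated 32-core host whose containers peak at only 12 cores.
host = Host(cores=32, allocated=32, peak_used=12)
print(schedulable_cores(host, overcommit_ratio=1.0, headroom_fraction=0.25))  # 0.0
print(schedulable_cores(host, overcommit_ratio=1.5, headroom_fraction=0.25))  # 12.0
```

Without overcommitment the host looks full even though most of its CPU is idle. With an assumed ratio of 1.5 and 25% headroom, 12 more cores become schedulable, and the headroom limit (not the ratio) is what caps the number.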

To achieve this, we need a system that provides a real-time view of CPU utilization for all hosts and all containers across all clusters. This system, internally called cQoS (Container Quality of Service), runs in production across all of our clusters. cQoS enables the scheduler to make telemetry-aware scheduling decisions, such as load-aware placement of tasks, proactive elimination of hotspots in the cluster, and load-aware scaling of the cluster size. Besides improving resource utilization, such a system also helps with container performance analysis. The per-container metrics help with identifying performance issues related to uneven load balancing and container right...
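As a rough illustration of load-aware placement and hotspot detection driven by per-host utilization, consider the sketch below. It does not reflect the cQoS API: the function names, the utilization map, and the 0.75 threshold are hypothetical.

```python
from typing import Dict, List, Optional


def pick_host(utilization: Dict[str, float], task_load: float,
              hotspot_threshold: float = 0.75) -> Optional[str]:
    """Return the least-loaded host that stays at or under the threshold
    after the task is placed, or None if no host qualifies."""
    candidates = [
        (load, name)
        for name, load in utilization.items()
        if load + task_load <= hotspot_threshold
    ]
    return min(candidates)[1] if candidates else None


def hotspots(utilization: Dict[str, float],
             hotspot_threshold: float = 0.75) -> List[str]:
    """Hosts whose current utilization already exceeds the threshold."""
    return sorted(name for name, load in utilization.items()
                  if load > hotspot_threshold)


# Fraction of CPU in use per host, as a telemetry system might report it.
utilization = {"host-a": 0.82, "host-b": 0.40, "host-c": 0.55}
print(pick_host(utilization, task_load=0.10))  # host-b
print(hotspots(utilization))                   # ['host-a']
```

A scheduler that looked only at allocations could treat all three hosts as equivalent. With utilization data it can steer the new task to host-b and flag host-a for rebalancing.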
