The Accounter: Scaling Operational Throughput on Uber’s Stateful Platform

In a previous post, we introduced Uber’s stateful platform, Odin. We discussed how the platform’s scale and the growing need for fleet-wide operations required better coordination among its many remediation loops. Without centralized coordination, multiple conflicting operations could compromise storage clusters, leading to availability or durability issues. Figure 1 shows one such case, in which uncoordinated remediation loops operating on the same quorum-based storage cluster cause it to lose availability. This post explores how we overcame this problem and scaled Odin’s throughput by introducing global coordination of all operations.

Figure 1: Example of conflicting operations resulting in cluster availability loss.

Operations on Odin are implemented using Cadence workflows. When an actor, whether human or automated, wants to operate on one of the managed storage clusters, it does so through workflows. A workflow consists of actions, like changes to the system state, and waiting periods, like waiting for the system to converge, that collectively orchestrate the transition of the system from one state to another. Workflow executions can take anywhere from seconds, such as upgrading container images, to hours, such as migrating workloads between hosts (Uber’s fleet uses locally attached disks). We’ll refer to these workflows as operations from this point forward.
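
To make the "actions plus waiting periods" structure concrete, here is a minimal sketch in plain Python. It does not use the Cadence client API, and every name in it (Action, WaitFor, run_operation, the toy image-upgrade steps) is a hypothetical chosen for illustration rather than something taken from Odin.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

State = Dict[str, str]  # toy stand-in for the system state


@dataclass
class Action:
    """A step that changes the system state."""
    name: str
    apply: Callable[[State], None]


@dataclass
class WaitFor:
    """A step that polls until the system has converged."""
    name: str
    converged: Callable[[State], bool]
    poll_seconds: float = 0.0
    max_polls: int = 10


Step = Union[Action, WaitFor]


def run_operation(steps: List[Step], state: State) -> List[str]:
    """Run the steps in order and return a log of what was executed."""
    log: List[str] = []
    for step in steps:
        if isinstance(step, Action):
            step.apply(state)
            log.append(f"action:{step.name}")
        else:
            for _ in range(step.max_polls):
                if step.converged(state):
                    break
                time.sleep(step.poll_seconds)
            else:
                # The loop finished without the condition ever holding.
                raise TimeoutError(f"{step.name} did not converge")
            log.append(f"wait:{step.name}")
    return log


# Hypothetical "upgrade container image" operation on a toy state.
def set_desired_image(state: State) -> None:
    state["desired_image"] = "v2"


def restart_container(state: State) -> None:
    state["running_image"] = state["desired_image"]


def image_converged(state: State) -> bool:
    return state.get("running_image") == state.get("desired_image")


upgrade_steps: List[Step] = [
    Action("set_desired_image", set_desired_image),
    Action("restart_container", restart_container),
    WaitFor("image_converged", image_converged),
]
```

Calling run_operation(upgrade_steps, {"running_image": "v1"}) walks the toy state from one image to the next. A real workflow engine would additionally persist progress so that an hours-long operation survives worker restarts, which this sketch does not attempt.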

We needed a mechanism to gate the initiation of new operations or, to put it another way, answer the question: Given the current circumstances, is it safe to proceed with this operation on this cluster?
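
As a rough illustration of what such a gate might check for a quorum-based cluster, consider the sketch below. It is not the Accounter's actual logic, which this excerpt does not describe. The names ClusterView, quorum_size, and is_safe_to_proceed are hypothetical, and the rule it applies (keep a strict majority of replicas available) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterView:
    """Hypothetical snapshot of a quorum-based storage cluster."""
    replicas: int     # total replicas in the quorum group
    unavailable: int  # replicas already down or being operated on


def quorum_size(replicas: int) -> int:
    """Smallest strict majority of the replica set."""
    return replicas // 2 + 1


def is_safe_to_proceed(cluster: ClusterView, nodes_affected: int) -> bool:
    """Allow the operation only if a majority stays available afterwards."""
    remaining = cluster.replicas - cluster.unavailable - nodes_affected
    return remaining >= quorum_size(cluster.replicas)
```

With three replicas, a first operation that takes down one node passes the check, while a second concurrent one is rejected because only a single replica would remain. This is the kind of conflict Figure 1 depicts.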

Our design requirements were as follows:
