Building Zone Failure Resilience in Apache Pinot™ at Uber
Zone failure resilience (ZFR) is critical for modern distributed systems, especially real-time analytics platforms like Apache Pinot™, which powers many Tier-0 use cases at Uber. As part of our regional resilience initiative, Pinot must withstand zone failures without impacting queries or ingestion. This blog details how we achieved zone failure resilience in Pinot by leveraging its instance assignment capabilities and integrating with Uber’s in-house isolation group concept, and how this in turn accelerated our release processes.
Initially, our Pinot clusters at Uber relied on two key strategies: tag-based instance assignment, which groups servers by tenant to ensure logical isolation, and balanced segment assignment, which spreads data segments evenly across those servers. This balanced data well within a tenant, but it did not inherently guarantee that the data was distributed across different physical zones. If all instances assigned to a table, or all replicas of a segment, happened to be in a single zone, a failure in that zone would lead to significant service disruption.
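To make the risk concrete, here is a minimal Python sketch that flags segments whose replicas all sit in one zone. It is an illustration only: the server names, zone names, and the `segments_at_risk` helper are hypothetical and are not part of Pinot or of Uber's tooling.

```python
def segments_at_risk(segment_replicas, server_zone):
    """Return {segment: zone} for segments whose replicas all live in one zone."""
    at_risk = {}
    for segment, servers in segment_replicas.items():
        zones = {server_zone[server] for server in servers}
        if len(zones) == 1:
            # Losing this single zone takes every replica of the segment offline.
            at_risk[segment] = next(iter(zones))
    return at_risk


# Hypothetical layout: balanced across servers, but not zone-aware.
server_zone = {"s1": "zone-a", "s2": "zone-a", "s3": "zone-b", "s4": "zone-c"}
segment_replicas = {
    "seg_0": ["s1", "s2"],  # both replicas in zone-a
    "seg_1": ["s2", "s3"],
    "seg_2": ["s3", "s4"],
}

print(segments_at_risk(segment_replicas, server_zone))
```

In this made-up layout every server holds a similar number of segments, yet `seg_0` would become unavailable if `zone-a` failed.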
Pool-based instance assignment allows us to organize servers into distinct pools. This strategy is primarily designed to accelerate no-downtime rolling restarts for large shared clusters by ensuring different replica groups are assigned to different server pools.
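The idea of pairing replica groups with pools can be sketched in a few lines of Python. This is a simplified model, not Pinot's actual assignment code: it assumes one replica group per pool, and the function names, server names, and pool IDs are all hypothetical.

```python
def assign_replica_groups(server_pool, num_replica_groups):
    """Give each replica group the servers of exactly one pool (simplifying assumption)."""
    pools = {}
    for server, pool in sorted(server_pool.items()):
        pools.setdefault(pool, []).append(server)
    pool_ids = sorted(pools)
    if num_replica_groups > len(pool_ids):
        raise ValueError("this sketch needs at least one pool per replica group")
    return {rg: pools[pool_ids[rg]] for rg in range(num_replica_groups)}


def assign_segments(segments, replica_groups):
    """Place one replica of every segment in each replica group, round-robin within the group."""
    assignment = {}
    for i, segment in enumerate(segments):
        assignment[segment] = [
            servers[i % len(servers)]
            for _, servers in sorted(replica_groups.items())
        ]
    return assignment


# Hypothetical cluster: six servers spread over three pools.
server_pool = {"s1": 0, "s2": 0, "s3": 1, "s4": 1, "s5": 2, "s6": 2}
replica_groups = assign_replica_groups(server_pool, num_replica_groups=3)
assignment = assign_segments(["seg_0", "seg_1", "seg_2"], replica_groups)

for segment, servers in assignment.items():
    print(segment, servers, sorted({server_pool[s] for s in servers}))
```

Because each segment keeps exactly one replica per pool in this model, all servers in a single pool can be restarted together while the other pools continue to hold a full copy of the data.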
When combined with replica-group segment assignment, replicas of data segments are distributed across these defined pools (which correspond to zones). This means that even if one ...