How Uber Ensures Apache Cassandra®'s Tolerance for Single-Zone Failure
For more than six years, Uber has run open-source Apache Cassandra® as a database service that powers a variety of mission-critical online transaction processing (OLTP) workloads at Uber scale, serving millions of queries per second over petabytes of data. Because Uber operates data centers in multiple zones across multiple regions, a Cassandra cluster at Uber typically has its nodes spread across multiple zones and regions. High availability is essential to Uber’s business, so we want Cassandra’s availability to be unaffected when a single zone goes down. This blog shows how we ensured single-zone failure tolerance for Cassandra, and in particular how we converted the large Cassandra fleet, in real time and with zero downtime, from non-zone-failure-tolerant to single-zone-failure-tolerant.
SZFT: Single Zone Failure Tolerant
Cassandra natively supports multiple copies (replicas) of data. One of the biggest benefits of having multiple copies is high availability: if a minority of copies becomes unavailable, the majority can still be accessed. When a Cassandra cluster is deployed across multiple availability zones, we would ideally like all the copies distributed evenly among the zones so that the loss of one zone does not affect user requests.
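To make the majority argument concrete, here is a minimal Python sketch of the check it implies. It is an illustration only, not Uber's tooling or a Cassandra API: the function names, zone names, and replica placements are hypothetical. It assumes a record stays available as long as a strict majority of its copies is reachable, and asks whether that still holds after losing whichever zone holds the most copies.

```python
from collections import Counter


def quorum(replication_factor: int) -> int:
    """Smallest number of copies that forms a strict majority."""
    return replication_factor // 2 + 1


def survives_single_zone_failure(replica_zones: list[str]) -> bool:
    """Return True if a majority of copies remains after losing any one zone.

    replica_zones lists the zone of each copy of one record (hypothetical input).
    """
    rf = len(replica_zones)
    copies_per_zone = Counter(replica_zones)
    # The worst case is losing the zone that holds the most copies.
    worst_case_loss = max(copies_per_zone.values())
    return rf - worst_case_loss >= quorum(rf)


# Hypothetical placements with a replication factor of 3.
placements = {
    "record_1": ["zone1", "zone1", "zone2"],  # two copies share a zone
    "record_2": ["zone1", "zone2", "zone3"],  # one copy per zone
}

results = {
    name: survives_single_zone_failure(zones)
    for name, zones in placements.items()
}
print(results)
```

In this sketch, the placement with two copies in the same zone fails the check, while the placement with one copy per zone passes, which is why an even spread of copies across zones is the goal.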
Figure 1: Single Zone Failure and Availability Impact.
Figure 1 illustrates the problem. In this example, the replication factor is 3. A data record is considered available when the majority of its copies are available. When zone 1 is down, data record 1 becomes unavailable because it loses the majorit...