Improving MySQL® Cluster Uptime: Designing Advanced Detection, Mitigation, and Consensus with Group Replication
At Uber, engineers rely on MySQL® for applications that need relational databases. MySQL is the preferred choice for use cases that require ACID transactions and relational data modeling with a SQL interface. We support over 2,600 MySQL clusters.
MySQL clusters follow a single-primary, multiple-replica topology. Replication is asynchronous by default: the replica nodes pull the binlogs from the primary server. Only the primary node serves write requests, while read requests are served from replicas in a round-robin fashion with region affinity. The number of replicas required in a cluster depends on the read volume that the cluster serves.
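The routing rule described above can be illustrated with a minimal sketch. This is not Uber's implementation: the `Node` and `ClusterRouter` names, the region labels, and the fallback to all replicas when no replica exists in the client's region are assumptions made for illustration only.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    host: str
    region: str


class ClusterRouter:
    """Hypothetical router: writes go to the primary, reads rotate over replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)
        if not self.replicas:
            raise ValueError("this sketch expects at least one replica")
        # One independent round-robin cursor per client region.
        self._cursors = {}

    def route_write(self):
        # Only the primary serves writes.
        return self.primary

    def route_read(self, client_region):
        if client_region not in self._cursors:
            # Region affinity: prefer replicas in the caller's region.
            local = [r for r in self.replicas if r.region == client_region]
            # Assumption: with no local replica, rotate over all replicas.
            self._cursors[client_region] = itertools.cycle(local or self.replicas)
        return next(self._cursors[client_region])
```

For example, with replicas `r1` and `r2` in one region and `r3` in another, reads from the first region alternate between `r1` and `r2`, reads from the second always land on `r3`, and every write goes to the primary.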

Figure 1: MySQL cluster at Uber.
In the world of online services, a database going down isn’t just a minor issue. It can lead to service disruptions, unhappy customers, and lost revenue. While our existing systems have served us well, we’ve identified key areas for improvement, particularly when it comes to keeping our databases online and available.
At Uber, HA (High Availability) isn’t just a goal; it’s the very foundation of our systems. Maintaining a single, healthy MySQL leader node is critical for seamless operation. Between 2020 and 2023, our systems failed to detect and mitigate numerous incidents, reported over a Slack support channel, within the promised time frame, leaving services disrupted. This translated to minutes of unnecessary downtime and lost revenue.
Our MySQL clusters relied on a setup with a single primary node and multiple read replicas. When the pri...