Upgrading Uber’s MySQL Fleet to version 8.0

At Uber, our MySQL fleet is the backbone of our data infrastructure, supporting a vast array of operations critical to our platform. Starting in 2023, we embarked on a significant journey to upgrade our MySQL fleet to the latest version, MySQL v8.0.

In this blog post, we delve into the motivations, challenges, and solutions involved in this monumental upgrade process, and how we completed the upgrade without impacting our Service Level Objectives (SLOs).

Several compelling factors drove our decision to transition from MySQL v5.7 to v8.0:

Addressing End-of-Life Concerns: As MySQL v5.7 reached its extended support end date, continuing to use it exposed us to potential security vulnerabilities and a lack of ongoing bug fixes. This posed a significant risk to the stability and integrity of our data.

Boosting Performance and Concurrency: MySQL v8.0 promised substantial performance enhancements. Optimizations in indexing and resource utilization translate to faster query execution and improved concurrency handling, and in turn to a smoother experience for our customers.

Unlocking New Functionality: Beyond performance improvements, v8.0 introduced valuable features like support for window functions, enhanced JSON handling, and better spatial data capabilities. These features open new avenues for data manipulation and analysis within our platform.

Password Rotation: The introduction of “Dual passwords” in v8.0 allows for smoother password rotations during security incidents, minimizing service disruptions.
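As a rough illustration of how a dual-password rotation could be scripted, the sketch below builds the two statements involved: one that sets a new primary password while retaining the current one as secondary, and one that discards the old password once every client has switched. The `RETAIN CURRENT PASSWORD` and `DISCARD OLD PASSWORD` clauses are MySQL 8.0 syntax (8.0.14 and later); the helper names, account, and password are hypothetical and not taken from Uber's tooling.

```python
from typing import List


def sql_quote(value: str) -> str:
    """Quote a string literal for MySQL (escape backslashes, double single quotes)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def dual_password_rotation(user: str, host: str, new_password: str) -> List[str]:
    """Return the two statements of a dual-password rotation (hypothetical helper).

    Statement 1 makes new_password the primary password and keeps the current
    password valid as a secondary one, so existing clients keep connecting.
    Statement 2 is meant to be run later, once all clients use the new password.
    """
    account = f"{sql_quote(user)}@{sql_quote(host)}"
    return [
        f"ALTER USER {account} IDENTIFIED BY {sql_quote(new_password)} "
        "RETAIN CURRENT PASSWORD;",
        f"ALTER USER {account} DISCARD OLD PASSWORD;",
    ]


if __name__ == "__main__":
    # Illustrative account and password only.
    for statement in dual_password_rotation("app_rw", "%", "example-new-secret"):
        print(statement)
```

Between the two statements both passwords are accepted, which is what allows clients to be updated gradually instead of all at once.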

Streamlining Operational Efficiency: Managing schema changes is an ongoing task. The [Instant ADD Column feature](https://dev.mysql.com/blog-archive/mysql-8-0-instant-add-and-drop-columns/) in v8.0 significantly streamlines this process, reducing downtime during schema alterations and improving our overall operational efficiency.
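To make the instant column addition concrete, here is a minimal sketch that builds an `ALTER TABLE ... ADD COLUMN ... ALGORITHM=INSTANT` statement. The `ALGORITHM=INSTANT` clause is MySQL 8.0 syntax; the helper functions, table name, and column definition are hypothetical examples rather than Uber's schema-change tooling.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def instant_add_column(table: str, column: str, definition: str) -> str:
    """Build an ADD COLUMN statement that explicitly requests the INSTANT algorithm.

    `definition` is the column type and attributes, passed through unchanged.
    """
    return (
        f"ALTER TABLE {quote_identifier(table)} "
        f"ADD COLUMN {quote_identifier(column)} {definition}, ALGORITHM=INSTANT;"
    )


if __name__ == "__main__":
    # Hypothetical table and column, for illustration only.
    print(instant_add_column("trips", "surge_multiplier", "DECIMAL(4,2) NULL"))
```

Requesting the algorithm explicitly is a deliberate choice in this sketch: if the server cannot apply the change instantly, the statement fails with an error instead of silently falling back to a slower table rebuild.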

Before delving into the details of our MySQL upgrade journey, it’s essential to grasp the scale and complexity of Uber’s MySQL infrastructure:

Scale: Uber’s MySQL infrastructure comprises over 2,100 clusters, distributed across 19 production zones spanning three regions. With over 16,000 nodes, our infrastructure forms the backbone of Uber’s data storage and processing capabilities.

Data Volume and Query Load: Supporting multiple petabytes of data and serving approximately 3 million queries per second, our MySQL infrastructure handles a vast amount of data and traffic on a daily basis.

Clustered Architecture: Each MySQL cluster consolidates multiple MySQL processes running on individual nodes. While each node within a cluster contains identical data, the nodes are strategically distributed across different data centers to ensure data availability and support failover mechanisms.

Primary-Secondary Replication: Within each cluster, a primary node manages all write traffic, while secondary nodes replicate data asynchronously. This architecture ensures redundancy and fault tolerance, allowing for seamless failover in the event of primary node failure.

Upgrade Considerations: Notably, replication from a MySQL v5.7 primary to a MySQL v8.0 read replica is supported, but the reverse (a MySQL v8.0 primary replicating to a MySQL v5.7 read replica) is not. This distinction played a crucial role in our upgrade planning and execution strategy.
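The replication constraint above can be expressed as a small pre-flight check. The sketch below is a simplification under a stated assumption: it treats replication as supported whenever the replica's release series is the same as or newer than the primary's, which covers the two cases described here (v5.7 to v8.0 allowed, v8.0 to v5.7 not). The function names are hypothetical.

```python
from typing import Tuple


def parse_series(version: str) -> Tuple[int, int]:
    """Extract the release series, e.g. '8.0.36' -> (8, 0)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def replication_supported(primary_version: str, replica_version: str) -> bool:
    """Assumed rule: the replica must not run an older series than the primary."""
    return parse_series(replica_version) >= parse_series(primary_version)


if __name__ == "__main__":
    print(replication_supported("5.7.44", "8.0.36"))  # v5.7 primary, v8.0 replica
    print(replication_supported("8.0.36", "5.7.44"))  # v8.0 primary, v5.7 replica
```

A check like this is what makes the promotion of a v8.0 primary a one-way step: after promotion, no v5.7 node in the cluster passes it as a replica.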

The sheer scale of Uber’s MySQL infrastructure, with over 2,100 clusters and 16,000+ nodes spread across regions and zones, presented a significant challenge. Manual upgrades were simply not an option. To address this, we devised a comprehensive, multi-step upgrade strategy that could be executed efficiently and in a coordinated way across diverse environments.

Another key concern was minimizing downtime during the upgrade process. Maintaining Service Level Objectives (SLOs) and Service Level Agreements (SLAs) was paramount to ensure uninterrupted service for our users. We addressed this through meticulous planning of each step of the upgrade.

Compatibility with existing applications and services was another hurdle. Ensuring seamless integration with our existing ecosystem necessitated extensive testing, including thorough validation and regression checks.

To further enhance system reliability and minimize service disruptions, we implemented automated rollback mechanisms. These mechanisms could automatically revert upgrades in case of failures or compatibility issues.

Finally, minimizing manual intervention during the upgrade process was crucial. To streamline operations and reduce the risk of human error, we developed robust automated workflows. These workflows automated repetitive tasks, enabling seamless upgrades across thousands of clusters and nodes.

Overall, upgrading to v8.0 seemed like a huge win for everyone at Uber, as it promised a security boost, performance leap, and exciting new features. But manually tackling this across thousands of clusters? No, thanks! We needed a smarter solution: one that scaled. Enter our custom-built automation system, designed to guide each cluster meticulously through the multi-step upgrade process, all without a single human touch.

When we were considering an upgrade of our MySQL clusters from version 5.7 to 8.0, there were two possible approaches that we could have taken:

In a side-by-side upgrade, the new version of MySQL (in this case, v8.0) is installed alongside the existing version (v5.7). This approach involves setting up a separate server where the new version is deployed and configured. Once the new server is ready, traffic is gradually redirected to the new version, allowing for a smooth transition.

An in-place upgrade involves directly upgrading the existing MySQL installation to the new version (v8.0) without setting up a separate environment. This process typically requires stopping the MySQL service, performing the upgrade, and then restarting the service. In-place upgrades are simpler in terms of setup, but may involve longer downtime compared to side-by-side upgrades. Additionally, there is less room for rollback in case of unexpected issues during the upgrade process.

After careful consideration and thorough evaluation of the advantages and disadvantages, we made the decision to opt for a side-by-side upgrade approach from v5.7 nodes to v8.0, rather than pursuing an in-place upgrade. This choice was made in anticipation of the following benefits:

  1. Minimal downtime: With a side-by-side upgrade, we can keep the old MySQL 5.7 nodes running while we set up the new MySQL 8.0 nodes. This means we can gradually migrate the applications to the new nodes without any significant downtime.
  2. Reduced risk: Since the old MySQL 5.7 nodes remain operational, we can roll back to them if there are any issues with the new MySQL 8.0 nodes. This reduces the risk of performance degradation, data loss, or other issues during the upgrade (protection against data loss holds only until the maintenance phase of the upgrade process).
  3. Better testing: By running the new MySQL 8.0 nodes alongside the old MySQL 5.7 nodes, we can test the new nodes with production read-only application load before making the switch. This can help us identify any issues and ensure that everything works as expected before we complete the migration.

Figure 1: Side-by-side upgrade of MySQL cluster.

To address these challenges, we developed a system designed to completely automate the transition of a MySQL cluster from v5.7 to v8.0. Our automated alerting and monitoring system actively oversees the process to ensure a seamless transition and promptly alerts us to any issues that arise.

A high-level overview of the upgrade process includes:

  1. Node Replication: For each MySQL v5.7 node in the cluster, a corresponding MySQL v8.0 replica node is added in the same region/zone, maintaining the distribution consistency between v5.7 and v8.0 nodes.
  2. Soak Period: A monitoring period of approximately one week allows us to observe the system’s performance and detect any degradation or SLA breaches caused by the newer version nodes.
  3. Traffic Diversion: Once the soak period concludes, MySQL v5.7 replica nodes are disabled to divert traffic away from them.
  4. Primary Node Promotion: A MySQL v8.0 node is promoted to primary status for the cluster.
  5. Removal of Old Nodes: Finally, all MySQL v5.7 nodes are removed, completing the upgrade to MySQL v8.0.
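The five steps above can be pictured as a small state machine. The sketch below is only an illustration of that ordering, not Uber's automation system: the step names, the `UpgradeHalted` exception, and the single `healthy` flag are hypothetical, and the seven-day soak mirrors the "approximately one week" mentioned above.

```python
from enum import Enum


class Step(Enum):
    ADD_V8_REPLICAS = 1
    SOAK = 2
    DISABLE_V57_REPLICAS = 3
    PROMOTE_V8_PRIMARY = 4
    REMOVE_V57_NODES = 5
    DONE = 6


SOAK_DAYS = 7  # assumption: "approximately one week" modeled as seven days


class UpgradeHalted(RuntimeError):
    """Raised when monitoring reports a problem and the workflow must stop."""


def advance(current: Step, healthy: bool, soak_days_elapsed: int = 0) -> Step:
    """Return the step a cluster should be in after finishing `current`."""
    if current is Step.DONE:
        return current
    if not healthy:
        raise UpgradeHalted(f"cluster unhealthy during {current.name}")
    if current is Step.SOAK and soak_days_elapsed < SOAK_DAYS:
        return Step.SOAK  # keep observing until the soak period is over
    return Step(current.value + 1)


if __name__ == "__main__":
    step = Step.ADD_V8_REPLICAS
    while step is not Step.DONE:
        print(step.name)
        step = advance(step, healthy=True, soak_days_elapsed=SOAK_DAYS)
```

Keeping the soak period as an explicit state makes it impossible for the workflow to reach the promotion step without the observation window having elapsed.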

The above process is broken into four stages:

Pre-Maintenance: During this stage, the cluster is prepared for upgrade by adding MySQL v8.0 nodes as replicas, which operate alongside existing v5.7 nodes to serve real production traffic.

Figure 2: Pre-Maintenance Stage.

System Monitoring: Newly added MySQL v8.0 nodes serve as replicas, allowing for real production traffic to be monitored. Any deviations from expected behavior are noted and addressed.

Maintenance: Once the system monitoring stage is successfully completed, a MySQL v8.0 node is promoted to primary status, and system stability is monitored.

Figure 3: Maintenance Stage.

Post-Maintenance: In the final stage, non-replicating MySQL v5.7 nodes are deleted, resulting in a pure MySQL v8.0 cluster.

Figure 4: Post-Maintenance Stage.

Even with a gradual rollout strategy, we needed the ability to roll back at every step, along with the observability to identify signals indicating when a rollback was needed. We prioritized minimizing risk and ensuring data integrity throughout the upgrade process. Until the Maintenance stage, all actions are fully reversible without any risk of data loss. Should our customers encounter service degradation due to factors like high latency or CPU usage, we can revert to MySQL v5.7 immediately with no data loss: by simply deleting or disabling the MySQL v8.0 replica nodes introduced during the Pre-Maintenance stage, we return to the previous state.

However, once a MySQL v8.0 node is promoted to primary status, replication to MySQL v5.7 nodes ceases. This transition marks a point of no return in terms of compatibility with MySQL v5.7.

Attempting to revert to a MySQL v5.7 primary after this stage would entail potential data loss, as any changes made on the MySQL v8.0 primary would not be replicated back to the MySQL v5.7 nodes. Therefore, careful consideration and thorough testing preceded the promotion of a MySQL v8.0 node to primary status, ensuring a smooth transition while safeguarding our data integrity.
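The rollback rule described here can be sketched as a small decision function. Everything in the sketch is illustrative: the stage names follow the four stages of the upgrade, the signals are the latency and CPU examples mentioned above, and the threshold values and action names are hypothetical placeholders rather than Uber's actual alerting configuration.

```python
from enum import IntEnum


class Stage(IntEnum):
    PRE_MAINTENANCE = 1
    SYSTEM_MONITORING = 2
    MAINTENANCE = 3
    POST_MAINTENANCE = 4


def rollback_is_lossless(stage: Stage) -> bool:
    """Before a v8.0 node is promoted, the v5.7 primary still holds all writes."""
    return stage < Stage.MAINTENANCE


def rollback_decision(
    stage: Stage,
    p99_latency_ms: float,
    cpu_percent: float,
    max_latency_ms: float = 50.0,  # hypothetical threshold
    max_cpu_percent: float = 80.0,  # hypothetical threshold
) -> str:
    """Map observed signals to an action name (all action names are made up)."""
    degraded = p99_latency_ms > max_latency_ms or cpu_percent > max_cpu_percent
    if not degraded:
        return "continue"
    if rollback_is_lossless(stage):
        # Disabling or deleting the v8.0 replicas restores the previous state.
        return "disable_v8_replicas"
    # After promotion, reverting to v5.7 could lose writes, so involve a human.
    return "escalate"


if __name__ == "__main__":
    print(rollback_decision(Stage.SYSTEM_MONITORING, 120.0, 40.0))
    print(rollback_decision(Stage.MAINTENANCE, 120.0, 40.0))
```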

We systematically advanced through each tier, commencing from tier 5 and descending to tier 0. At every tier, we organized the clusters into manageable batches, ensuring a systematic and controlled transition process. Before embarking on each stage of the version upgrade, we actively involved the on-call teams responsible for each cluster, fostering collaboration and ensuring comprehensive oversight.
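The tier-by-tier, batch-by-batch ordering can be sketched as follows. This is a minimal illustration that assumes each cluster is described by a name and a tier number; the cluster names and the batch size are made up, and how batches were actually sized is not specified here.

```python
from typing import List, Tuple


def plan_batches(clusters: List[Tuple[str, int]], batch_size: int) -> List[List[str]]:
    """Order clusters from the highest tier number down to tier 0, in batches.

    `clusters` is a list of (cluster_name, tier) pairs. Batches never mix tiers.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    plan: List[List[str]] = []
    for tier in sorted({t for _, t in clusters}, reverse=True):
        names = sorted(name for name, t in clusters if t == tier)
        for start in range(0, len(names), batch_size):
            plan.append(names[start:start + batch_size])
    return plan


if __name__ == "__main__":
    # Hypothetical cluster names and tiers.
    fleet = [("payments", 0), ("logs-a", 5), ("logs-b", 5), ("logs-c", 5), ("reports", 3)]
    for batch in plan_batches(fleet, batch_size=2):
        print(batch)
```

Because batches never span tiers, every cluster in a given tier finishes before the next tier is started.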

This deliberate and structured methodology allowed us to effectively navigate the complexities inherent in upgrading our MySQL fleet. By prioritizing coordination, communication, and teamwork, we successfully traversed through each tier, seamlessly transitioning to MySQL version 8.0.

Upgrading to MySQL 8.0 brought not only new features, but also some unexpected changes in query execution plans for certain clusters. This resulted in increased latencies and resource consumption, potentially impacting user experience. One affected cluster was the one that powers all the dashboards running at Uber. To address this issue, we collaborated with Percona, identified a patch, and successfully applied it to the affected clusters. The fix restored optimized query performance and resource efficiency on the upgraded MySQL version.

The transition from MySQL version 5.7 to version 8.0 introduced syntax changes for certain keywords, disrupting some queries in production. Additionally, a notable portion of our existing clusters did not have the “STRICT_TRANS_TABLES” SQL mode enabled, which is a default setting in MySQL 8.0. This absence resulted in errors for many customers during the upgrade procedure. Similarly, challenges emerged with the “ONLY_FULL_GROUP_BY” SQL mode, underscoring the necessity for meticulous configuration modifications to ensure compatibility with the specifications of the upgraded version. 
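One way to surface SQL mode gaps ahead of an upgrade is to compare a cluster's current `sql_mode` value with the MySQL 8.0 default. The sketch below does that comparison on plain strings; the default mode list is the documented MySQL 8.0 default, while the function names and the sample input are hypothetical. In practice the current value would come from running `SELECT @@GLOBAL.sql_mode` on the cluster.

```python
from typing import List, Set

# Default sql_mode in MySQL 8.0.
MYSQL_80_DEFAULT_SQL_MODE: Set[str] = {
    "ONLY_FULL_GROUP_BY",
    "STRICT_TRANS_TABLES",
    "NO_ZERO_IN_DATE",
    "NO_ZERO_DATE",
    "ERROR_FOR_DIVISION_BY_ZERO",
    "NO_ENGINE_SUBSTITUTION",
}


def parse_sql_mode(value: str) -> Set[str]:
    """Split a comma-separated sql_mode string into a set of mode names."""
    return {mode.strip().upper() for mode in value.split(",") if mode.strip()}


def modes_missing_before_upgrade(current_sql_mode: str) -> List[str]:
    """Modes that are on by default in 8.0 but absent from the current setting."""
    return sorted(MYSQL_80_DEFAULT_SQL_MODE - parse_sql_mode(current_sql_mode))


if __name__ == "__main__":
    # Hypothetical v5.7 cluster setting.
    print(modes_missing_before_upgrade("NO_ENGINE_SUBSTITUTION"))
```

Any mode reported by this check is a behavior change that queries will meet after the upgrade, unless the v8.0 nodes are configured to keep the old `sql_mode`.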

In MySQL 8.0, the default character set is utf8mb4, accompanied by the utf8mb4_0900_ai_ci collation. In contrast, the preceding MySQL 5.7 version employed the utf8mb4_unicode_520_ci collation and does not support the newer utf8mb4_0900_ai_ci. This difference introduced challenges in aligning collation settings across the upgraded system.

Library Upgrade Requirement: Many existing client libraries were incompatible with MySQL v8.0. To address this, we had to upgrade these libraries, conduct thorough testing to ensure their proper functionality in a staging environment, and subsequently proceed with the primary upgrade. This step was crucial to guarantee a seamless transition without compromising client interactions.

With the new version, we can harness the following performance improvements: 

29% improvement in p99 latency for 1 million inserts at 1024 threads.

Figure 5: MySQL v5.7 vs. v8.0 at 1M inserts, 1024 threads.

33% improvement in p99 latency for 1 million reads at 1024 threads.

Figure 6: MySQL v5.7 vs. v8.0 at 1M reads, 1024 threads.

47% improvement in p99 latency for 1 million updates at 1024 threads.

Figure 7: MySQL v5.7 vs. v8.0 at 1M updates, 1024 threads.

~94% reduction in overall database lock time.

Figure 8: Reduced lock time post upgrade.

~78% reduction in query time for some queries.

Figure 9: Improved query time post upgrade.

Throughout the comprehensive upgrade journey, which spanned over a year, our dedicated team of engineers from the MySQL Team flawlessly navigated through a series of critical stages.

The monumental task of transitioning our entire fleet to MySQL 8.0 encompassed not only staging clusters, but also production clusters supporting Uber and internal tool instances. This extensive upgrade underscores the indispensable role played by our observability platform, testing regimen, and robust rollback capabilities.

Our meticulous testing procedures and phased rollout strategy proved invaluable, allowing us to unearth and address potential issues early on. By adopting this approach, we significantly mitigated the risk of encountering new failure modes during the primary upgrade phase.

The journey of upgrading our MySQL fleet at Uber to version 8.0 has been challenging but rewarding. By embracing the latest technology and leveraging automation, we’ve not only ensured the security and performance of our database infrastructure, but also demonstrated our commitment to innovation and excellence.

The upgrade process, meticulously planned and executed by our dedicated team of engineers, underscores our unwavering dedication to maintaining the highest standards of reliability and efficiency. Through careful consideration of the benefits and challenges, we successfully navigated the transition, mitigating risks and minimizing disruptions to our services.

As we reflect on this milestone achievement, we extend our gratitude to all those who contributed to the success of this endeavor. Together, we remain committed to pushing boundaries, driving innovation, and shaping the future of technology at Uber.
