Challenges and Opportunities to Dramatically Reduce the Cost of Uber’s Big Data
Big data is at the core of Uber’s business. We continue to innovate and provide better experiences for our earners, riders, and eaters by leveraging big data, machine learning, and artificial intelligence technology. As a result, over the last four years, the scale of our big data platform multiplied from single-digit petabytes to many hundreds of petabytes.
Uber’s big data stack is built on top of the open source ecosystem. We run some of the largest deployments of Hadoop, Hive, Spark, Kafka, Presto, and Flink in the world. Open source software allows us to quickly scale up to meet Uber’s business needs without reinventing the wheel.
The cost of running our big data platform also rose significantly in that same period: the Big Data Platform became one of the 3 most costly internal platforms at Uber. That was when we started taking a serious look at our big data platform's cost, aiming to reduce overhead while preserving the reliability, productivity, and value it provides to the business.
At the end of 2019, we exceeded our goal by saving the company many millions of dollars worth of hardware. In early 2020, we developed a roadmap for the next 2 years, with the goal of cutting down our big data costs even further. Over the following three posts, we will outline how we successfully brought down Uber's big data overhead from a top-level perspective, illustrate particular examples of what worked, and lay out plans for future development and research in this area.
Motivation and Background
Uber’s Exponential Business Growth
Uber started in 2010, and completed its first billion trips around the end of 2015. Not long after, by mid-2018, that number had already risen to 10 billion trips. These trips took place in more than 21 countries across 5 continents. The lines of business at Uber also expanded significantly and the product became more sophisticated. This resulted in exponential growth of the data being logged, along with a demand for compute capacity to analyze that data.
Next-Generation ML and AI Applications
On the other hand, Uber's exponential growth would not have been possible without our work on leveraging next-generation ML and AI technologies to optimize our product and user experiences. In order to provide a great experience for Uber's users, many decisions need to be made automatically and accurately based on data. From estimating the demand and supply of Uber's marketplace to figuring out the ETA of a driver picking up a rider, big data plays a very important role.
To support the business use cases above, the ML and AI technologies created and adopted at Uber have proliferated. Uber built and open-sourced Horovod, our distributed deep learning framework, back in late 2017. Uber's homegrown ML platform, Michelangelo, has also scaled, evolved, and expanded significantly over time. These use cases have significantly increased both the storage and compute demands on our big data platform, on top of Uber's business growth.
Lower Priority of Cost Efficiency
Just as in any fast-scaling company, the Uber Big Data Platform team's main focus was on enabling new applications and use cases as quickly as possible to unblock our internal customers. For example, there was a laser focus on scalability and availability to help our platform scale by 100x amid a proliferation of technologies. This let our platform's users get what they needed with the same reliability as before. While that focus helped Uber's top-line revenue grow, it meant we didn't have as much bandwidth or attention for the bottom line (the hardware cost, in this case). When the footprint of big data at Uber was small 5 years ago, it was only a small part of Uber's overall infrastructure cost. However, in 3 short years, the Big Data Platform had emerged as the most expensive infrastructure platform at Uber by early 2019.
As a result of all this growth, we started to seriously assess the Big Data Platform’s costs. Below we will describe the top challenges that we identified, and the overall strategy we devised to address them.
Challenges
On-Prem vs Cloud
Uber has largely operated out of on-prem data centers since the company was founded in 2010, with a small cloud presence, usually for specialized use cases. Uber's Big Data Platform is mostly on-prem as well.
The first challenge we identified was to figure out whether moving to the Cloud would significantly reduce hardware costs. This is a hot topic in the industry: we have seen many companies moving into the Cloud, and also some big players moving, at least partially, out of it, with cost being one of the big considerations.
So is On-Prem or Cloud cheaper? In short, we found that there is no one-size-fits-all answer; later we will share our thinking on leveraging the strengths of both On-Prem and Cloud to lower the overall cost.
Figure 1: On-Prem vs Cloud
Multi-Tenant Problem
Uber’s Big Data Platform has thousands of internal users, ranging from deep technical experts to non-technical business users. These include backend engineers, product engineers, ML engineers, data engineers, data analysts, data scientists, product managers, business analysts, city operations, and marketing managers. They belong to different parts of the organization, working on different lines of businesses within Uber.
Such a large and heterogeneous user base negatively impacts cost efficiency for several reasons: first, we cannot understand all the use cases deeply; second, it's difficult to decide which users should get resources when we are constrained on Big Data storage and compute capacity; and third, the lack of detailed ownership information and the multiple levels of dependencies make it hard to determine which users should be responsible, or pay, for the cost of their usage.
Figure 2: Multi-Tenant Problem
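As a rough illustration of the attribution problem, the sketch below rolls hypothetical per-job usage records up into per-team cost, assuming ownership metadata is available. The record fields, rates, and team names are invented for illustration, not Uber's actual schema or chargeback model; note how anything without an owner falls into an "unattributed" bucket, which is exactly what makes chargeback hard in practice.

```python
from collections import defaultdict

# Hypothetical per-job usage records; field names, numbers, and teams are invented.
usage_records = [
    {"job": "etl_trips_daily", "team": "data-eng", "vcore_hours": 1200, "gb_stored": 50000},
    {"job": "eta_model_training", "team": "ml-platform", "vcore_hours": 8000, "gb_stored": 2000},
    {"job": "adhoc_city_report", "team": None, "vcore_hours": 300, "gb_stored": 100},  # unknown owner
]

COMPUTE_RATE = 0.05   # assumed $ per vcore-hour
STORAGE_RATE = 0.002  # assumed $ per GB-month

def attribute_costs(records):
    """Sum estimated cost per owning team; usage without an owner lands in 'unattributed'."""
    costs = defaultdict(float)
    for r in records:
        owner = r["team"] or "unattributed"
        costs[owner] += r["vcore_hours"] * COMPUTE_RATE + r["gb_stored"] * STORAGE_RATE
    return dict(costs)

print(attribute_costs(usage_records))
```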
Disaster Recovery
To meet the high availability necessary for a worldwide transportation and delivery platform, most of Uber's infrastructure is based on an active-active architecture. Back in late 2016, Uber's Big Data Platform was extended from a single-region to an active-active architecture. This supports both a Double Compute mode, where a single job (e.g., ETL queries) is executed in both regions, and a Single Compute + Replicate mode, where a job is executed in one region only and its output data sets (if any) are replicated to the other region.
The good thing about the active-active architecture is its high availability. In fact, we avoided 2 potentially serious outages in our Big Data Platform in 2017 alone because of it: one due to data center air conditioning issues in the hot summer, the other due to a manual operational mistake, both of which affected one of the two regions. However, the active-active architecture also introduces a significant amount of cost inefficiency. For Double Compute, we use double the amount of compute resources. For Single Compute + Replicate, we still need to reserve compute resources in the other region in case a failover happens.
How can we avoid paying double the cost for Disaster Recovery? How can we leverage the additional compute capacity required by Disaster Recovery to run some of our workload? That is exactly the challenge here.
Figure 3: Disaster Recovery
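To make the cost implication concrete, here is a back-of-envelope sketch (illustrative assumptions only, not our actual capacity model) of the compute footprint under the two modes, assuming a baseline number of compute units needed to run every job exactly once:

```python
def dr_compute_footprint(baseline_units, double_compute_fraction):
    """
    Back-of-envelope disaster-recovery footprint (illustrative model only).

    baseline_units: compute needed to run every job exactly once.
    double_compute_fraction: fraction of that work run in both regions.

    Double Compute jobs consume 2x their baseline. Single Compute + Replicate
    jobs run once, but equivalent capacity is still reserved in the other
    region for failover, so with naive reservations both modes cost ~2x.
    """
    double_compute = 2 * baseline_units * double_compute_fraction
    single_plus_reserve = 2 * baseline_units * (1 - double_compute_fraction)
    return double_compute + single_plus_reserve

# Regardless of the mode mix, the naive footprint is ~2x the baseline:
print(dr_compute_footprint(100.0, 0.3))  # -> 200.0
```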
P99 and Average Utilization
Reliable, low-latency systems require a low P99 utilization (i.e., few hot machines in the cluster), because once P99 utilization is high, some requests or queries take a lot longer due to queuing effects. That can severely affect user experiences. As a result, most of our system's footprint is determined by the capacity needed to keep P99 utilization below a target number.
On the other hand, from the cost efficiency point of view, what matters is the average utilization of the hardware resources. For example, if the average utilization of CPU is 30%, and we find a way to increase it to 60% without affecting latency requirements, it usually means that we can cut down the capacity by half.
The challenge with P99 and average utilization is that there is usually a big gap between the two values. Sometimes the P99 utilization can be many times higher than the average utilization. This applies to disk space utilization, CPU/memory utilization, network utilization, and also disk IOPS utilization, which affects the P99 request wait time.
Figure 4: P99 and average of the 10-minute moving-average CPU utilization (%), YARN cluster
If we are able to effectively balance the workload across all machines, 24 hours a day, then P99 and average utilization can be pretty close. Given a fixed P99 target, this implies that the average utilization can go up a lot, and we can thus reduce the size of our hardware fleet to gain cost efficiency.
Reducing the gap between P99 and average utilization will not be easy, however, as we will discuss in the subsequent solutions section.
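As a simple illustration of the gap (using synthetic utilization samples, not our production data), the sketch below computes the average and P99 CPU utilization across a fleet, along with the capacity reduction implied by raising the average from 30% to 60% at a fixed P99 target, as in the example above:

```python
import random

random.seed(7)

# Synthetic per-host CPU utilization samples (%): many lightly loaded hosts, a few hot ones.
samples = [random.betavariate(2, 6) * 100 for _ in range(1000)]

def percentile(values, p):
    """Nearest-rank percentile; good enough for an illustration."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[idx]

avg = sum(samples) / len(samples)
p99 = percentile(samples, 99)
print(f"average = {avg:.1f}%, P99 = {p99:.1f}%")

# If better load balancing raised the average from 30% to 60% at the same P99
# target, the fleet could shrink roughly in proportion:
current_avg, achievable_avg = 30, 60
print(f"capacity needed: {current_avg / achievable_avg:.0%} of today's fleet")
```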
Priority and Quality of Service
P99 is one way to measure the performance of a system under stress. Sometimes achieving a low P99 for all workloads is not easy due to limited capacity and daily/weekly workload patterns. That's when concepts like priority and quality of service come into the picture.
How do we prioritize jobs to offer a high quality of service for the most important jobs when hardware resources are constrained? How do we make sure that we have sufficient capacity to handle spikes in demand without incurring too much cost when idle? Can we provide guaranteed quality of service to a subset of our workload, as well as low-cost, opportunistic hardware resources for ad-hoc jobs? These are again hard questions to answer, but they are critical to the cost efficiency effort.
Figure 5: Priority and Quality of Service
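One common pattern, sketched below with invented queue names and numbers rather than our scheduler's actual configuration, is to reserve guaranteed capacity for high-priority queues and let ad-hoc work run opportunistically on whatever is left over, where it may be throttled or preempted:

```python
from dataclasses import dataclass

@dataclass
class Queue:
    name: str
    guaranteed_share: float  # fraction of the cluster reserved for this queue
    demand: float            # fraction of the cluster this queue currently wants

def allocate(queues, adhoc_demand):
    """Serve guaranteed queues first; ad-hoc work only gets whatever is left over."""
    remaining = 1.0
    allocation = {}
    for q in queues:
        granted = min(q.demand, q.guaranteed_share, remaining)
        allocation[q.name] = granted
        remaining -= granted
    allocation["adhoc (opportunistic)"] = min(adhoc_demand, remaining)
    return allocation

queues = [
    Queue("tier-1 ETL", guaranteed_share=0.5, demand=0.4),
    Queue("ML training", guaranteed_share=0.3, demand=0.35),
]
print(allocate(queues, adhoc_demand=0.5))
```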
Analytics CAP Equation
After reviewing the challenges above, we realized that there is something more fundamental here. Even with an ultimately optimized big data system, there will still be an unavoidable trade-off among Cost-Efficiency, Accuracy, and Performance. That's what we call the Analytics CAP Equation:
Cost-Efficiency × Accuracy × Performance = Constant
For example, if we reduce the accuracy of query results using sampling, we can improve the performance of the query, the cost-efficiency of the query, or both. If we reduce the performance of the query (thus letting it run longer), we can improve the accuracy of the query by processing 100% of the data, improve its cost-efficiency by leveraging cheaper (but not guaranteed) compute resources, or both. If we reduce cost-efficiency (by having a higher budget for the query), we can improve query accuracy by processing 100% of the data, or improve performance by requesting more compute resources.
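As a toy model of the sampling trade-off (a sketch under simplified assumptions, not how our query engines actually budget work): cost and runtime scale roughly linearly with the fraction of data scanned, while the statistical error of an aggregate such as a mean shrinks only as the square root of the number of sampled rows.

```python
import math

def sampling_tradeoff(sample_fraction, full_cost=100.0, full_runtime_s=600.0, full_rows=1_000_000_000):
    """Toy model: cost and runtime are linear in data scanned; relative error ~ 1/sqrt(rows sampled)."""
    rows = full_rows * sample_fraction
    return {
        "cost": full_cost * sample_fraction,
        "runtime_s": full_runtime_s * sample_fraction,
        "relative_error": 1.0 / math.sqrt(rows),  # up to a distribution-dependent constant
    }

for fraction in (1.0, 0.1, 0.01):
    print(fraction, sampling_tradeoff(fraction))
```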
While this equation may seem obvious, there are practical challenges in leveraging it. First of all, how do we know what the user wants when a query is submitted? Can we infer the user's preference, or do we need to ask users to set it explicitly? Second, even with the same user preference, the global status of the Big Data clusters may influence the ideal choice. For example, if the cluster is extremely busy, we would want to reduce accuracy in exchange for performance and cost-efficiency.
Strategy
The challenges above helped us realize that Big Data cost efficiency is really a holistic problem. It's not sufficient to optimize in a single direction, because as soon as we squeeze out the inefficiency there, another challenge will become the next biggest bottleneck of cost efficiency.
However, it's cumbersome to think about those challenges one by one without the risk of duplication or blind spots. That's why we developed the following strategic framework to characterize the types of challenges we are facing:
| Supply | Platform Efficiency | Demand |
| --- | --- | --- |
| On-Prem vs Cloud | P99 and Average Utilization | Multi-Tenant Problem |
| Disaster Recovery | Analytics CAP Equation | Priority and Quality of Service |
| … | … | … |
Table 1: Cost Efficiency Strategy Framework
We often consider Uber as a transportation platform with supply, demand, and platform efficiency problems. Interestingly, our Big Data Platform is very similar. Here we use Supply to describe the hardware resources made available to run big data storage and compute workloads, Demand to describe those workloads, and Platform Efficiency to describe how efficiently our Big Data Platform matches supply with demand. The challenges described in this article, and others, usually fit well into this strategic framework.
Summary
In this post, we went over the top challenges and opportunities in reducing the cost of Uber’s big data, including storage, scale, recovery, and utilization. Through analysis we realized that these problems were all interrelated, and that optimizing too much for one could come at the expense of the others, so we developed a holistic strategic framework for improving the cost efficiency of our big data, with 3 pillars: supply, platform efficiency, and demand. We will apply this framework and demonstrate some of the successful solutions that we have generated with it in an upcoming post.