Uber deploys a few storage technologies to store business data based on their application model. One such technology is called Schemaless, which enables the modeling of related entries in one single row of multiple columns, as well as versioning per column.
Schemaless has been around for a couple of years, amassing Uber’s data. While Uber is consolidating all the use cases on Docstore, Schemaless is still the source of truth for different pre-existing customer pipelines. As such, Schemaless uses fast (but expensive) underlying storage technology to enable millisecond-order latency at high QPS. Furthermore, Schemaless deploys a number of replicas per region to ensure data durability and availability in the face of different failure models.
As it accumulated more data on expensive storage, Schemaless increasingly became a key cost concern and required attention. To this end, we carried out measurements to understand data access patterns. We found that data is frequently accessed for a period of time, after which it is accessed less frequently. The exact period varies from one use case to another; however, old data must still be readily available upon request.
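As a rough illustration of the kind of measurement described above, the sketch below computes what fraction of reads touch data older than a chosen age. The function name, the input shape (pairs of read time and write time), and the 90-day threshold in the usage note are all hypothetical; this is not the tooling we actually used.

```python
from datetime import datetime, timedelta


def cold_read_fraction(reads, threshold):
    """Return the fraction of reads that hit data older than `threshold`.

    `reads` is an iterable of (read_time, written_time) datetime pairs.
    This input format is an assumption made for the example.
    """
    total = 0
    cold = 0
    for read_time, written_time in reads:
        total += 1
        # A read is "cold" when the row was written longer ago than the threshold.
        if read_time - written_time > threshold:
            cold += 1
    # Avoid dividing by zero when there are no reads at all.
    return cold / total if total else 0.0
```

Calling `cold_read_fraction(reads, timedelta(days=90))` for several thresholds would show how quickly access frequency falls off with data age for a given use case.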
Uber’s GSS (Global Scaled Solutions) team runs scaled programs for diverse products and businesses, including but not limited to Eats, Rides, and Freight. The team transforms Uber’s ideas into agile, global solutions by designing and implementing scalable solutions. One of the areas of expertise within GSS is the Digitization vertical. The Digitization team efficiently converts physical signals into digital assets and provides services in labeling, in-field testing, data curation and validations for maps, product incubation, freight BOL (bill of lading), Eats menu uploads, etc.
All these digitization services are performed by thousands of human operators working on our internal applications across many locations around the globe. While an operator is digitizing data, our backend collects a clickstream of all the user interactions in the form of raw events, at a scale of 10 million events per day, in AWS (Amazon Web Services) cloud infrastructure. Sometimes this data is also moved to Uber’s own data centers. Our data analytics team analyzes this data to improve and tune the process, augment tooling infrastructure, address operator motivation, and improve operator skills. Analytics is usually performed by querying big data lakes and using different frontend tools for visualization. Generally, any analytics setup has a source-to-user latency component, and the latency of our existing (pre-COVID) infrastructure was 1 hour. With the onset of the COVID-19 crisis, the digitization process had to be transitioned to work-from-home mode, which added the operational complexity of remotely managing a huge workforce of operators. This complexity created a gap in the team’s communication, decision making, and collaboration. Where the 1-hour latency of our analytics platform was previously acceptable, real-time analytics was needed to fill this gap. This blog describes how we improved the latency of our data architecture by building a real-time analytics system.
While we researched approaches used for building real-time dashboards (example), we did not find an end-to-end solution that addressed how rich visualization can be achieved at lower cost. We considered different visualization approaches and also looked at commercial solutions to come up with our choices. Another differentiating aspect was that our solution also addresses the need for a “single source of truth” on Amazon S3 (Amazon’s “simple storage service”), from which both streaming and batch-processed dashboards would be sourced, rather than hooking directly into the Amazon Kinesis Data Firehose stream itself. This intermediate storage lets us recover data (for the streaming window) with a replay. We production-tested our visualizations with thousands of users for low load times and reliability.
Uber has one of the largest deployments of Apache Kafka in the world, processing trillions of messages and multiple petabytes of data per day. As Figure 1 shows, today we position Apache Kafka as a cornerstone of our technology stack. It empowers a large number of different workflows, including pub-sub message buses for passing event data from the rider and driver apps, streaming analytics (e.g., Apache Flink, Apache Samza), streaming database changelogs to the downstream subscribers, and ingesting all sorts of data into Uber’s Apache Hadoop data lake.
Data is crucial for our products. Data analytics help us provide a frictionless experience to the people that use our services. It also enables our engineers, product managers, data analysts, and data scientists to make informed decisions. The impact of data analysis can be seen in every screen of our app: what is displayed on the home screen, the order in which products are shown, what relevant messages are shown to users, what is stopping users from taking rides or signing up, and so on.
With such a huge user base and wide range of features, support across all geographic regions is a complicated problem to solve. Furthermore, our app keeps expanding with new products, which mandates that the underlying tech also be flexible enough to evolve and support them.
Data is the primary tool enabling this. The following article will focus on rider data in particular: how we collect and process it, and how that has informed concrete improvements to the Rider app.
Drivers within the same area may have quite different earnings, depending on the trips they take. For example, consider two hypothetical drivers in downtown San Francisco. Two riders request two rides: one is within downtown San Francisco, and the other is to Oakland, as shown in the image above. The distances for the two trips are similar. If we price the trips based on distance alone, both drivers will make the same amount of money for the current trip, but the driver going to Oakland will be less likely to get more trips there. Drivers tend to reject such trips if they have other choices. To reduce the variance of earnings for colocated drivers and the cancellation rate for trips going to non-busy areas, we price these trips differently, based on the network effect.
Both the rider and driver pricing flows are being changed to compute network adjustments in real time. Both these pricing systems receive adjustments based on a common network model, which returns the relative change in GB (Gross Bookings) of enabling a specific trip, compared with an average trip from that same origin.
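As a minimal sketch of the quantity described above, a relative change can be written as the difference between the gross bookings attributed to a specific trip and the average for trips from the same origin, divided by that average. The function and parameter names are hypothetical, and this closed-form ratio is only an illustration of what "relative change" means, not the network model itself.

```python
def relative_gb_change(trip_gb: float, origin_avg_gb: float) -> float:
    """Relative change in gross bookings (GB) versus the average trip
    from the same origin.

    Both inputs are assumed to be in the same currency unit; the names
    are illustrative and do not come from the production system.
    """
    if origin_avg_gb <= 0:
        raise ValueError("origin_avg_gb must be positive")
    return (trip_gb - origin_avg_gb) / origin_avg_gb
```

Under this reading, a trip to a non-busy area would yield a negative value, and a trip that positions the driver well for follow-on requests would yield a positive one.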
The network model used requires some NRT (Near Real-Time) features. In this document, we will introduce some of the challenges we faced and how we solved them when building the real-time pipelines for computing and serving these features to online models.
Content quality is critical to the support experienced by Uber’s customers. Consider an Eater who reached out for help to cancel a very delayed order. The same resolution, such as refunding the charge, can be delivered alongside a robotic-sounding message, or one where the style and tone of the response conveys true empathy and acknowledges the user’s poor experience on our platform.
As the natural expression of the support experience, and a bearer of brand promise, support content affects how people feel, and plays a major part in how they perceive the Uber brand. In addition, support content plays the role of educating users about the product behavior and about our policies while moving them to action. Finally, support content (such as knowledge base articles) also serves to deflect commonly asked questions and reduce the number of contacts handled by our agents. In short, support content is what a disgruntled customer first sees, and hence it is imperative that this content can placate and soothe the customer, whilst also resolving the core issue, transforming customer ire into customer delight.
Uber’s Customer Care platform currently supports content across different business verticals including Uber Mobility (Rider, Driver), Uber Delivery (Eater, Courier, and Merchants), Uber For Business (Organizations and Employees), Uber Freight (Carrier, Shipper), etc.
Our engineers have the responsibility of ensuring a consistent and positive experience for our riders, drivers, eaters, and delivery/restaurant partners.
Ensuring such an experience requires reliable systems: our apps have to work when anyone needs them. A major component of reliability is having engineers on call to deal with problems immediately as they arise. We set up our on-call engineers for success through training, tools, and processes.
In this article we will provide an overview of how we at the Eats Safety team ensure that our engineers are fully equipped to provide prompt, high-quality service—anywhere in the world, 24/7.
With Uber’s business growth and the fast adoption of big data and AI, Big Data scaled to become our most costly infrastructure platform. To reduce operational expenses, we developed a holistic framework with 3 pillars: platform efficiency, supply, and demand (using supply to describe the hardware resources that are made available to run big data storage and compute workload, and demand to describe those workloads). In this post, we will share our work on managing supply and demand. For more details about the context of the larger initiative and improvements in platform efficiency, please refer to our earlier posts: Challenges and Opportunities to Dramatically Reduce the Cost of Uber’s Big Data, and Cost-Efficient Open Source Big Data Platform at Uber.
As Uber’s business has expanded, the underlying pool of data that powers it has grown exponentially, and has thus become ever more expensive to process. When Big Data rose to become one of our largest…
Big data is at the core of Uber’s business. We continue to innovate and provide better experiences for our earners, riders, and eaters by leveraging big data, machine learning, and artificial intelligence technology. As a result, over the last four years, the scale of our big data platform multiplied from single-digit petabytes to many hundreds of petabytes.
Uber’s big data stack is built on top of the open source ecosystem. We run some of the largest deployments of Hadoop, Hive, Spark, Kafka, Presto, and Flink in the world. Open source software allows us to quickly scale up to meet Uber’s business needs without reinventing the wheel.
The cost of running our big data platform also rose significantly in that same period. The Big Data Platform was one of the most costly among the 3 internal platforms at Uber. That was when we started taking a serious look at our big data platform’s cost, aiming to reduce overhead while preserving the reliability, productivity and the value it provides to the business.
Uber delivers efficient and reliable transportation across the global marketplace, which is powered by hundreds of services, machine learning models, and tens of thousands of datasets. While growing rapidly, we’re also committed to maintaining data quality, as it can greatly impact business operations and decisions. Without data quality guarantees, downstream service computation or machine learning model performance quickly degrades, and investigating and backfilling poor data requires a lot of laborious manual effort. In the worst cases, degradations could go unnoticed, silently resulting in inconsistent behaviors.
This led us to build a consolidated data quality platform (UDQ), with the purpose of monitoring, automatically detecting, and handling data quality issues. With the goal of building and achieving data quality standards across Uber, we have supported over 2,000 critical datasets on this platform, and detected around 90% of data quality incidents. In this blog, we describe how we created data quality standards at Uber and built the integrated workflow to achieve operational excellence.
For a company of our size and scale, robust, accurate, and compliant accounting and analytics are a necessity, ensuring accurate and granular visibility into our financials, across multiple lines of business.
Most standard, off-the-shelf finance engineering solutions cannot support the scale and scope of the transactions on our ever-growing platform. The ride-sharing business alone has over 4 billion trips per year worldwide, which translates to more than 40 billion journal entries (financial microtransactions). Each of these entries has to be produced in accordance with Generally Accepted Accounting Principles (GAAP), and managed in an idempotent, consistent, accurate, and reproducible manner.
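By the figures above, each trip produces roughly ten journal entries on average (40 billion entries over 4 billion trips). To illustrate what managing entries idempotently can mean, here is a minimal sketch in which posting the same entry twice has no additional effect. The `JournalEntry` and `Ledger` classes, their fields, and the choice of a trip ID plus line number as the deduplication key are all assumptions made for this example, not a description of our platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalEntry:
    trip_id: str
    line_no: int        # position of this entry among the trip's entries
    account: str
    amount_cents: int   # integer cents avoid floating-point rounding

    @property
    def entry_id(self) -> str:
        # Deterministic key: the same logical entry always maps to the same ID.
        return f"{self.trip_id}:{self.line_no}"


class Ledger:
    """In-memory stand-in for a durable store (hypothetical)."""

    def __init__(self) -> None:
        self._entries = {}  # entry_id -> JournalEntry

    def post(self, entry: JournalEntry) -> bool:
        """Record the entry once; return False if it was already recorded."""
        if entry.entry_id in self._entries:
            return False
        self._entries[entry.entry_id] = entry
        return True

    def balance(self, account: str) -> int:
        """Sum of all recorded amounts for one account, in cents."""
        return sum(
            e.amount_cents for e in self._entries.values() if e.account == account
        )
```

Because the key is derived from the entry itself, a retried or replayed computation can post its output again without double counting.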
To meet these specific requirements, we built an in-house solution, Uber’s Finance Computation Platform (FCP), designed to accommodate our scale while providing strong guarantees on accuracy and explainability. The same solution also helps us obtain insights into business operations.
There were many challenges in building our financial computation platform, from our architectural choices to the types of controls for accuracy and explainability.
Apache Pinot is an open source data analytics engine (OLAP), which allows users to query data ingested from as recently as a few seconds ago to as old as a few years back. Pinot’s ability to ingest real-time data and make it available for low-latency queries is the key reason why it has become an important component of Uber’s data ecosystem. Many products built at Uber require real-time data analytics to operate in our mobile marketplace for shared rides and food delivery. For example, the chart in Figure 1 shows the breakdown of Uber Eats job states over a period of minutes. Our Uber Eats city operators need such insights to balance marketplace supply and demand, and detect ongoing issues.
Uber’s mission is to help our consumers effortlessly go anywhere and get anything in thousands of cities worldwide. At its core, we capture a consumer’s intent and fulfill it by matching it with the right set of providers.
Fulfillment is the “act or process of delivering a product or service to a customer.” The Fulfillment organization at Uber develops platforms to orchestrate and manage the lifecycle of ongoing orders and user sessions with millions of active participants.
In 2019, we started a journey to re-architect the Hadoop deployment stack. Fast forward 2 years, and over 60% of Hadoop runs in Docker containers, bringing major operational benefits to the team. As a result of the initiative, the team handed off many of their responsibilities to other infrastructure teams, and was able to focus more on core Hadoop development.
This article provides a summary of problems we faced, and how we solved them along the way.
By its nature, Uber’s business is highly real-time and contingent upon geospatial data. Petabytes of data are continuously being collected from our drivers, riders, restaurants, and eaters. Real-time analytics over this geospatial data could provide powerful insights.
In this blog, we will highlight the Orders near you feature from the Uber Eats app, illustrating one example of how Uber generates insights across our geospatial data.
Orders near you was a recent collaboration between the Data and Uber Eats teams at Uber. The project’s goal was to create an engaging and unique social experience for eaters. We hoped to inspire new food and restaurant discovery by showing what your neighbors are ordering right now. Since this feature is part of our home feed, we needed it to be fast, personalized, and scalable.