公司:Uber
优步(英语:Uber,/ˈuːbər/)是一间交通网络公司,总部位于美国加利福尼亚州旧金山,以开发移动应用程序连结乘客和司机,提供载客车辆租赁及媒合共乘的分享型经济服务。乘客可以透过应用程序来预约这些载客的车辆,并且追踪车辆的位置。营运据点分布在全球785个大都市。人们可以透过网站或是手机应用程序进入平台。
优步的名称大多认为是源自于德文über,和over是同源,意思是“在…上面”。 (页面存档备份,存于互联网档案馆)
然而其营业模式在部分地区面临法律问题,其非典型的经营模式在部分地区可能会有非法营运车辆的问题,有部分国家或地区已立法将之合法化,例如美国加州及中国北京及上海。原因在于优步是将出租车行业转型成社群平台,叫车的客户透过手机APP(应用程序),就能与欲兼职司机的优步用户和与有闲置车辆的租户间三者联系,一旦交易成功即按比例抽佣金、分成给予反馈等去监管化的金融手法。
2019年5月10日,优步公司透过公开分发股票成为上市公司,但首日即跌破分发价。
据估算,优步在全球有1.1亿活跃用户,在美国有69%的市占率。优步亦在大中华区开展业务,目前优步已在香港和台湾建成主流召车服务平台,并于中国大陆通过换股方式持有该市场最大网约车出行平台滴滴出行母公司小桔科技17.7%经济权益。
Capacity Recommendation Engine: Throughput and Utilization Based Predictive Scaling
Capacity is a key component of reliability. Uber’s services require enough resources in order to handle daily peak traffic and to support our different kinds of business units. These services are deployed across different cloud platforms and data centers (“zones”). With manual capacity management, it often results in an over-provisioned capacity, which is insufficient for resource usage. Uber built an auto-scaling service, which is able to manage and adjust resources for thousands of micro services. Currently, our auto-scaling service is based on a pure utilization metric. We recently built a new system, Capacity Recommendation Engine (CRE), with a new algorithm that relies on throughput and utilization based scaling with machine learning modeling. The model provides us with the relationship between the golden signal metrics and service capacity. With reactive prediction, CRE helps us to estimate the zonal service capacity based on linear regression modeling and peak traffic estimation. Apart from capacity, the analysis report can also tell us different zonal service characteristics and performance regression. In this article, we will deep dive into CRE’s modeling and system architecture, and present some analysis of its results.
How We Saved 70K Cores Across 30 Mission-Critical Services (Large-Scale, Semi-Automated Go GC Tuning @Uber)
As part of Uber engineering’s wide efforts to reach profitability, recently our team was focused on reducing cost of compute capacity by improving efficiency. Some of the most impactful work was around GOGC optimization. In this blog we want to share our experience with a highly effective, low-risk, large-scale, semi-automated Go GC tuning mechanism.
Uber’s tech stack is composed of thousands of microservices, backed by a cloud-native, scheduler-based infrastructure. Most of these services are written in Go. Our team, Maps Production Engineering, has previously played an instrumental role in significantly improving the efficiency of multiple Java services by tuning GC. At the beginning of 2021, we explored the possibilities of having a similar impact on Go-based services. We ran several CPU profiles to assess the current state of affairs and we found that GC was the top CPU consumer for a vast majority of mission-critical services. Here is a representation of some CPU profiles where GC (identified by the runtime.scanobject method) is consuming a significant portion of allocated compute resources.
Cadence Multi-Tenant Task Processing
Cadence is a multi-tenant orchestration framework that helps developers at Uber to write fault-tolerant, long-running applications, also known as workflows. It scales horizontally to handle millions…
CRISP: Critical Path Analysis for Microservice Architectures
Uber’s backend is an exemplar of microservice architecture. Each microservice is a small, individually deployable program performing a specific business logic (operation). The microservice architecture is a type of distributed computing system, which is suitable for independent deployments and scaling of software programs, and so is widely used across modern service-oriented industries. Uber has a few thousand microservices interacting with one another via remote procedure calls (RPC).
A service request arriving at an entry point (aka end-point) to the Uber backend systems undergoes multiple “hops” through numerous microservice operations before being fully serviced. The life of a request results in complex microservice interactions. These interactions are deeply nested, asynchronous, and invoke numerous other downstream operations. As a result of this complexity, it is very hard to identify which underlying service(s) contribute to the overall end-to-end latency experienced by a top-level request. Answering this question is critical in many situations, for example:
- Identifying optimization opportunities for a top-level microservice
- Identifying common bottleneck operations affecting many services
- Setting appropriate time-to-live values for downstream RPC calls
- Diagnosing outages and error conditions
- Capacity planning and reduction
While latency is one of the metrics of interest, other metrics such as time-to-live, error rates, etc., also fall in the scope.
How Uber Migrated Financial Data from DynamoDB to Docstore
Each day, Uber moves millions of people around the world and delivers tens of millions of food and grocery orders. This generates a large number of financial transactions that need to be stored with provable completeness, consistency, and compliance.
LedgerStore is an immutable, ledger-style database storing business transactions. LedgerStore provides signing/sealing of data to guarantee data completeness/correctness, strongly consistent indexes, and automatic data tiering. LedgerStore uses DynamoDB as its storage backend. Running LedgerStore in production for almost 2 years at Uber scale, we’d amassed a large amount of data as trips and orders volume grew. Over this period of time we realized that operating LedgerStore with DynamoDB as a backend was becoming expensive. Also having different databases in our portfolio creates fragmentation and makes it difficult to operate.
Having first-hand experience building large scale storage systems at Uber, we decided to change the LedgerStore backend to be one of our in-house, homegrown databases. The 2 main principles we kept in mind were: 1) Efficiency, and 2) Technology consolidation. Following the first would yield us great results in the short term, while the second would put us on a solid long-term roadmap, having greater flexibility and operational ease.
In this post today we are going to talk about rearchitecting some of the core components of LedgerStore on top of Docstore, Uber’s general-purpose multi-model database.
Introducing uGroup: Uber’s Consumer Management Framework
Apache Kafka® is widely used across Uber’s multiple business lines. Take the example of an Uber ride: When a user opens up the Uber app, demand and supply data are aggregated in Kafka queues to serve fare calculations. When a ride request is accepted by a driver, push notifications in Kafka queue are sent to mobile devices. After a ride is finished, post-trip processing, including payment and receipt sending, leverages Kafka. During the entire operation, the data and messages flowing between services are also ingested into Apache Hive™ for data analytic purposes. In a word, Apache Kafka is a critical service that empowers Uber’s business.
Given its high popularity, we are operating large scale Kafka clusters across multiple regions. We started our Kafka journey in early 2015 with a few-node Kafka cluster in one region. With the tremendous growth of Uber’s business and expansion of Kafka usages, we ran into scaling and operational issues, and got many interesting user requests from customers.
One of the most common issues we have run into is how to efficiently monitor the state of a large number of consumers. Having evaluated many open source solutions, with the large scale and unique setup, we finally decided to build a new observability framework for monitoring the state of Kafka consumers. Today, we are delighted to introduce uGroup, our internal Kafka consumer monitoring service.
Improving HDFS I/O Utilization for Efficiency
Scaling our data infrastructure with lower hardware costs while maintaining high performance and service reliability has been no easy feat. To accommodate the exponential growth in both Data Storage and Analytics Compute at Uber, the Data Infrastructure team massively overhauled its approach in scaling the Apache Hadoop® Data File System (HDFS) by re-architecting the software layer in conjunction with a hardware redesign:
- HDFS Federation, Warm Storage, YARN co-location on HDFS data nodes and increased YARN utilization improved the systems’ CPU & Memory usage efficiency
- Combining multiple Hardware server designs (24 x 2TB HDD, 24 x 4TB HDD, 35 x 8TB HDD) into a unified design of 35 x 16TB HDD for 30% Hardware cost reduction
Building Uber’s Fulfillment Platform for Planet-Scale using Google Cloud Spanner
The Fulfillment Platform is a foundational Uber domain that enables the rapid scaling of new verticals. The platform handles billions of database transactions each day, ranging from user actions (e.g., a driver starting a trip) and system actions (e.g., creating an offer to match a trip with a driver) to periodic location updates (e.g., recalculating eligible products for a driver when their location changes). The platform handles millions of concurrent users and billions of trips per month across over ten thousand cities and billions of database transactions a day.
In the previous article, we introduced the Fulfillment domain, highlighted challenges in the previous architecture, and outlined the new architecture.
When designing the new architecture, we converged on leveraging Google’s Cloud Spanner, a NewSQL storage engine to satisfy the requirements of transactional consistency, horizontal scalability, and low operational overhead. This article describes how we leveraged Cloud Spanner for planet-scale architecture without sacrificing consistency guarantees and with low operational overhead.
Real-Time Exactly-Once Ad Event Processing with Apache Flink and Kafka
Uber recently launched a new capability: Ads on UberEats. With this new ability came new challenges that needed to be solved at Uber, such as systems for ad auctions, bidding, attribution, reporting, and more. This article focuses on how we leveraged open source technology to build Uber’s first “near real-time” exactly-once events processing system. We’ll dive into the details of how we achieved exactly-once processing as well as the inner workings of our event processing jobs.
YAML Generator for Funnel YAML Files: Streamlining the Mobile Data Workflow Process
At Uber, real-time mobile analytics events—generated by button taps, page views, and more—form the backbone of the mobile data workflow process.
To process these events, our Mobile Data Platform Team designed and developed the Fontana library, which converts the nearly-one-million-QPS (queries per second) volume of events into easily digestible and useful analytics for Uber engineers. As part of this process, funnel YAML files are key config files that are used to define sequences of events for analysis. To this end, our team has also designed and developed the SuperFlurry application, which aids in creating and managing these files.
However, SuperFlurry still required users to create and modify funnel YAML files by editing raw YAML files directly. This was a significant pain point for Uber engineers and PMs aiming to set up their own funnels, especially for those unfamiliar with the format or the specific structure of funnel YAML files, as subtle syntactical errors are easy to make and switching back and forth between file creation and documentation is time-consuming. To remedy this issue, we designed and developed YAML Generator, an application that provides a comprehensive set of options for creating a funnel YAML file alongside a clean and intuitive UI.
The newly developed YAML Generator application builds onto the SuperFlurry application, streamlining the creation of these funnel YAML files. Through the development of this application, the importance of simplifying user-side input and the importance of continuous feedback are highlighted.
Jellyfish: Cost-Effective Data Tiering for Uber’s Largest Storage System
Uber deploys a few storage technologies to store business data based on their application model. One such technology is called Schemaless, which enables the modeling of related entries in one single row of multiple columns, as well as versioning per column.
Schemaless has been around for a couple of years, amassing Uber’s data. While Uber is consolidating all the use cases on Docstore, Schemaless is still the source of truth for different pre-existing customer pipelines. As such, Schemaless uses fast (but expensive) underlying storage technology to enable millisecond-order latency at high QPS. Furthermore, Schemaless deploys a number of replicas per region to ensure data durability and availability in the face of different failure models.
Accumulating more data while using expensive storage, Schemaless has increasingly become a key concern for cost and thus required attention. To this end, we carried out measurements for understanding data access patterns. We found that data is frequently accessed for a period of time, after which it is accessed less frequently. The exact period varies from one use case to another, however, old data must still be readily available upon request.
Streaming Real-Time Analytics with Redis, AWS Fargate, and Dash Framework
Uber’s GSS (Global Scaled Solutions) team runs scaled programs for diverse products and businesses, including but not limited to Eats, Rides, and Freight. The team transforms Uber’s ideas into agile, global solutions by designing and implementing scalable solutions. One of the areas of expertise within GSS is the Digitization vertical. The Digitization team efficiently converts physical signals into digital assets and provides services in labeling, in-field testing, data curation and validations for maps, product incubation, freight BOL (bill of lading), Eats menu uploads, etc.
All these digitization services are performed by thousands of humans (operators) working on our internal applications across many locations around the globe. While an operator is digitizing data, our backend collects a clickstream of all the user interactions in the form of raw events to the scale of 10 million events per day in AWS (Amazon Web Services) cloud infrastructure. Sometimes this data is also moved to Uber’s own data centers. Our data analytics team performs analysis on this data to improve/tweak the process, augment tooling infrastructure, address operator motivation, and improve operator skills. Analytics is usually performed by querying big data lakes and using different frontend tools for visualisation. Generally, any analytics setup has a latency (source to user) component to it and the latency of our existing (pre-COVID) infrastructure was 1 hour. With the onset of COVID-19 crisis, the digitization process had to be transitioned to work-from-home mode, leading to additional operational complexity of remotely managing a huge workforce of operators. This complexity created a gap in team’s communication, decision making, and collaboration. Where 1-hour latency of our analytics platform was previously acceptable, real-time analytics was needed to fill this gap. This blog describes how we improved latency of our data architecture by building a real-time analytics system.
While we researched approaches used for building real-time dashboards (example), we did not find an end-to-end solution, considering how rich visualization can be achieved at lower cost. We considered different visualization approaches and also looked at commercial solutions to come up with our choices. Another differentiating aspect was that our solution also addresses the need for a “single source of truth” on Amazon S3 (Amazon’s “simple storage service”), from which both streaming and batch processed dashboards would to be sourced, rather than hooking directly into the Amazon Kinesis Data Firehose stream itself. This intermediate storage lets us recover data (for the streaming window) with a replay. We production tested our visualizations with thousands of users for low load times and reliability.
Enabling Seamless Kafka Async Queuing with Consumer Proxy
Uber has one of the largest deployments of Apache Kafka in the world, processing trillions of messages and multiple petabytes of data per day. As Figure 1 shows, today we position Apache Kafka as a cornerstone of our technology stack. It empowers a large number of different workflows, including pub-sub message buses for passing event data from the rider and driver apps, streaming analytics (e.g., Apache Flink, Apache Samza), streaming database changelogs to the downstream subscribers, and ingesting all sorts of data into Uber’s Apache Hadoop data lake.
How Data Shapes the Uber Rider App
Data is crucial for our products. Data analytics help us provide a frictionless experience to the people that use our services. It also enables our engineers, product managers, data analysts, and data scientists to make informed decisions. The impact of data analysis can be seen in every screen of our app: what is displayed on the home screen, the order in which products are shown, what relevant messages are shown to users, what is stopping users from taking rides or signing up, and so on.
With such a huge user base and wide range of features, support across all geographic regions is a complicated problem to solve. Furthermore, our app keeps expanding with new products, which mandates that the underlying tech also be flexible enough to evolve and support them.
Data is the primary tool enabling this. The following article will focus on rider data in particular: how we collect and process it, and how that has informed concrete improvements to the Rider app.
Powering the Network Pricing Model with Near Real-Time Features
Drivers within the same area may have quite different earnings, depending on the trips they take. For example, consider two hypothetical drivers in downtown San Francisco. Two riders request two rides: one is within downtown San Francisco, and the other is to Oakland, as shown in the image above. The distances for the two trips are similar. If we just price the trip based on distance, they will make the same amount of money for the current trip, while the driver going to Oakland will be less likely to get more trips there. Drivers tend to reject these trips if they have other choices. To reduce the variance of earnings for colocated drivers and the cancellation rate for trips going to non-busy areas, we price these trips differently, based on the network effect.
Both the rider and driver pricing flows are being changed to compute network adjustments in real time. Both these pricing systems receive adjustments based on a common network model, which returns the relative change in GB (Gross Bookings) of enabling a specific trip, compared with an average trip from that same origin.
The network model used requires some NRT (Near Real-Time) features. In this document, we will introduce some of the challenges we faced and how we solved them when building the real-time pipelines for computing and serving these features to online models.
Unifying Support Content to Enable More Empathetic and Personalized Customer Support Experiences
Content quality is critical to the support experienced by Uber’s customers. Consider an Eater who reached out for help to cancel a very delayed order. The same resolution, such as refunding the charge, can be delivered alongside a robotic-sounding message, or one where the style and tone of the response conveys true empathy and acknowledges the user’s poor experience on our platform.
As the natural expression of the support experience, and a bearer of brand promise, support content affects how people feel, and plays a major part in how they perceive the Uber brand. In addition, support content plays the role of educating users about the product behavior and about our policies while moving them to action. Finally, support content (such as knowledge base articles) also serves to deflect commonly asked questions and reduce the number of contacts handled by our agents. In short, support content is what a disgruntled customer first sees, and hence it is imperative that this content can placate and soothe the customer, whilst also resolving the core issue, transforming customer ire into customer delight.
Uber’s Customer Care platform currently supports content across different business verticals including Uber Mobility (Rider, Driver), Uber Delivery (Eater, Courier, and Merchants), Uber For Business (Organizations and Employees), Uber Freight (Carrier, Shipper), etc.