公司:Uber
优步(英语:Uber,/ˈuːbər/)是一间交通网络公司,总部位于美国加利福尼亚州旧金山,以开发移动应用程序连结乘客和司机,提供载客车辆租赁及媒合共乘的分享型经济服务。乘客可以透过应用程序来预约这些载客的车辆,并且追踪车辆的位置。营运据点分布在全球785个大都市。人们可以透过网站或是手机应用程序进入平台。
优步的名称大多认为是源自于德文über,和over是同源,意思是“在…上面”。 (页面存档备份,存于互联网档案馆)
然而其营业模式在部分地区面临法律问题,其非典型的经营模式在部分地区可能会有非法营运车辆的问题,有部分国家或地区已立法将之合法化,例如美国加州及中国北京及上海。原因在于优步是将出租车行业转型成社群平台,叫车的客户透过手机APP(应用程序),就能与欲兼职司机的优步用户和与有闲置车辆的租户间三者联系,一旦交易成功即按比例抽佣金、分成给予反馈等去监管化的金融手法。
2019年5月10日,优步公司透过公开分发股票成为上市公司,但首日即跌破分发价。
据估算,优步在全球有1.1亿活跃用户,在美国有69%的市占率。优步亦在大中华区开展业务,目前优步已在香港和台湾建成主流召车服务平台,并于中国大陆通过换股方式持有该市场最大网约车出行平台滴滴出行母公司小桔科技17.7%经济权益。
Data Race Patterns in Go
Uber has adopted Golang (Go for short) as a primary programming language for developing microservices. Our Go monorepo consists of about 50 million lines of code (and growing) and contains approximately 2,100 unique Go services (and growing).
Go makes concurrency a first-class citizen; prefixing function calls with the go keyword runs the call asynchronously. These asynchronous function calls in Go are called goroutines. Developers hide latency (e.g., IO or RPC calls to other services) by creating goroutines. Two or more goroutines can communicate data either via message passing (channels) or shared memory. Shared memory happens to be the most commonly used means of data communication in Go.
USL – Uber’s Unified Signup and Login Stack
Uber has operations in over 10,000 cities worldwide and its services include ridesharing, food delivery, package delivery, couriers, freight transportation, electric bicycle and motorized scooter rental, and ferry transport.
Every year we have millions of users going through signup and login on our various apps. Over the years we’ve built independent signup and login experiences for each of our lines of business which allowed us to innovate and move a lot quicker. However, as we scaled and added additional lines of business, our experiences began to diverge leading to some of these inconsistencies being amplified. USL (unified signup and login) enables the vision of “One Uber Identity” through a unified signup and login experience across all Uber apps.
Better Load Balancing: Real-Time Dynamic Subsetting
Subsetting is a common technique used in load balancing for large-scale distributed systems. In this blog post, we will briefly introduce Uber’s current service mesh architecture that has been powering thousands of critical microservices in Uber since 2016. We will then discuss the challenges we faced when trying to scale the number of tasks in the mesh and issues with our initial subsetting approach. We’ll finish with how we came up with the real-time dynamic subsetting solution and its results in production.
Dynamic Data Race Detection in Go Code
Uber has extensively adopted Go as a primary programming language for developing microservices. Our Go monorepo consists of about 50 million lines of code and contains approximately 2,100 unique Go services. Go makes concurrency a first-class citizen; prefixing function calls with the go keyword runs the call asynchronously. These asynchronous function calls in Go are called goroutines. Developers hide latency (e.g., IO or RPC calls to other services) by creating goroutines within a single running Go program. Goroutines are considered “lightweight”, and the Go runtime context switches them on the operating-system (OS) threads. Go programmers often use goroutines liberally. Two or more goroutines can communicate data either via message passing (channels) or shared memory. Shared memory happens to be the most commonly used means of data communication in Go.
A data race occurs in Go when two or more goroutines access the same memory location, at least one of them is a write, and there is no ordering between them, as defined by the Go memory model. Outages caused by data races in Go programs are a recurring and painful problem in our microservices. These issues have brought down our critical, customer-facing services for hours in total, causing inconvenience to our customers and impacting our revenue. In this blog, we discuss deploying Go’s default dynamic race detector to continuously detect data races in our Go development environment. This deployment has enabled detection of more than 2,000 races resulting in ~1,000 data races fixed by more than two hundred engineers.
Presto® on Apache Kafka® At Uber Scale
Uber’s goal is to ignite opportunity by setting the world in motion, and big data is a very important part of that. Presto® and Apache Kafka® play critical roles in Uber’s big data stack. Presto is the de facto standard for query federation that has been used for interactive queries, near-real-time data analysis, and large-scale data analysis. Kafka is the backbone for data streaming that supports many use cases such as pub/sub, streaming processing, etc. In the following article we will discuss how we have connected these two important services together to enable a lightweight, interactive SQL query directly over Kafka via Presto at Uber scale.
Securing Kafka® Infrastructure at Uber
Uber has one of the largest deployments of Apache Kafka® in the world. It empowers a large number of real-time workflows at Uber, including pub-sub message buses for passing event data from the rider and driver apps, as well as financial transaction events between the backend services. As Kafka forms a critical component of Uber’s core workflows, it is important to secure the data being published and subscribed from the topics to maintain the integrity of the data and to provide an access control mechanism for who can publish/subscribe to a given topic.
Uber’s Emergency Button and The Technologies Behind It
Safety has long been a top priority at Uber, as Uber’s CEO Dara Khosrowshahi wrote in ‘Raising the Bar on Safety’ in September 2018. In order to #StandForSafety, the team at Uber has rolled out a set of features, such as Safety Center, Trusted Contacts, and the in-app Emergency Button, among others.
The first version of Emergency Button was rolled out in India in 2015. The original system allowed riders and drivers to contact the local police authority while remaining inside the app, and automatically alerted regional support teams to proactively reach out to the user. In 2018, the team took the opportunity to improve the system with enhanced features, such as surfacing live location information in the app, sharing trip details with authorities, and making Emergency Button available on both the Rider and Driver apps in global markets. Since then, the team has made additional enhancements in select markets by introducing the ability for users to discreetly text local police instead of a call-only option. Also, in select markets we have launched an interactive voice response solution to provide follow-up to our users.
Avoiding CPU Throttling in a Containerized Environment
At Uber, all stateful workloads run on a common containerized platform across a large fleet of hosts. Stateful workloads include MySQL®, Apache Cassandra®, ElasticSearch®, Apache Kafka®, Apache HDFS™, Redis™, Docstore, Schemaless, etc., and in many cases these workloads are co-located on the same physical hosts.
With 65,000 physical hosts, 2.4 million cores, and 200,000 containers, increasing utilization to reduce cost is an important and continuous effort. Until recently efforts were blocked due to CPU throttling, which indicates that not enough resources have been allocated.
It turned out that the issue was how the Linux kernel allocates time for processes to run. In this post we will describe how switching from CPU quotas to cpusets (also known as CPU pinning) allowed us to trade a slight increase in P50 latencies for a significant drop in P99 latencies. This in turn allowed us to reduce fleet-wide core allocation by up to 11% due to less variance in resource requirements.
One Stone, Three Birds: Finer-Grained Encryption @ Apache Parquet™
Data access restrictions, retention, and encryption at rest are fundamental security controls. This blog explains how we have built and utilized open-sourced Apache Parquet™’s finer-grained encryption feature to support all 3 controls in a unified way. In particular, we will focus on the technical challenges of designing and applying encryption in a secure, reliable, and efficient manner. We will also share our experiences with recommended practices to manage the system in production and at scale.
Introducing Ballast: An Adaptive Load Test Framework
As Uber’s architecture has grown to encompass thousands of interdependent microservices, we need to test our mission-critical components at max load in order to preserve reliability. Accurate load testing allows us to validate if a set of services are working at peak usage and optimal efficiency while retaining reliability.
Load testing those services within a short time frame comes with its unique set of challenges. Most of these load tests historically involved writing, running, and supervising tests manually. Moreover, the degree to which tests accurately represent production traffic patterns gradually decreases over time as traffic organically evolves, imposing a long-term maintenance burden. The scope of the load testing effort continuously increases as the number of services grows, incurring a hidden cost to adding new features.
Introducing Carbon Feed for Earners: The One-Stop Info Shop
After launching the Driver App in 2018 to over 2 million earners worldwide, we added content and functionality at a rapid pace. Although this really bolstered the platform, allowing for high-density and high-frequency content, and provided drivers and couriers with new ways to earn, it came with significant costs. For earners, it became increasingly difficult to find the information that they needed to make key decisions about work at any given time. For internal teams within Uber trying to emit information, there was pressure to be loud across multiple channels in order to reach earners effectively; teams had to, in essence, compete against one another to ensure that their intended messages reached the right audiences. As a result, collective goals were inadvertently superseded by individual teams lacking context.
We created Carbon Feed to serve these 2 primary stakeholders: For earners, Feed should be the one-stop shop for all the information they need to make effective decisions. Having a single, unified surface where earners can compare their different options at a given time and assess the tradeoffs would be inherently valuable. For internal teams, Feed should enable them to easily convey useful information to earners through a visible, central surface. Making the process seamless by requiring teams to only go through a single, streamlined integration would help them communicate more effectively.
DeepETA: How Uber Predicts Arrival Times Using Deep Learning
At Uber, magical customer experiences depend on accurate arrival time predictions (ETAs). We use ETAs to calculate fares, estimate pickup times, match riders to drivers, plan deliveries, and more. Traditional routing engines compute ETAs by dividing up the road network into small road segments represented by weighted edges in a graph. They use shortest-path algorithms to find the best path through the graph and add up the weights to derive an ETA. But as we all know, the map is not the terrain: a road graph is just a model, and it can’t perfectly capture conditions on the ground. Moreover, we may not know which route a particular rider and driver will choose to their destination. By training machine learning (ML) models on top of the road graph prediction using historical data in combination with real-time signals, we can refine ETAs that better predict real-world outcomes.
Project RADAR: Intelligent Early Fraud Detection System with Humans in the Loop
Uber is a worldwide marketplace of services, processing thousands of monetary transactions every second. As a marketplace, Uber takes on all of the risks associated with payment processing. Uber partners who use the marketplace to provide services are paid for their work even if Uber was unable to collect the payment. Fraud response is thus a very important operational component of Uber’s global marketplace.
Industry-wide, payment fraud losses are measured in terms of the fraction of gross amounts processed. Though only a small fraction of gross bookings, these losses impact profits significantly. Furthermore, if a fraudulent activity is not discovered and mitigated immediately, it could soon be further exploited, resulting in serious losses for the company. These dynamics make early fraud detection vital to the company’s financial health.
Modern fraud detection systems are a combination of classic 1980s AI (also known as an “expert system”) and modern machine learning. We would like to share the journey on how we build the best-in-class automatic fraud detection system and process, leveraging both machine algorithms and human knowledge.
Cost Efficiency @ Scale in Big Data File Format
Our Apache Hadoop® based data platform ingests hundreds of petabytes of analytical data with minimum latency and stores it in a data lake built on top of the Hadoop Distributed File System (HDFS). We use Apache Hudi™ as our ingestion table format and Apache Parquet™ as the underlying file format. Our data platform leverages Apache Hive™, Apache Presto™, and Apache Spark™ for both interactive and long-running queries, serving the myriad needs of different teams at Uber.
Uber’s growth over the last few years exponentially increased both the volume of data and the associated access loads required to process it. As data volume grows, so do the associated storage and compute costs, resulting in growing hardware purchasing requirements, higher resource usage, and even causing out-of-memory (OOM) or high GC pause. The main goal of this blog is to address storage cost efficiency issues, but the side benefits also include CPU, IO, and network consumption usage.
We started several initiatives to reduce storage cost, including setting TTL (Time to Live) to old partitions, moving data from hot/warm to cold storage, and reducing data size in the file format level. In this blog, we will focus on reducing the data size in storage at the file format level, essentially at Parquet.
Capacity Recommendation Engine: Throughput and Utilization Based Predictive Scaling
Capacity is a key component of reliability. Uber’s services require enough resources in order to handle daily peak traffic and to support our different kinds of business units. These services are deployed across different cloud platforms and data centers (“zones”). With manual capacity management, it often results in an over-provisioned capacity, which is insufficient for resource usage. Uber built an auto-scaling service, which is able to manage and adjust resources for thousands of micro services. Currently, our auto-scaling service is based on a pure utilization metric. We recently built a new system, Capacity Recommendation Engine (CRE), with a new algorithm that relies on throughput and utilization based scaling with machine learning modeling. The model provides us with the relationship between the golden signal metrics and service capacity. With reactive prediction, CRE helps us to estimate the zonal service capacity based on linear regression modeling and peak traffic estimation. Apart from capacity, the analysis report can also tell us different zonal service characteristics and performance regression. In this article, we will deep dive into CRE’s modeling and system architecture, and present some analysis of its results.
How We Saved 70K Cores Across 30 Mission-Critical Services (Large-Scale, Semi-Automated Go GC Tuning @Uber)
As part of Uber engineering’s wide efforts to reach profitability, recently our team was focused on reducing cost of compute capacity by improving efficiency. Some of the most impactful work was around GOGC optimization. In this blog we want to share our experience with a highly effective, low-risk, large-scale, semi-automated Go GC tuning mechanism.
Uber’s tech stack is composed of thousands of microservices, backed by a cloud-native, scheduler-based infrastructure. Most of these services are written in Go. Our team, Maps Production Engineering, has previously played an instrumental role in significantly improving the efficiency of multiple Java services by tuning GC. At the beginning of 2021, we explored the possibilities of having a similar impact on Go-based services. We ran several CPU profiles to assess the current state of affairs and we found that GC was the top CPU consumer for a vast majority of mission-critical services. Here is a representation of some CPU profiles where GC (identified by the runtime.scanobject method) is consuming a significant portion of allocated compute resources.