公司:Uber
优步(英语:Uber,/ˈuːbər/)是一间交通网络公司,总部位于美国加利福尼亚州旧金山,以开发移动应用程序连结乘客和司机,提供载客车辆租赁及媒合共乘的分享型经济服务。乘客可以透过应用程序来预约这些载客的车辆,并且追踪车辆的位置。营运据点分布在全球785个大都市。人们可以透过网站或是手机应用程序进入平台。
优步的名称大多认为是源自于德文über,和over是同源,意思是“在…上面”。 (页面存档备份,存于互联网档案馆)
然而其营业模式在部分地区面临法律问题,其非典型的经营模式在部分地区可能会有非法营运车辆的问题,有部分国家或地区已立法将之合法化,例如美国加州及中国北京及上海。原因在于优步是将出租车行业转型成社群平台,叫车的客户透过手机APP(应用程序),就能与欲兼职司机的优步用户和与有闲置车辆的租户间三者联系,一旦交易成功即按比例抽佣金、分成给予反馈等去监管化的金融手法。
2019年5月10日,优步公司透过公开分发股票成为上市公司,但首日即跌破分发价。
据估算,优步在全球有1.1亿活跃用户,在美国有69%的市占率。优步亦在大中华区开展业务,目前优步已在香港和台湾建成主流召车服务平台,并于中国大陆通过换股方式持有该市场最大网约车出行平台滴滴出行母公司小桔科技17.7%经济权益。
Simplifying Developer Testing Through SLATE
Modern-day technical system deployments generally follow SOA or a microservice-based architecture that allows for clearer separation of concerns, ownership, well-defined dependencies, and abstracts out a single unit of business logic.
Uber has thousands of services coordinating to power up the platform that drives the company at scale. To offer a great experience to customers, developers have to ensure that their service is meeting its functional requirements. Building confidence in meeting functional requirements requires testing of a service.
Introducing WorkflowGuard: The Workflow Governance and Observability System That Oversees over 120,000 Data Workflows
At Uber, over 120,000 production workflows are orchestrated, scheduled, and executed every day. These workflows are owned by over 3,000 users across many teams within Uber, powering critical ETL jobs, business metrics, dashboards, machine learning models, or critical regulatory reports. Internally, the Data Workflow Platform (DWP) team makes this happen by developing Uber’s centralized workflow management system with high infrastructure reliability and minimum scheduling latency. The workflow management system also comes with a user-friendly application that allows users to create, author, and manage both streaming and batch workflows in a self-serve way.
Reducing Logging Cost by Two Orders of Magnitude using CLP
Long, long ago, the amount of data our systems output to logs was small enough that we were able to retain all of the log files. This allowed our engineers to freely analyze the logs, say for troubleshooting our systems or improving applications. But as Uber’s business grew rapidly, the amount of data being logged increased dramatically. And so we were forced to discard log files after just a short period of time, given the prohibitive cost of retaining them–that is, until we integrated CLP into the logging library (Log4j) of our big data platform. In aggregate, CLP achieves a 169x compression ratio on our log data, saving storage, memory, and disk/network bandwidth at every level. As a result, we can now retain all logs at a fraction of the cost, without throwing away any insights, and the compressed logs can be efficiently searched without decompression.
Crane: Uber’s Next-Gen Infrastructure Stack
Uber has been on a multi-year journey to reimagine our infrastructure stack for a hybrid, multi-cloud world. The internal code name for this project is Crane. In this post we’ll examine the original motivation behind Crane, requirements we needed to satisfy, and some key features of our implementation. Finally, we’ll wrap up with some forward-looking views for Uber’s infrastructure.
Deduping and Storing Images at Uber Eats
The Uber Eats system handles several hundred million product images and millions of image updates are performed every hour. We have implemented a content-addressable caching layer that very effectively detects duplicates and thereby reduces download times, processing times, and storage costs. This full system was developed and completely rolled out in the course of less than 2 months, which improved the latency and reliability of the image service and unblocked projects on our new catalog API development.
MySQL to MyRocks Migration in Uber's Distributed Datastores
Uber uses MySQL as the underlying database engine for Schemaless and Docstore, our distributed databases. By default, MySQL uses the most popular InnoDB engine, a B+Tree structure for data storage. MyRocks is a MySQL storage engine that integrates with RocksDB, an open source project. The RocksDB store is based on the log-structured merge-tree (or LSM tree) and is optimized for fast storage and combines outstanding space and write efficiency with acceptable read performance.
The Uber Storage Platform team has migrated all Schemaless instances and some Docstore instances to MyRocks since 2019. In this post, we are going to talk about the journey of our migration to MyRocks.
Uber Freight Carrier Metrics with Near-Real-Time Analytics
While most people are familiar with Uber, not all are familiar with Uber Freight. Uber Freight has been around since 2016 and is dedicated to provide a platform to seamlessly connect shippers with carriers. We’re simplifying the lives of trucking companies by providing a platform for carriers to browse through all available shipment opportunities with upfront pricing and book with the tap of a button, and making the fulfillment process more scalable and efficient.
Providing reliable service to shippers is critical for Uber Freight in order to gain their trust. Because carriers’ performance could significantly impact reliability of Freight’s service, we need to be transparent with carriers about the level to which we are holding them accountable, providing them with a clear view of how well they are performing and, if needed, where they can improve.
To achieve this, Uber Freight developed the Carrier Scorecard to show the carriers several metrics, including engagement with the app, on-time pickup/delivery, tracking automation, and late cancellations. By showing this information in near real time on the Carrier App, we are able to provide feedback to carriers in real time and set ourselves apart from most of our competitors in the industry.
Uber’s Next Gen Push Platform on gRPC
In our last blog post we talked about how we went from polling for refreshing the app to a push-based flow to build our app experience.
All our apps need to be synced with real-time information, whether it’s through pickup time, arrival time, and route lines on the screen, or nearby drivers when you open the app. We use our push platform to deliver these messages that power the real-time user experiences as described in our previous post, which we strongly recommend that you review to learn about the details of the architecture before proceeding.
This blog post will cover how we changed our protocol from Server Sent Events (HTTP1.1) to gRPC-based bidirectional streaming (QUIC/HTTP3), the challenges we faced, the final results, and some key learnings.
ML Education at Uber: Program Design and Outcomes
If you have read our previous article, ML Education at Uber: Frameworks Inspired by Engineering Principles, you have seen several examples of how Uber benefits from applying Engineering Principles to drive the ML Education Program’s content design and program frameworks.
In this follow-up, we will dig deeper into what we believe to be other unique aspects of ML Education at Uber: our approach to Content Components, Content Delivery, Observability, and Marketing & Reach.
ML Education at Uber: Frameworks Inspired by Engineering Principles
At Uber, millions of machine learning (ML) predictions are made every second, and hundreds of applied scientists, engineers, product managers, and researchers work on ML solutions daily.
Uber wins by scaling machine learning. We recognize org-wide that a powerful way to scale machine learning adoption is by educating. That’s why we created the Machine Learning Education Program: a program driven by Engineering Principles that provides a framework for delivering Uber-specific ML educational resources to Uber Tech employees.
Like a production system, education resources, contents, and distribution channels need to be continuously measured, evaluated, and improved. Ensuring each component of the ML Education Program is designed on this premise enabled us to quickly deliver new courses and curriculum that are tailored to engineers and scientists of various backgrounds.
This 2-part article will focus on how we have applied engineering principles when designing and scaling this program, and how it has helped us achieve the desired outcome. Part 1 will introduce our design principles and explain the benefits of applying these principles to technical education content design and program frameworks, specifically in the ML domain. Part 2 will take a closer look at critical components of the program and reflect on the outcomes that make ML Education at Uber a success.
Supercharging A/B Testing at Uber
In early 2020, we took a deeper look at this ecosystem. We discovered that a large percentage of the experiments had fatal problems and often needed to be rerun. Obtaining high-quality results required an expert-level understanding of experimentation and statistics, and an inordinate amount of toil (custom analysis, pipelining, etc.). This slowed down decision-making, and re-running poorly conducted experiments was common.
After assessing the customer problems and the internals of Morpheus we concluded that the core abstractions supported only a very narrow set of experiment designs correctly, and even minute deviations from such designs resulted in incomparable cohorts of users in control and treatment and compromised experiment results. To give a very simple example, while gradually rolling out an experiment that was split 30/70 between control and treatment, due to peculiarities of rollout and treatment assignment logic it would be ok to roll it out to 10% of users but not to 5%. Furthermore, the system was not able to support advanced experiment configurations needed to support Uber’s diverse use cases, or other advanced functionality at scale such as monitoring/rolling back experiments that were negatively impacting business metrics. So we decided to build a new platform from scratch with correct abstractions.
Vertical CPU Scaling: Reduce Cost of Capacity and Increase Reliability
This blog post describes the implementation of an automated vertical CPU scaling system in which every storage workload running at Uber is allocated the ideal amount of cores. The framework is used today to right-size more than 500,000 Docker containers, and since its inception it has applied a net reduction of allocations of more than 120,000 cores, leading to annual multi-million dollar savings in infrastructure spending.
Uber’s Highly Scalable and Distributed Shuffle as a Service
Uber is a data-driven company that heavily relies on offline and online analytics for decision-making. As Uber’s data grows exponentially every year, it’s crucial to process this data very efficiently and with minimum cost. Over the years, Apache Spark™ has become the primary compute engine at Uber to satisfy such data needs. Spark empowers many business-critical use cases at Uber with its unique features, including Uber rides, Uber Eats, autonomous vehicles, ETAs, Maps, and many more. Spark is the primary engine for data warehousing, data science, and AI/ML. In the last few years, Uber’s Spark usage has grown exponentially year over year, running on more than 10,000 nodes in production. Spark jobs now account for more than 95% of analytics cluster compute resources which process hundreds of petabytes of data every day.
Introducing Shadower: A Minimalistic Load Testing Tool
Shadower is a load testing tool that allows us to provide load testing as a service to any microservice at Uber.
Shadower started as a command line application that allowed us to read a local file to load test a local application. At the time, Maps PEs were heavily investing in Java GC tuning. We needed Shadower to be able to do request mirroring to make sure two different applications get about the same load and different types of loads (test multiple endpoints). In summary, we were starting the same application twice: once with the production configuration and once with the new GC tuning. How’s this different from testing on staging? There are a few differences:
- Velocity: Testing on staging means pushing code to a branch, generating a build, and deploying it. This can take ~10 minutes compared to seconds locally, because locally we don’t need to generate a new build for configuration changes.
- Complexity: Our current infrastructure requires extra steps to allow multiple builds to run concurrently in the same environment, the reason for this is to reduce inconsistencies.
- Reliability: The current tools didn’t provide request mirroring, so we are at the mercy of the load balancing algorithm and how well randomized the requests are that we are using for our load test.
- Ease of use: The current tools needed developers to write code to be able to onboard a new service/endpoint.
Shadower stayed as a command line application for about a year until we built Ballast. Ballast required a reliable and easy-to-manage load test generator. Ballast is in charge of tracking resources utilization and deciding what is the right load for any load test. Shadower only focuses on executing those load tests.
How We Halved Go Monorepo CI Build Time
Before 2021, Uber engineers would have to take quite a taxing journey to make a code change to the Go Monorepo. First, the engineer would make their changes on a local branch and put up a code revision to our internal code review system, Phabricator. Next, our infrastructure would see the request and initiate a number of validation jobs on our CI. Those jobs would run build and test validation using the Bazel™ build system, check the coverage, do some other work, and report back to the user a red light (i.e., tests failed or some other issues) or green light. Next, the user, after seeing the “green light,” would get their code reviewed and then initiate a “land” request to the Submit Queue. The queue, after receiving their request, would patch their changes on the latest HEAD of the main branch and re-run these associated builds and tests to make sure their change would be valid at the current state of the repository. If everything looked good, the changes would be pushed and the revision would be closed.
This sounds pretty easy, right? Make sure everything is green, reviewed, and then let the queue do the work to push your change!
Well… what if the change is to a fundamental and widely used library? All packages depending on the library will need to be validated, even though the change itself is only a few lines of code. This might slow down the “build and test” part. Validating this change could sometimes take several hours. Internally, we call such changes big changes.
Enabling Offline Inferences at Uber Scale
At Uber we use data from user support interactions to identify gaps in our products and create better, more delightful experiences for our users. Support interactions with customers include information about broken product experiences, any technical or operational issues faced, and even their general sentiment towards the product and company. Understanding the root cause of a broken product experience requires additional context, such as details of the trip or the order. For example, the root cause for a customer issue about a delayed order might be due to a bad route given to the courier. In this case, we would want to attribute the poor customer experience to courier routing errors so that the Maps team can fix the same.
Initially, we had manual agents review a statistically significant sample from resolved support interactions. They would manually verify and label the resolved support issues and assign root cause attribution to different categories and subcategories of issue types. We wanted to build a proof-of-concept (POC) that automates and scales this manual process by applying ML and NLP algorithms on the semi-structured or unstructured data from all support interactions, on a daily basis.
This article describes the approach we took and the end-to-end design of our data processing and ML pipelines for our POC, which optimized the ease of building and maintaining such high scale offline inference workflows by engineers and data scientists on the team.