Containerizing Apache Hadoop Infrastructure at Uber

In 2019, we started a journey to re-architect the Hadoop deployment stack. Fast forward two years, and over 60% of Hadoop runs in Docker containers, bringing major operational benefits to the team. As a result of the initiative, the team handed off many of its responsibilities to other infrastructure teams and was able to focus more on core Hadoop development.

This article provides a summary of problems we faced, and how we solved them along the way.

‘Orders Near You’ and User-Facing Analytics on Real-Time Geospatial Data

By its nature, Uber’s business is highly real-time and contingent upon geospatial data. Petabytes of data are continuously being collected from our drivers, riders, restaurants, and eaters. Real-time analytics over this geospatial data could provide powerful insights.

In this blog, we will highlight the Orders near you feature from the Uber Eats app, illustrating one example of how Uber generates insights across our geospatial data.

Orders near you was a recent collaboration between the Data and Uber Eats teams at Uber. The project’s goal was to create an engaging and unique social experience for eaters. We hoped to inspire new food and restaurant discovery by showing what your neighbors are ordering right now. Since this feature is part of our home feed, we needed it to be fast, personalized, and scalable.

Analyzing Customer Issues to Improve User Experience

The primary goal for customer support is to ensure users’ issues are addressed and resolved in a timely and effective manner. The kinds of issues users face and what they say in their support interactions provide a lot of information about the product experience, any technical or operational gaps, and even their general sentiment towards the product and the company. At Uber, we don’t stop at just resolving user issues. We also use the issues reported by customers to improve our support experience and our products. This article describes the technology that makes it happen.

Customer Support Automation Platform at Uber

If you’ve used any online/digital service, chances are that you are familiar with what a typical customer service experience entails: you send a message (usually email aliased) to the company’s support staff, fill out a form, expect some back and forth with a customer service representative (CSR), and hopefully have your issue resolved. This process can often feel inefficient and slow. Typically, this might be attributable to the tooling/processes made available to CSRs for solving your issue. For any given issue, the CSR has to navigate standard operating procedures (SOPs, a.k.a. flow) with proliferating undocumented branches/edge cases making their work mundane, tedious, and imprecise. The manual maintenance and navigation of these SOPs can create a bureaucratic bottleneck, which ultimately leaves the customer dissatisfied.

Tuning Model Performance

Uber uses machine learning (ML) models to power critical business decisions. An ML model goes through many experiment iterations before making it to production. During the experimentation phase, data scientists or machine learning engineers explore adding features, tuning parameters, and running offline analysis or backtesting. We enhanced the platform to reduce the human toil and time in this stage, while ensuring high model quality in production.

Elastic Distributed Training with XGBoost on Ray

In this blog, we discuss how moving to distributed XGBoost on Ray helps address these concerns and how finding the right abstractions allows us to seamlessly incorporate Ray and XGBoost Ray into Uber’s ML ecosystem. Finally, we cover how moving distributed XGBoost onto Ray, in parallel with efforts to move Elastic Horovod onto Ray, serves as a critical step towards a unified distributed compute backend for end-to-end machine learning workflows at Uber.

Continuous Integration and Deployment for Machine Learning Online Serving and Models

At Uber, we have witnessed a significant increase in machine learning adoption across various organizations and use cases over the last few years. Our machine learning models are empowering a better customer experience, helping prevent safety incidents, and ensuring market efficiency, all in real time. The figure above is a high-level view of CI/CD for models and service binary.

One thing to note is that we have continuous integration (CI)/continuous deployment (CD) for models and services, as shown above in Figure 1. We arrived at this solution after several iterations to address some of the MLOps challenges, as the number of models trained and deployed grew rapidly. The first challenge was to support a large volume of model deployments on a daily basis, while keeping the Real-time Prediction Service highly available. We will discuss our solution in the Model Deployment section.

Efficient and Reliable Compute Cluster Management at Scale

Uber relies on a containerized microservice architecture. Our need for computational resources has grown significantly over the years as a consequence of business growth, so increasing the efficiency of our computing resources is now an important goal. Broadly speaking, the efficiency efforts in compute cluster management involve scheduling more workloads on the same number of machines. This approach is based on the observation that the average CPU utilization of a typical cluster is far lower than the CPU resources that have been allocated to it. The approach we have adopted is to overcommit CPU resources without compromising the reliability of the platform, which is achieved by maintaining a safe headroom at all times. Another possible and complementary approach is to reduce the allocations of services that are overprovisioned, which we also do. The benefit of overcommitment is that we are able to free up machines that can be used to run non-critical, preemptible workloads, without purchasing extra machines.
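
The overcommitment idea can be illustrated with a minimal Python sketch. It is not Uber's scheduler: the `Host` fields, the `can_schedule` function, and the `overcommit_ratio`, `headroom`, and `expected_usage_fraction` parameters are all hypothetical names and values chosen for illustration. It only shows how a host could accept more allocated CPU than it physically has while keeping a fixed share of its cores free as headroom.

```python
from dataclasses import dataclass


@dataclass
class Host:
    cores: float             # physical CPU cores on the machine
    allocated: float         # cores already promised to running workloads
    peak_utilization: float  # observed peak usage in cores (hypothetical metric)


def can_schedule(host, request, overcommit_ratio=1.5, headroom=0.2,
                 expected_usage_fraction=0.5):
    """Return True if `request` cores can be placed on `host`.

    All thresholds are illustrative assumptions, not values from the article.
    """
    # Allocation may exceed the physical cores, up to the overcommit ratio.
    allocation_limit = host.cores * overcommit_ratio
    if host.allocated + request > allocation_limit:
        return False

    # Keep a safe headroom: projected real usage must stay below
    # (1 - headroom) of the physical cores.
    projected_usage = host.peak_utilization + request * expected_usage_fraction
    return projected_usage <= host.cores * (1 - headroom)


busy = Host(cores=32, allocated=40, peak_utilization=18)
print(can_schedule(busy, 4))   # fits under both limits
print(can_schedule(busy, 10))  # would exceed the allocation limit
```

In this sketch the host already has more cores allocated (40) than it physically has (32), yet a further small request is accepted because observed usage leaves enough headroom.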

Handling Flaky Unit Tests in Java

Unit testing forms the bedrock of any Continuous Integration (CI) system. It warns software engineers of bugs in newly-implemented code and regressions in existing code, before it is merged. This ensures increased software reliability. It also improves overall developer productivity, as bugs are caught early in the software development lifecycle. Hence, building a stable and reliable testing system is often a key requirement for software development organizations.

Scaling of Uber's API Gateway




The Architecture of Uber's API Gateway



pprof++: A Go Profiler with Hardware Performance Monitoring


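Although having a built-in Go profiler is better than having no profiler at all, as is the case in several other languages, the de facto CPU profiler in Go has serious limitations on Linux-based systems (and possibly on other operating systems as well), and it lacks many of the details [1, 2, 3, 4] needed to fully understand CPU bottlenecks.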

Automating Merchant Live Monitoring with Real-Time Analytics: Charon

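At Uber, real-time monitoring and automation of operations are critical to maintaining marketplace health, preserving reliability, and gaining marketplace efficiency. As the term "real-time" implies, this monitoring needs to show what is happening right now, ingest fresh data promptly, and be able to suggest appropriate actions based on that data. Uber's data platform provides self-service tools that enable operations teams to build their own real-time monitoring tools and to support their regional teams by building rich solutions.

Launched in early 2020, Charon proved to be very effective, enabling restaurants and other merchants that had closed during the COVID-19 lockdowns to offer delivery-only service. In this article, we use Charon's COVID-19 use case to illustrate how Uber's data platform empowers teams to build and adapt faster.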

Optimal Feature Discovery: Better, Leaner Machine Learning Models Through Information Theory




Freight Pricing with a Controlled Markov Decision Process

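Uber Freight launched in 2017 to revolutionize the business of matching shippers and carriers in the huge and inefficient freight trucking industry (about $800 billion in annual spending in the US). We believe, and have demonstrated, that a technology-first freight broker and marketplace can provide better opportunities for carriers and superior outcomes for shippers and communities.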

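One wasteful process that we want to eliminate through technology is the lengthy haggling between traditional freight brokers and carriers over the price of a load (a freight term). This practice stems from the opacity of freight prices and of carriers' willingness to pay. Inspired by the role that pricing innovation played in Uber's massive growth, we decided to become the first freight broker to offer transparent, dynamic carrier pricing, "clearing the market" through advanced algorithms rather than through old-fashioned haggling, which wastes time and drains liquidity from the market.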


Flipr: Making Changes Quickly and Safely at Scale


