With the evolution of storage table formats Apache Hudi®, Apache Iceberg®, and Delta Lake™, more and more companies are building up their lakehouse on top of these formats for many use cases, like incremental ingestion. But the speed of upserts sometimes is still a problem when the data volumes go up.
In storage tables, Apache Parquet is used as the main file format. In this article, we will discuss how we built a row-level secondary index and the innovations we introduced in Apache Parquet to speed up the upsert data inside a Parquet file. We will also demonstrate benchmarking results that show much faster speeds than traditional copy-on-write in Delta Lake and Hudi.
Uber's Financial Computation Service processes various types of money movements from multiple upstreams, generating journal entries that can be linked to actual transactions for accuracy. A rulePlan, a directed tree JSON representation, guides execution. Audit events are created for each transaction, containing all information needed to trace back to business flow and rules. Recompute and reconciliation are possible to ensure reproducibility. All upstream events are consumed and processed by the service.
Uber has made a public commitment to phase out carbon emissions in the United States, Canada, and Europe by 2030, and worldwide by 2040. We maintain periodic updates on our progress via our Climate Assessment and Performance Report, which shows both how far we’ve come and how far we have yet to go.
Underlying this report is a large effort to prepare the data presented within, everything from identifying vehicle fuel type across markets, to carbon emissions per vehicle-mile and ultimately to passenger-mile traveled. In this blog post, we’ll take a closer look at the complexities of identifying “green” vehicles onboarded to Uber, and our solutions for managing those data.
Apache Spark™ is a widely used open source distributed computing engine. It is one of the main components of Uber’s data stack.
Spark is the primary batch compute engine at Uber. Like any other framework, Spark comes with its own set of tradeoffs.
Uber has one of the largest Hadoop® Distributed File System (HDFS) deployments in the world, with exabytes of data across tens of clusters. It is important, but also challenging, to keep scaling our data infrastructure with the balance between efficiency, service reliability, and high performance. As a cost efficiency improvement effort that will save us tens of millions dollars every year, we aim to adopt higher density HDD (16+TB) SKUs to replace existing SKUs with 4TB HDDs that are still used by the majority of our HDFS clusters.
One of the biggest challenges when fully adopting high-density disk SKU comes from the disk IO bandwidth. While the capacity of each HDD increases by 2x to 4x, the I/O bandwidth of each HDD does not increase accordingly. This may cause IO throttling when DataNodes serve read/write requests. This can be seen from the chart below, which shows the trend of slow read packet read count from one DataNode. Given the persistent and sizable number of slow read occurrences, it is important to find new approaches to prevent performance degradation.
All the best things come in threes: the Three Musketeers, the Three Stooges, and, of course, your favorite three-cheese pizza ordered via the UberEats app. Engineering Security (EngSec) at Uber agrees and we have formed our own trio for how we simulate cybersecurity incidents at Uber to exercise our ability to act decisively should an incident occur. This three-pronged approach consists of tabletop exercises, red team operations, and atomic simulations.
In November 2021 we decided to evaluate arm64 for Uber. Most of our services are written in either Go or Java, but our build systems only supported compiling to x86_64. Today, thanks to Open Source collaboration, Uber has a system-independent (hermetic) build toolchain that seamlessly powers multiple architectures. We used this toolchain to bootstrap our arm64 hosts. This post is a story with how we went about it, our early thinking, problems, some achievements, and next steps.
At Uber, we obsess over delivering highly performant and reliable experiences to our partners and customers. We treat degradations to app performance the same way as any other functional regressions.…
At Uber, we put safety first in order to minimize risks for users on the Uber platform. Uber Insurance Tech focuses on three pillars; claims, compliance, and affinity programs.
Airports currently hold a significant portion of Uber’s supply and open supply hours (i.e., supply that is not utilized, but open for dispatch) across the globe. At most airports, drivers are obligated to join a “first-in-first-out” (FIFO) queue from which they are dispatched. When the demand for trips is high relative to the supply of drivers in the queue (“undersupply”), this queue moves quickly and wait times for drivers can be quite low. However, when demand is low relative to the amount of available supply (“oversupply”), the queue moves slowly and wait times can be very high. Undersupply creates a poor experience for riders, as they are less likely to get a suitable ride. On the other hand, oversupply creates a poor experience for drivers as they are spending more time waiting for each ride and less time driving. What’s more, drivers don’t currently have a way to see when airports are under- or over-supplied, which perpetuates this problem.
One way to tackle this undersupply/oversupply issue at airports is to forecast supply balance and use this to optimize resource allocation. Our first application of these models is in estimating the time to request (ETR) for the airport driver queue. We estimate the length of time a driver would have to wait before they receive a trip request, thereby giving drivers the information they need to identify and reposition in periods of undersupply (short waits), or to remain in the city during periods of oversupply (long waits).
The Global Data Warehouse team at Uber democratizes data for all of Uber with a unified, petabyte-scale, centrally modeled data lake. The data lake consists of foundational fact, dimension, and aggregate tables developed using dimensional data modeling techniques that can be accessed by engineers and data scientists in a self-serve manner to power data engineering, data science, machine learning, and reporting across Uber. The ETL (extract, transform, load) pipelines that compute these tables are thus mission-critical to Uber’s apps and services, powering core platform features like rider safety, ETA predictions, fraud detection, and more. At Uber, data freshness is a key business requirement. Uber invests heavily in engineering efforts that process data as quickly as possible to keep it up to date with the happenings in the physical world.
In order to achieve such data freshness in our ETL pipelines, a key challenge is incrementally updating these modeled tables rather than recomputing all the data with each new ETL run. This is also necessary to operate these pipelines cost-effectively at Uber’s enormous scale. In fact, as early as 2016, Uber introduced a new “transactional data lake” paradigm with powerful incremental data processing capabilities through the Apache Hudi project to address these challenges. We later donated the project to the Apache Software Foundation. Apache Hudi is now a top-level Apache project used industry wide in a new emerging technology category called the lakehouse. During this time, we are excited to see that the industry has largely moved away from bulk data ingestion towards a more incremental ingestion model that Apache Hudi ushered in at Uber. In this blog, we share our work over the past year or so in extending this incremental data processing model to our complex ETL pipelines to unlock true end-to-end incremental data processing.
Uber has multiple, domain-specific products to manage and distribute configuration changes at runtime across our many systems. These configuration products cater to different use cases: some have a web UI that can be used by non-engineers to change product configuration for different cities, and others expose a Git-based interface that primarily caters to engineers.
While these domain-specific configuration products have different applications, they share common parts that can be consolidated for simplicity and to reduce the overhead of operations, maintenance, and compliance. This article will cover how we consolidated and streamlined our underlying configuration and rollout mechanisms, including some of the interesting challenges we solved along the way, and the efficiencies we achieved by doing so.
Uber has made a commitment to sustainability by setting several goals across various sectors. By 2030, Uber plans to become a zero-emission mobility platform in Canada, Europe, and the US – and by 2040, worldwide. Uber Green, which offers no- or low-emission rides, has become the most widely-available option of its kind globally. However, this commitment encompasses more than just rides, as it also includes Uber’s engineering infrastructure such as its data centers and hardware resources, both on-premise and in public clouds.
As engineers and technology leaders, we nurture and develop the concept of responsible ownership, which is often thought of as maintaining high quality of our products. Responsible ownership also implies building efficient services, of which metrics for energy efficiency and sustainability should be an integral part.
In late 2021, we embarked on a journey to find out the best sustainable engineering practices, tools, and technologies, and began building them into our services, products, and training sessions. In this article, we present our vision and roadmap, walk through Uber Eng best practices for engineering sustainably towards a zero-emission world, and introduce novel, sustainability-oriented services.
Data powers almost all critical, customer-facing flows at Uber. Bad data quality impacts our ML models, leading to a bad user experience (incorrect fares, ETAs, products, etc.) and revenue loss.
Still, many data issues are manually detected by users weeks or even months after they start. Data regressions are hard to catch because the most impactful ones are generally silent. They do not impact metrics and ML models in an obvious way until someone notices something is off, which finally unearths the data issue. But by that time, bad decisions are already made, and ML models have already underperformed.
This makes it critical to monitor data quality thoroughly so that issues are caught proactively.
We encountered an unusual problem recently at Uber with Golang™ debugging, as our engineers began transitioning to Apple® Silicon hardware, which uses the ARM64 Instruction Set Architecture (ISA), rather than the x86/AMD64 ISA many of us have been using for many years now. This required some rather complex debugging of the toolchain itself by Uber engineers.
At a company as large and as complex as Uber, the volume of internal communication can easily become overwhelming. Our employees use multiple independent systems to send and receive a variety of notifications every single day. These include:
- Pullo – to raise access requests for software tools and services
- Uber Feedback – to provide and receive co-worker feedback
- Uber Learning – to undertake assigned training
- Workday – to apply for leaves, sign HR contracts, etc.
- ERD-PRD tool – to seek and provide approval of engineering and product documentation
- UberHub – to make requests for IT infra, hardware & software, etc.
Thus employees receive several action items and approval requests from multiple applications on a daily basis. Unified Action Platform or uAct has been built with a view to help employees keep on top of their assigned tasks and action items. uAct aggregates all such requests into one place for employees to easily view and address.