Setting Up Uber's Transactional Data Lake with Incremental ETL Using Apache Hudi
The Global Data Warehouse team at Uber democratizes data for all of Uber with a unified, petabyte-scale, centrally modeled data lake. The data lake consists of foundational fact, dimension, and aggregate tables developed using dimensional data modeling techniques that can be accessed by engineers and data scientists in a self-serve manner to power data engineering, data science, machine learning, and reporting across Uber. The ETL (extract, transform, load) pipelines that compute these tables are thus mission-critical to Uber’s apps and services, powering core platform features like rider safety, ETA predictions, fraud detection, and more. At Uber, data freshness is a key business requirement. Uber invests heavily in engineering efforts that process data as quickly as possible to keep it up to date with the happenings in the physical world.
A key challenge in achieving such data freshness in our ETL pipelines is incrementally updating these modeled tables rather than recomputing all the data with each new ETL run. Incremental updates are also necessary to operate these pipelines cost-effectively at Uber’s enormous scale. In fact, as early as 2016, Uber introduced a new “transactional data lake” paradigm with powerful incremental data processing capabilities through the Apache Hudi project to address these challenges. We later donated the project to the Apache Software Foundation. Apache Hudi is now a top-level Apache project used industry-wide in an emerging technology category called the lakehouse. Since then, we have been excited to see the industry largely move away from bulk data ingestion towards the more incremental ingestion model that Apache Hudi ushered in at Uber. In this blog, we share our work over the past year or so in exte...