From Batch to Streaming: Accelerating Data Freshness in Uber's Data Lake
At Uber, the data lake is a foundational platform powering analytics and machine learning across the company. Historically, ingestion into the lake was powered by batch jobs with freshness measured in hours. As business needs evolved toward near-real-time insights, we re-architected ingestion to run on Apache Flink®, enabling fresher data, lower costs, and scalable operations at petabyte scale.
Over the past year, we built and validated IngestionNext, a new streaming-based ingestion system centered on Flink. We proved its performance on some of Uber’s largest datasets, designed the control plane for operating thousands of jobs, and addressed streaming-specific challenges such as small file generation, partition skew, and checkpoint synchronization. This blog describes the design of IngestionNext and early results that show improved freshness and meaningful efficiency gains compared to batch ingestion.
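To make the shape of such a job concrete, the sketch below shows a minimal Flink streaming ingestion pipeline in Java: it reads records from a Kafka topic and writes them to files, rolling files on each checkpoint so that output becomes visible only when a checkpoint completes. This is the same mechanism behind the small-file and checkpoint-synchronization challenges mentioned above. It is a simplified illustration under our own assumptions, not IngestionNext itself; the topic name, broker address, and output path are hypothetical.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy;

public class StreamingIngestionJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s. File commits are tied to checkpoint completion,
        // so this interval directly bounds end-to-end freshness.
        env.enableCheckpointing(60_000);

        // Hypothetical topic and brokers, for illustration only.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("trip-events")
                .setGroupId("ingestion-demo")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Roll part files on every checkpoint: files are published atomically
        // with the checkpoint, which is also why frequent checkpoints produce
        // many small files if left untuned.
        FileSink<String> sink = FileSink
                .forRowFormat(new Path("hdfs:///datalake/trip_events"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(OnCheckpointRollingPolicy.build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .sinkTo(sink);

        env.execute("streaming-ingestion-demo");
    }
}
```

Note the tension visible even in this toy example: a shorter checkpoint interval improves freshness but generates more, smaller files, which is why compaction and checkpoint tuning matter at scale.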
Two key drivers motivated our move from batch to streaming: data freshness and cost efficiency.
As the business moved faster, the Delivery, Rider, Mobility, Finance, and Marketing Analytics organizations at Uber consistently asked for fresher data to power real-time experimentation and model development. Batch ingestion provides data with delays of hours, or in some cases even days, limiting the speed of iteration and decision-making. By re-platforming ingestion on Flink, we cut freshness from hours to minutes. This shift directly accelerates model launches, experimentation velocity, and analytics accuracy across the company.