Accelerating Deep Learning: How Uber Optimized Petastorm for High-Throughput and Reproducible GPU Training

April 9, 2026

Introduction

At Uber, we train massive deep learning models to power our marketplace and core services. As these models grow in complexity, the infrastructure required to train them becomes a significant cost and performance factor. Specifically, maximizing the utilization of our high-performance GPUs during training, without sacrificing reproducibility, is a constant engineering challenge.

Recently, one of our core machine learning teams identified a critical bottleneck in their training pipeline. They were working with a massive dataset—tens of terabytes in size, containing tens of billions of rows and hundreds of features. At this scale, the data loading pipeline couldn’t keep up with the compute speed, causing GPUs to idle at low utilization rates (10-15%) while waiting for data.

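For context, here is a minimal sketch of the kind of Petastorm-to-PyTorch input pipeline in which this bottleneck shows up. The dataset path and `train_step` are hypothetical stand-ins; `make_batch_reader` and `petastorm.pytorch.DataLoader` are the stock Petastorm API.

```python
# Illustrative sketch of a baseline Petastorm input pipeline feeding PyTorch.
# The dataset URL and train_step() are hypothetical, not Uber's actual code.
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# make_batch_reader streams Arrow record batches out of a Parquet store.
with make_batch_reader("hdfs:///path/to/train.parquet", num_epochs=1) as reader:
    loader = DataLoader(reader, batch_size=1024)
    for batch in loader:
        # If the reader cannot produce batches as fast as the forward/backward
        # pass consumes them, the GPU stalls here waiting for data -- which is
        # what the 10-15% utilization described above looks like in practice.
        train_step(batch)  # hypothetical training step
```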

At the same time, we observed a second issue: reproducibility. As the training stack scaled to larger models and higher parallelism, we saw a significant increase in run-to-run variance in key evaluation metrics, even with identical configurations and seeds. This made it harder to compare model architectures reliably and to ensure consistency across production training jobs.

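To make "identical seeds" concrete, below is a minimal sketch of the standard determinism controls a PyTorch training job typically pins down (illustrative; not our actual configuration). Note that even this level of control does not by itself determinize a parallel input pipeline, since batch ordering can still depend on worker scheduling, which is one reason variance can grow with parallelism even when every seed is fixed.

```python
# Minimal sketch of standard PyTorch determinism controls (illustrative;
# not Uber's actual training configuration).
import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    random.seed(seed)                 # Python RNG
    np.random.seed(seed)              # NumPy RNG
    torch.manual_seed(seed)           # CPU and CUDA RNGs
    torch.cuda.manual_seed_all(seed)  # all visible GPUs
    # Prefer deterministic kernels; ops without a deterministic
    # implementation will raise instead of silently varying.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False  # disable nondeterministic autotuning
    # Required by some CUDA ops when deterministic algorithms are enforced.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```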

In this blog, we first walk through how we engineered a solution to these data bottlenecks within the Petastorm data loader. In our case study, the optimizations dramatically improved throughput on this massive dataset, raising GPU utilization to over 60%, cutting end-to-end training time from 22 hours to 3 hours, and slashing compute costs by nearly 80%.

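As background for that discussion, these are the stock parameters Petastorm exposes for reader parallelism and shuffling. This is a sketch of the tuning surface only, not the specific optimizations made here; the path is hypothetical.

```python
# Illustrative only: the standard Petastorm knobs that govern how much
# parallelism and shuffling the reader applies. Not the specific changes
# described in this post.
from petastorm import make_batch_reader

reader = make_batch_reader(
    "hdfs:///path/to/train.parquet",  # hypothetical path
    reader_pool_type="process",       # decode row groups in worker processes
    workers_count=16,                 # number of parallel reader workers
    shuffle_row_groups=True,          # randomize row-group order per epoch
    num_epochs=1,
)
```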
