Tracking Down Mysterious ML Training Stalls

[Pinterest Engineering](https://medium.com/@Pinterest_Engineering?source=post_page---byline--5290bb19be6d---------------------------------------)

Chen Yang, Andrew Yu, Shunyao Li, Pong Eksombatchai, Mark Molinaro

Introduction

Pinterest’s ML training platform, MLEnv, is a collection of external software dependencies and an in-house, high-performance ML library, centered around Meta AI’s PyTorch. Upgrading the PyTorch version means upgrading a web of dependencies together with the necessary code changes. In most cases, this is just mundane work… until it isn’t.

During Pinterest’s latest effort to upgrade PyTorch, instead of seeing neutral or improved training throughput, we saw a sharp drop (>50%) in performance. Debugging the root cause turned out to be an interesting journey, from observing the throughput impact all the way down to identifying the culprit low-level Linux kernels. This blog post documents our debugging process. We hope it can help a broader audience with their future work.

Background

Before we start, let us briefly review the ML model training process. A distributed data-parallel ML trainer generally consists of these steps: data loading, the model forward and backward pass, all-reduce synchronization among GPU ranks, and an optimizer step that applies the gradients to the model weights. Depending on the model architecture, there may be additional synchronization points during the forward and backward pass. At Pinterest, we also leverage Anyscale’s Ray to horizontally scale the data loader (see here), where the bulk of dat...
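
To make the four training steps concrete, here is a minimal sketch of a data-parallel loop. It is a toy, pure-Python simulation under stated assumptions, not PyTorch or MLEnv code: the ranks run sequentially in one process, the model is a single scalar weight, and every function name (`load_shard`, `forward_backward`, `all_reduce_mean`, `optimizer_step`, `train`) is hypothetical and chosen only for illustration.

```python
# Toy simulation of a distributed data-parallel training loop.
# Assumptions: ranks are simulated in one process, the "model" is y = weight * x,
# and all names below are hypothetical (not part of PyTorch, Ray, or MLEnv).

def load_shard(rank, world_size, data):
    """Data loading: each rank reads a disjoint, strided shard of the dataset."""
    return data[rank::world_size]


def forward_backward(weight, batch):
    """Forward and backward pass with mean squared error.

    Returns the local gradient d(loss)/d(weight) for this rank's batch.
    """
    grad = 0.0
    for x, y in batch:
        pred = weight * x                  # forward pass
        grad += 2.0 * (pred - y) * x       # backward pass for (pred - y) ** 2
    return grad / len(batch)


def all_reduce_mean(local_grads):
    """All-reduce: every rank receives the average of all local gradients."""
    mean = sum(local_grads) / len(local_grads)
    return [mean for _ in local_grads]


def optimizer_step(weight, grad, lr):
    """Optimizer: plain SGD update applied to the model weight."""
    return weight - lr * grad


def train(data, world_size=2, steps=50, lr=0.05):
    """Run the loop and return the weight replica held by each rank."""
    weights = [0.0] * world_size
    shards = [load_shard(r, world_size, data) for r in range(world_size)]
    for _ in range(steps):
        local = [forward_backward(weights[r], shards[r]) for r in range(world_size)]
        synced = all_reduce_mean(local)    # synchronization point across ranks
        weights = [optimizer_step(weights[r], synced[r], lr) for r in range(world_size)]
    return weights


if __name__ == "__main__":
    samples = [(float(x), 3.0 * x) for x in range(1, 5)]  # ground truth: y = 3x
    print(train(samples))
```

Because every rank applies the same averaged gradient, the replicas stay identical after each step. In a real trainer the all-reduce is also a point where every rank waits for the slowest one, so a stall on a single rank can lower the throughput of the whole job.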