Zero-Downtime PyTorch Upgrades in Production: Methods, Pitfalls, and Lessons Learned
Chi Zhang | Staff Software Engineer, ML Platform
Chen Yang | Sr. Staff Machine Learning Engineer, Applied Science; Lida Li | Sr. Staff Software Engineer, Ads ML Infrastructure
Pong Eksombatchai | (former) Principal Machine Learning Engineer, Applied Science; Saurabh Vishwas Joshi | Principal Engineer, ML Platform
Eric Lopez | Staff Site Reliability Engineer, Production Engineering; Mark Molinaro | Staff Software Engineer, Code and Language Runtime
Introduction
At Pinterest, machine learning (ML) models power real-time recommendations in core experiences as well as advertising at web scale. Behind the scenes, PyTorch is the de facto ML framework, enabling both distributed training and online inference across GPU fleets.
By early 2025, Pinterest production was still running PyTorch 2.1 (October 2023) on CUDA 12.1. That lag of more than a year meant we were missing several important improvements introduced across subsequent PyTorch 2.x releases, including a more capable torch.compile and TorchInductor compiler stack, better support for modern GPU architectures such as NVIDIA Hopper, and maturing training-efficiency features such as FP8 training. To avoid falling further behind that rapidly moving baseline, we set an explicit goal: upgrade our production stack from PyTorch 2.1 to 2.6 (January 2025), bringing the Pinterest ML ecosystem onto a more modern release.