Finding Zombies in Our Systems: A Real-World Story of CPU Bottlenecks

Vaibhav Shankar; Staff Software Engineer | Raymond Lee; Staff Software Engineer | Chia-Wei Chen; Staff Software Engineer | Shunyao Li; Sr. Software Engineer | Yi Li; Staff Software Engineer | Ambud Sharma; Principal Engineer | Saurabh Vishwas Joshi; Principal Engineer | Charles-A. Francisco; Senior Engineer | Karthik Anantha Padmanabhan; Director, Engineering | David Westbrook; Sr. Manager, Engineering

One day in early 2025, the Kubernetes platform team at Pinterest (PinCompute) got a ping from our partners on the ML platform team. Their Ray-based training jobs, which often take hours of computation on expensive GPU hardware, were crashing. Not every time, but often enough to be noticeable. Their logs indicated that the distributed training jobs were intermittently losing network connectivity, which ultimately caused the jobs to crash. Their ask was simple:

  1. Why is this happening?
  2. Can you please make it stop?

That request kicked off an investigation lasting more than three months and taught us a great lesson in profiling performance bottlenecks. Read on for our story about CPU bottlenecks, AWS network drivers, and yes, how we discovered zombies in our system!

Background: Ray at Pinterest

At Pinterest, Ray has risen as the backbone of our next-gen ML training and inference. Over the past few years, it has enabled us to scale systems, accelerate experimentation, and significantly boost the performance of m...
