Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest


Felix Loesing | Software Engineer


In 2025, we set out to drastically reduce out-of-memory errors (OOMs) and cut resource usage in our Spark applications by automatically identifying tasks with higher memory demands and retrying them on larger executors with a feature we call Auto Memory Retries.
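To make the idea concrete, here is a minimal sketch of retrying a failed task with more memory. It is not Pinterest's implementation: the names `Task`, `SimulatedOOM`, `run_on_executor`, and `run_with_memory_retries`, as well as the base size, growth factor, and attempt limit, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


class SimulatedOOM(Exception):
    """Stands in for an executor out-of-memory failure (hypothetical)."""


@dataclass
class Task:
    task_id: int
    peak_memory_gb: float  # what the task really needs; unknown to the scheduler


def run_on_executor(task: Task, executor_memory_gb: float) -> None:
    """Pretend to run a task; fail if it needs more memory than the executor has."""
    if task.peak_memory_gb > executor_memory_gb:
        raise SimulatedOOM(f"task {task.task_id} exceeded {executor_memory_gb} GB")


def run_with_memory_retries(task: Task, base_memory_gb: float = 4.0,
                            growth: float = 2.0, max_attempts: int = 3):
    """Retry an OOM-failed task on progressively larger executors.

    Returns (attempts_used, executor_memory_gb) for the successful attempt.
    All defaults are illustrative assumptions, not values from the article.
    """
    memory_gb = base_memory_gb
    for attempt in range(1, max_attempts + 1):
        try:
            run_on_executor(task, memory_gb)
            return attempt, memory_gb
        except SimulatedOOM:
            memory_gb *= growth  # the next attempt gets a larger executor
    raise RuntimeError(f"task {task.task_id} still fails after {max_attempts} attempts")


if __name__ == "__main__":
    print(run_with_memory_retries(Task(task_id=1, peak_memory_gb=3.0)))
    print(run_with_memory_retries(Task(task_id=2, peak_memory_gb=10.0)))
```

In this sketch only the tasks that actually run out of memory pay for a larger executor, while every other task keeps the small default size. That is the intuition behind cutting both OOMs and overall resource usage.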


Spark Platform


Pinterest runs a large-scale Apache Spark deployment to meet the growing demands of internal customers such as AI/ML, experimentation, and reporting. We process 90k+ Spark jobs daily on tens of thousands of compute nodes, with hundreds of PB in shuffle size.¹ Our clusters run on Kubernetes and mainly use Spark 3.2, with an upgrade to Spark 3.5 in progress. We use Apache Celeborn as our shuffle service and Apache YuniKorn as our scheduler, accelerate computation with Apache Gluten and Meta’s Velox, and submit jobs through our in-house submission service, Archer. Check out this blog post to learn more about our data infrastructure.


Problem Identification


Historically, we knew that OOM errors were frequent in our clusters due to small executor sizes. Increasing executor sizes is not straightforward because our clusters are memory bound: our jobs request more memory per core than the physical hardware provides. Our main approach to get our jobs’ memory ratio closer to the hardware is to continuou...
