Uber's Strategy to Upgrade 2M+ Spark Jobs
With over 2 million Apache Spark™ applications running daily, Uber operates one of the largest Spark deployments in the industry. Migrating Spark versions is no simple feat, given the scale and complexity of our operations. This blog delves into how Uber successfully navigated this migration, and the innovative automation and tooling we developed to make it possible.
Apache Spark has long been a cornerstone of Uber’s data infrastructure, enabling everything from data analytics and machine learning to real-time processing. Today, users launch over 2 million Spark applications through more than 20,000 scheduled workflows and thousands of interactive sessions each day. Until recently, all of these workloads ran on Spark 2.4. This blog explores the transition from Spark 2.4 to Spark 3.3 and the challenges faced along the way.
Because Spark 3.3 supports Kubernetes as a resource manager out of the box, we decided to move to Spark 3.3 to:
- Improve efficiency and cut costs through the many Spark 3.3 optimizations, such as adaptive query execution and dynamic partition pruning
- Adopt fixes for Common Vulnerabilities and Exposures (CVEs) that make Spark more secure
- Improve developer productivity with features like Koalas™ (pandas on PySpark)
- Onboard other optimizations like Apache Gluten™ and Velox
- Be on par with the latest open-source contributions to Spark
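To make the first item in the list concrete, here is a minimal sketch of how adaptive query execution and dynamic partition pruning are controlled through Spark SQL configuration. The three property names are standard Spark settings, and both features are already enabled by default in Spark 3.3, so setting them explicitly mainly documents intent. The helper `to_submit_args`, the constant `AQE_AND_DPP_CONF`, and the script name `job.py` are hypothetical, introduced only for illustration. This post does not say which settings Uber actually uses.

```python
# Standard Spark SQL properties for the two optimizations named above.
# The choice of values here is an assumption, not Uber's configuration.
AQE_AND_DPP_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
}


def to_submit_args(conf):
    """Render a conf mapping as a flat list of `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):  # sorted so the output is deterministic
        args.extend(["--conf", f"{key}={conf[key]}"])
    return args


if __name__ == "__main__":
    # job.py is a placeholder application name.
    command = ["spark-submit", *to_submit_args(AQE_AND_DPP_CONF), "job.py"]
    print(" ".join(command))
```

The same keys can also be set on a running session with `spark.conf.set(key, value)`, which is handy when comparing a query's plan with a feature on and off.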
At Uber, Spark applications are written in Java®, Scala, or Python. Either Apache Hadoop® YARN or Kubernetes® handles resource management, with both options su...