Managing Apache Spark™ Resources on AWS EKS with Apache YuniKorn™ at Pinterest
Yongjun Zhang; Staff Software Engineer | William Tom; Staff Software Engineer | Sandeep Kumar; Software Engineer
Monarch, Pinterest’s Batch Processing Platform, was initially designed to support Pinterest’s ever-growing number of Apache Spark and MapReduce workloads at scale. At Monarch’s inception in 2016, the dominant batch processing technology available for building the platform was Apache Hadoop YARN. Now, eight years later, we have made the decision to move off of Apache Hadoop and onto our next-generation Kubernetes (K8s) based platform. These are some of the key issues we aim to address:
- Application isolation with containerization: In Apache Hadoop 2.10, YARN applications share the same common environment without container isolation. This often leads to hard-to-debug dependency conflicts between applications.
- GPU support: Node labeling support was added to Apache Hadoop YARN’s Capacity Scheduler (YARN-2496) but not to Fair Scheduler (YARN-2497), and at Pinterest we are heavily invested in Fair Scheduler. Upgrading to a newer Apache Hadoop version with node labeling support in Fair Scheduler, or migrating to Capacity Scheduler, would require tremendous engineering effort. Kubernetes, by contrast, treats GPUs as schedulable resources that a pod can request directly (see the sketch after this list).
- Hadoop upgrade effort: In 2020, we upgraded from Apache Hadoop 2.7 to 2.10. This minor version upgrade process took approximately one year. A major version upgrade to 3.x will take us significantly more time.
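As a rough illustration of the GPU point above, the sketch below shows how a Spark job might request GPU executors when running on Kubernetes. It is a minimal example under assumed settings: the container image, namespace, and discovery-script path are placeholders, not Pinterest’s actual configuration.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a Spark-on-Kubernetes session that requests GPUs.
# The image, namespace, and script path below are illustrative assumptions.
spark = (
    SparkSession.builder
    .appName("gpu-workload-sketch")
    # Executors run as pods; the Kubernetes master URL and authentication are
    # expected to come from spark-submit / kubeconfig and are omitted here.
    .config("spark.kubernetes.container.image", "example.registry/spark-gpu:latest")
    .config("spark.kubernetes.namespace", "spark-jobs")
    # Request one GPU per executor and per task; the Kubernetes scheduler places
    # the executor pods on GPU nodes without YARN-style node-label configuration.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/bin/getGpusResources.sh")
    .getOrCreate()
)
```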