Next-Generation Data Processing at Massive Scale at Pinterest with Moka (Part 1 of 2)
Soam Acharya: Principal Engineer · Rainie Li: Manager, Data Processing Infrastructure · William Tom: Senior Staff Software Engineer · Ang Zhang: Director, Big Data Platform
As Pinterest’s data processing needs grow and our current Hadoop-based platform (Monarch) ages, the Big Data Platform (BDP) team within Pinterest Data Engineering started considering alternatives for our next-generation massive-scale data processing platform. In this blog post series, we share details of the journey that followed, the architecture of our next-generation data processing platform, and some insights we gained along the way. In part one, we explain the rationale for our new technical direction, then outline the overall design and detail the application-focused layer of our platform. We conclude with the current status and some of our learnings.
Introduction
Encouraged by the growing popularity and adoption of Kubernetes (K8s) in the Big Data community, we explored K8s-based systems as the most likely replacement for Hadoop 2.x. Candidate platforms had to meet the following criteria:
- Extensive support for containers to enhance platform data privacy and security
- Execute Pinterest’s custom Spark fork at comparable or better performance and scale
- Leverage key technical improvements such as GPU support, newer EC2 instance types such as ARM/Gr...