Automated Migration and Scaling of Hadoop™ Clusters

[Pinterest Engineering](https://medium.com/@Pinterest_Engineering?source=post_page---byline--69c0967228e4---------------------------------------)

Joe Sabolefski, Sr. Site Reliability Engineer

Pinterest Big Data Infrastructure

Much of Pinterest’s big data is processed using frameworks like MapReduce™, Spark™, and Flink™ on Hadoop YARN™. The processing runs on many thousands of nodes spread across more than a dozen clusters. We use AWS for our infrastructure, and each cluster uses Auto Scaling Groups (ASGs) to maintain cluster size. Because Hadoop is stateful, we do not auto-scale the clusters; each ASG is fixed in size (desired = min = max). We use Terraform to create each cluster. Before we introduced the Hadoop Control Center (HCC), Terraform was also used to scale out the ASGs. Scaling in (downsizing), however, is a more complex process that required several manual steps. Our goal was to perform as many cluster operations as possible automatically, around the clock, with minimal user intervention and no impact on workloads.
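
To make the fixed-size convention (desired = min = max) concrete, here is a minimal Python sketch. It is not HCC code: the `FixedSizeAsg` class, the `is_fixed_size` helper, and the group name `hadoop-workers` are hypothetical names chosen for illustration. The dictionary keys are assumed to line up with the parameters of the AWS `UpdateAutoScalingGroup` API, and the sketch covers only the size pinning, not the node decommissioning that makes scaling in harder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FixedSizeAsg:
    """Hypothetical model of an ASG whose size is pinned (desired = min = max)."""

    name: str
    size: int

    def resize_request(self, new_size: int) -> dict:
        """Build resize parameters that keep min, max, and desired equal."""
        if new_size < 0:
            raise ValueError("ASG size cannot be negative")
        return {
            "AutoScalingGroupName": self.name,
            "MinSize": new_size,
            "MaxSize": new_size,
            "DesiredCapacity": new_size,
        }


def is_fixed_size(min_size: int, max_size: int, desired: int) -> bool:
    """Return True when the group cannot drift: all three values match."""
    return min_size == max_size == desired


# Example: grow a (hypothetical) 100-node worker group to 120 nodes.
asg = FixedSizeAsg(name="hadoop-workers", size=100)
request = asg.resize_request(120)
```

Because all three values move together, the group never scales on its own; any change in size is an explicit operation rather than a reaction to load.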

The Migration Challenge

It may seem easier to configure and launch a new cluster with the desired migration features, such as a new AMI (latest OS/kernel) and instance type, but that’s not always the case. This approach can work for small clusters and was used before HCC. However, with some of our clusters having more than 3,000 nodes, it may not be feasible. We faced several major issues and concerns:
