Resource Management with Apache YuniKorn™ for Apache Spark™ on AWS EKS at Pinterest
Yongjun Zhang; Staff Software Engineer | William Tom; Staff Software Engineer | Sandeep Kumar; Software Engineer |
Monarch, Pinterest’s Batch Processing Platform, was initially designed to support Pinterest’s ever-growing number of Apache Spark and MapReduce workloads at scale. During Monarch’s inception in 2016, the most dominant batch processing technology around to build the platform was Apache Hadoop YARN. Now, eight years later, we have made the decision to move off of Apache Hadoop and onto our next generation Kubernetes (K8s) based platform. These are some of the key issues we aim to address:
- Application isolation with containerization: In Apache Hadoop 2.10, YARN applications share the same common environment without container isolation. This often leads to hard to debug dependency conflicts between applications.
- GPU support: Node labeling support was added to Apache Hadoop YARN’s Capacity Scheduler (YARN-2496) and not Fair Scheduler (YARN-2497), but at Pinterest we are heavily invested in Fair Scheduler. Upgrading to a newer Apache Hadoop version with node labeling support in Fair Scheduler or migrating to Capacity Scheduler will require tremendous engineering effort.
- Hadoop upgrade effort: In 2020, we upgraded from Apache Hadoop 2.7 to 2.10. This minor version upgrade process took approximately one year. A major version upgrade to 3.x will take us significantly more time.
- Hadoop community support: The industry as a whole has been moving to K8s and Apache Hadoop is largely in maintenance mode.
Over the past few years, we’ve observed widespread industry adoption of K8s to address these challenges. After a lengthy internal proof of concept and evaluation we made the decision to build out our next generation K8s-based platform: Moka (Monarch on K8s).
In this blog, we’ll be covering the challenges and corresponding solutions that enable us to migrate thousands of Spark applications from Monarch to Moka while maintaining resource efficiency and a high quality of service. You can check out our previous blog, which describes how Monarch resources are managed efficiently to achieve cost saving while ensuring job stability. At the time of writing this blog, we have migrated half of the Spark workload running on Monarch to Moka. The remaining MapReduce jobs running on Monarch are actively being migrated to Spark as part of a separate initiative to deprecate MapReduce within Pinterest.
Achieving Efficient Resource Management
Our goal with Moka resource management was to retain the positive properties of Monarch’s resource management and improve upon its shortcomings.
First, let’s cover the things that worked well on Monarch that we wanted to bring over to Moka:
- Associating each workflow with its owner’s organization and project
- Classifying all workflows into three tiers: tier 1 (highest priority), tier 2, and tier 3, and defining the runtime SLO for each workflow
- Using hierarchical org-base-queue (OBQ) in the format of
root.. for resource allocation and workload scheduling. - Tracking per-application runtime data, including the application’s start and end time, memory and vCore usage, etc.
- Efficient tier-based resource allocation algorithms that automatically adjust OBQs based on the historical resource usage of the workflows assigned to the queues
- Services that can route workloads from one cluster to another, which we call cross-cluster-routing (CCR), to their corresponding OBQs
- Onboarding queues with reserved resources to onboard new workloads
- Periodic resource allocation processes manage OBQs from the resource usage data collected during the onboarding queue runs
- Dashboards to monitor resource utilization and workflow runtime SLO performance
The first step to programmatically migrating workloads from Monarch to Moka at scale is resource allocation. Most of the items listed above can be re-used as-is or easily extended to support Moka. However, there are a few additional items that we would need to support resource allocation on Moka:
- A scheduler that is application-aware, allows managing cluster resources as hierarchical queues, and supports preemption
- A pipeline to export resource usage information for all applications run on Moka and ingestion pipelines to generate insight tables
- A resource allocation job that uses the insights tables to generate the OBQ resource allocation
- An orchestration layer that is able to route workloads to their target clusters and OBQs
Now let’s go into detail on how we solved these four missing pieces.
Scheduler
The default K8s scheduler is a “jack of all trades, master of none” solution that isn’t particularly adept at scheduling batch processing workloads. For our resource scheduling needs, we needed a scheduler that supports hierarchical queues and is able to schedule on a per-application and per-user basis instead of per-pod basis.
Apache YuniKorn is designed by a group of engineers with deep experience working on Apache Hadoop YARN and K8s. Apache YuniKorn not only recognizes users, applications, and queues, but also includes many other factors, such as ordering, priorities, and preemption when making scheduling decisions.
Given that Apache YuniKorn has the most attributes we need, we decided to use it in Moka.
Resource Usage Insight Information
As mentioned earlier, application resource usage history is critical for how we do resource allocation. At a high level, we use the historical usage of a queue as a baseline to estimate how much should be allocated going forward in each iteration of the allocation process. However, when we made the decision to first adopt Apache YuniKorn it was missing this vital feature. Apache YuniKorn was entirely stateless and only tracked instantaneous resource utilization across the cluster.
We needed a solution that would be able to reliably track resource consumption of all pods belonging to an application and was fault tolerant. For this, we worked closely with the Apache YuniKorn community to add support for logging resource usage information for finished applications (see YUNIKORN-1385 for more information). This feature aggregates pod resource usage per application and reports a final resource usage summary upon application completion.
This summary is logged to Apache YuniKorn’s stdout where we use Fluent Bit to filter out the app summary logs and write them to S3.
By design, the Apache YuniKorn application summary contains similar records as YARN’s application summary produced by YARN ResourceManager (see more details in this document) so that it would fit seamlessly into existing consumers of YARN application summaries.
In addition to application resource usage information, the following mapping information is automatically ingested into the insight tables:
- Workflow to project, tier, SLO
- Project to owner
- Owner to company organization
- Workflow job to applications
This information is used for associating workloads with their target queue and estimating the queue’s future resource requirements.
Figure 1 shows the information ingestion and resource allocation flow.
Figure 1: Org-based-queue data ingestion and resource allocation process
Resource Allocation Algorithm Improvement
To learn more about the resource allocation algorithm, see our previous blog post: Efficient Resource Management at Pinterest’s Batch Processing Platform.
Our algorithm prioritizes resource allocation to tier 1 and tier 2 queues up to a specified percentile threshold of the queue’s required resources.
One downside to this approach is that it often requires several iterations of resource allocation to converge to a stable one, where each iteration requires some manual tuning of parameters.
As part of the Moka migration, we designed and implemented a brand new algorithm that leverages Constraint Programming Using CP-SAT from the OR-Tools open source suite for optimization. This tool generates a model by constructing a collection of constraints based on the usage/capacity ratio gap between high tier (tier 1 and 2) and low tier (tier 3) resource requests. This new resource allocation algorithm runs faster and more reliably without human intervention.
OBQ and CCR Routing
Our job submission layer, Archer, is responsible for handling all job submissions to Moka. Archer provides flexibility at the platform level to analyze, modify and route jobs at submission time. This includes routing jobs to specific Moka clusters and queues.
Figure 2 shows how we do resource allocation with CCR for select jobs and the deployment process. The resource allocation change made at git repo is automatically submitted to Archer, and Archer talks to k8s to deploy the changed resource allocation config map, and then route jobs at runtime based on the CCR rules set up in the Archer Routing DB.
We plan to cover Archer and Moka in future blog posts.
Figure 2: Moka resource allocation and deployment process
Some Critical Apache YuniKorn Features and Fixes
In addition to YUNIKORN-1385, here are some other features and fixes we contributed back to the Apache YuniKorn community.
- YUNIKORN-790: Adds support for maxApplications to limit the number of applications that can run concurrently in any queue
- YUNIKORN-1988: Prevents preemption when a queue’s resource usage is lower than its guaranteed capacity
- YUNIKORN-2030: Fixes a bug when checking headroom which causes Apache YuniKorn to stall
Apache YuniKorn is still a relatively young open source project, and we are continuing to work with the community together to enrich its functionality and improve its reliability and efficiency.
Workflow SLO Monitoring
Once we started to onboard real production workloads to Moka, we extended our existing Workflow SLO Performance Dashboards for Monarch to include the daily runtime results of apps running on Moka. Our goal is to ensure at least 90% of tier 1 workflows meet their SLO at least 90% of time.
Figure 3: Workflow SLO performance analytics
Future Work
Despite having made great progress building out the Moka platform and migrating jobs from our legacy platform there are still many improvements we have planned. To give you a teaser of what’s to come:
We are in the process of designing a stateful service that is able to leverage YUNIKORN-1628 and YUNIKORN-2115 which introduce event streaming support.
In addition, we are working on a full-featured resource management console to manage the platform resources. This console will enable platform administrators to monitor cluster and queue resource usage in real time and allows custom load balancing between clusters.
Acknowledgment
First of all, thanks to the Apache YuniKorn community for their assistance and collaboration when we were evaluating Apache YuniKorn and for working with us to submit patches we made internally back to the open source project.
Next, thanks to our brilliant teammates, Rainie Li, Hengzhe Guo, Soam Acharya, Bogdan Pisica, Aria Wang from the Batch Processing Platform and Data Processing Infra teams for their work on Moka. Thanks Ang Zhang, Jooseong Kim, Chen Qin, Soam Acharya, Chunyan Wang et al for their support and insights while we were working on the project.
Last but not least, thank you to the Workflow Platform team and all of the client teams of our platform for their feedback.
References
To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.