PinCompute: A Kubernetes Backed General Purpose Compute Platform for Pinterest

Harry Zhang, Jiajun Wang, Yi Li, Shunyao Li, Ming Zong, Haniel Martino, Cathy Lu, Quentin Miao, Hao Jiang, James Wen, David Westbrook | Cloud Runtime Team

Image Source: https://unsplash.com/photos/ZfVyuV8l7WU

Overview

Modern compute platforms are foundational to accelerating innovation and running applications more efficiently. At Pinterest, we are evolving our compute platform to provide an application-centric and fully managed compute API for the 90th percentile of use cases. This will accelerate innovation through platform agility, scalability, and a reduced cost of keeping systems up to date, and will improve efficiency by running our users’ applications on Kubernetes-based compute. We refer to this next-generation compute platform as PinCompute, and our multi-year vision is for PinCompute to run the most mission-critical applications and services at Pinterest.

PinCompute aligns with the Platform as a Service (PaaS) cloud computing model: it abstracts away the undifferentiated heavy lifting of managing infrastructure and Kubernetes so that users can focus on the unique aspects of their applications. PinCompute evolves Pinterest architecture with cloud-native principles, including containers, microservices, and service mesh. It reduces the cost of keeping systems up to date by providing and managing immutable infrastructure, operating system upgrades, and Graviton instances. It also delivers cost savings by applying enhanced scheduling capabilities to large multi-tenant Kubernetes clusters, including oversubscription, bin packing, resource tiering, and trough usage.

In this article, we discuss the PinCompute primitives, architecture, control plane and data plane capabilities, and showcase the value that PinCompute has delivered for innovation and efficiency at Pinterest.

Architecture

PinCompute is a regional Platform-as-a-Service (PaaS) built on top of Kubernetes. PinCompute’s architecture consists of a host Kubernetes cluster (host cluster) and multiple member Kubernetes clusters (member clusters). The host cluster runs the regional federation control plane and keeps track of workloads in that region. The member clusters are zonal and are used for the actual workload execution. Each zone can have multiple member clusters. This layout aligns strictly with the failure domains defined by the cloud provider, and it clearly defines fault isolation and operation boundaries for the platform to ensure availability and control blast radius. All member clusters share a standard Kubernetes setup across control plane and data plane capabilities, and they support heterogeneous capabilities such as different workload types and hardware selections. PinCompute is multi-tenant: a variety of workload types from different teams and organizations share the same platform. The platform provides the isolation needed to ensure it can be shared across tenants securely and efficiently.

Figure 1: High Level Architecture of PinCompute

Users access the platform via Compute APIs to perform operations on their workloads. We leverage Custom Resources (CR) to define the kinds of workloads supported by the platform, and the platform offers a range of workload orchestration capabilities that support both batch jobs and long-running services in various forms. When a workload is submitted to the platform, it is first persisted through the host cluster’s Kubernetes API. The federation control plane then performs the workload management tasks needed at the regional level, including quota enforcement, workload sharding, and member cluster selection. Next, the workload shards are propagated to member clusters for execution. The member cluster control plane consists of a combination of in-house and open source operators that are responsible for orchestrating workloads of different kinds. The federation control plane also collects the execution statuses of workloads from their corresponding member clusters and aggregates them so they can be consumed via PinCompute APIs.

Figure 2: Workflow for Execution and Status Aggregation of PinCompute
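
As a rough illustration of this flow, the Python sketch below enforces a quota, shards a workload across member clusters, and aggregates per-cluster status into one regional view. It is not PinCompute's implementation: the names `Shard`, `submit`, and `aggregate_status`, the even-split sharding policy, and the replica-count quota are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    cluster: str      # member cluster chosen for this shard
    replicas: int     # replicas this shard is responsible for
    ready: int = 0    # replicas the member cluster reports as ready


def submit(replicas: int, quota_left: int, clusters: list[str]) -> list[Shard]:
    """Enforce quota, then split the workload evenly across member clusters."""
    if replicas > quota_left:
        raise PermissionError("workload exceeds the project's remaining quota")
    if not clusters:
        raise RuntimeError("no member cluster available")
    base, extra = divmod(replicas, len(clusters))
    shards = [Shard(name, base + (1 if i < extra else 0))
              for i, name in enumerate(clusters)]
    # Drop empty shards so tiny workloads are not spread over every cluster.
    return [s for s in shards if s.replicas > 0]


def aggregate_status(shards: list[Shard]) -> dict:
    """Roll per-cluster status up into a single regional view."""
    desired = sum(s.replicas for s in shards)
    ready = sum(s.ready for s in shards)
    return {"desired": desired, "ready": ready, "healthy": ready == desired}
```

A caller would submit once against the regional API and read back a single aggregated status, without needing to know which member clusters run the shards.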

PinCompute Primitives

Figure 3: Workload architecture on PinCompute

PinCompute primitives serve heterogeneous workloads across Pinterest, including long-running services, run-to-finish jobs, ML training, scheduled runs, and more. These use cases fall into three categories: (1) general purpose compute and service deployment, (2) run-to-finish jobs, and (3) infrastructure services. Run-to-finish jobs and infrastructure services are supported by existing Kubernetes-native and Pinterest-specific resources. For general purpose compute and service deployment, PinCompute introduces a new set of primitives that reflects our latest thinking on how to define simple, intuitive, and extensible compute primitives: PinPod, PinApp, and PinScaler.

PinPod is the basic building block for general purpose compute at Pinterest. Like the native Kubernetes Pod, PinPod is a foundational building block, and it adds Pinterest-specific capabilities on top. These include per container updates, managed sidecars, data persistence, failovers, and more, which allow PinPod to be used as a building block in a variety of production scenarios at Pinterest. PinPod is designed to create a clear divide between application and infrastructure teams while retaining the lightweight nature of running containers. It solves many existing pain points: for example, the per container update can speed up application rolling updates, reduce resource consumption, and eliminate disturbance to user containers during infra sidecar upgrades.

PinApp is an abstraction that provides the best way to run and manage long-running applications at Pinterest. By leveraging PinPod as an application replica, PinApp inherits all of PinPod’s integrations and software delivery best practices. Thanks to the federation control plane, PinApp offers a set of built-in orchestration capabilities that fulfill common distributed application management requirements, including zone-based rollouts and zonal capacity balancing. PinApp supports the functionality offered by Kubernetes-native primitives such as Deployments and ReplicaSets, and also includes extensions such as deployment semantics to meet business needs and enhance manageability.

PinScaler is an abstraction that supports application auto scaling at Pinterest. It is integrated with Statsboard, Pinterest’s native metrics dashboard, allowing users to configure application-level metrics with desired thresholds to trigger scaling along with scaling safeguards, such as a cool down window and replica min/max limitations. PinScaler supports simple scaling with CPU and memory metrics, as well as scheduled scaling and custom metrics to support various production scenarios.
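
To make the interaction between thresholds and safeguards concrete, here is a minimal Python sketch of a scaling decision. The article does not describe PinScaler's algorithm, so this borrows the proportional rule used by the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × metric / target)) as a stand-in; the function and parameter names are hypothetical.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Proportional scaling, clamped to the configured replica limits."""
    raw = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, raw))


def next_replicas(current: int, metric: float, target: float,
                  min_replicas: int, max_replicas: int,
                  now: float, last_scaled_at: float,
                  cooldown_seconds: float) -> int:
    """Hold the current size inside the cool down window, otherwise rescale."""
    if now - last_scaled_at < cooldown_seconds:
        return current
    return desired_replicas(current, metric, target, min_replicas, max_replicas)
```

With 10 replicas, an observed metric of 90 against a target of 60 asks for 15 replicas; the min/max limits then clamp that value, and the cool down window suppresses the change entirely if the application was scaled too recently.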

Figure 4: PinCompute Primitives: PinPod, PinApp, and PinScaler. PinPod operates as an independent workload, and also a reusable building block for the higher-order primitive PinApp. PinScaler automatically scales PinApp.

Returning to the bigger picture, PinCompute leverages the next-generation primitives (PinPod, PinApp, PinScaler), building blocks from native Kubernetes and open source communities, and deep integrations with the federation architecture to support the following categories of use cases:

(1) General purpose compute and service deployment: This is handled by PinCompute’s new primitive types. PinApp and PinScaler help long-running stateless services deploy and scale quickly. PinPod functions as a general purpose compute unit and is currently serving Jupyter Notebook for Pinterest developers.

(2) Run-to-finish jobs: PinterestJobSet leverages Jobs to give users a mechanism to execute run-to-finish, framework-less parallel processing; PinterestTrainingJob leverages TFJob and PyTorchJob from the Kubeflow community for distributed training; PinterestCronJob leverages CronJob to execute scheduled jobs based on cron expressions.

(3) Infrastructure services: We have PinterestDaemon leveraging DaemonSet, and a proprietary PinterestSideCar to support different deploy modes of infrastructure services. Components that are able to be shared by multiple tenants (e.g. logging agent, metrics agent, configuration deployment agent) are deployed as PinterestDaemons, which ensures one copy per node, shared by all Pods on that node. Those that cannot be shared will leverage PinterestSideCar and will be deployed as sidecar containers within user Pods.

The PinCompute primitives enable Pinterest developers to delegate infrastructure management and the associated concerns of troubleshooting and operations, allowing them to concentrate on evolving business logic to better serve Pinners.

Accessing PinCompute

Users access PinCompute primitives via PinCompute’s Platform Interfaces, which consist of an API layer, a client layer for the APIs, and the underlying services and storage that support those APIs.

Figure 5: High level architecture of PinCompute Platform Interface layer

PinCompute API

PinCompute API is a gateway for users to access the platform. It provides three groups of APIs: workload APIs, operation APIs, and insight APIs. Workload APIs contain methods to perform CRUD actions on compute workloads. Operation APIs provide debugging mechanisms, such as streaming logs or opening container shells, to troubleshoot live workloads. Insight APIs provide runtime information, such as application state changes and system internal events, to help users understand the state of their existing and past workloads.

Why PinCompute API

Introducing PinCompute API on top of raw Kubernetes APIs has many benefits. First, as PinCompute federates many Kubernetes clusters, PinCompute API integrates user requests with federation and aggregates cross-cluster information to form a holistic user-side view of the compute platform. Second, PinCompute API accesses the Kubernetes API efficiently. For example, it contains a caching layer that serves read APIs, which offloads expensive list and query API calls from the Kubernetes API server. Finally, as a gateway service, PinCompute API ensures a uniform user experience when accessing different PinCompute backend services such as Kubernetes, node service, insights service, project governance services, etc.

Figure 6: PinCompute API data flow

Integrating With Pinterest Infrastructure

This layer incorporates Pinterest’s infrastructure capabilities, such as rate limiting and security practices, to simplify Kubernetes API usage and provide a stable interface for our API consumers and developers. The PinCompute API implements rate limiting to ensure fair resource usage by leveraging our Traffic team’s rate limiting sidecar, a reusable Pinterest component. PinCompute API is also fully integrated with Pinterest’s proprietary security primitives so that authentication, authorization, and auditing follow paved paths. This integration enables us to provide Pinterest developers with a unified access control experience, with granularity at the API call and API resource levels. These integrations are critical for PinCompute APIs to be reliable, secure, and compliant.

Enhanced API Semantics

PinCompute API provides enhanced API semantics on top of the Kubernetes API to improve the user experience. One important enhancement is that PinCompute API presents the raw Kubernetes data model in a simplified form, with only the information relevant to building software at Pinterest. This not only reduces the infrastructure learning curve for developers who focus on building high-level application logic, but also improves data efficiency for API serving. For example, removing managed fields reduces data size by up to 50% for PinCompute API calls. We also designed the APIs to be more descriptive for use cases such as pause, stop, and restart-container, which makes them intuitive and easy to use in many scenarios. PinCompute provides OpenAPI documentation as well as auto-generated clients, documentation, and SDKs to help users self-serve when building applications on PinCompute.
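
The managed-fields example can be sketched in a few lines of Python. Kubernetes stores server-side apply bookkeeping under `metadata.managedFields`; the `simplify` helper and the sample Pod below are hypothetical and only illustrate the idea of trimming that field before serving a response. The size reduction on real objects depends on how many managers have touched them.

```python
import copy
import json


def simplify(obj: dict) -> dict:
    """Return a copy of a Kubernetes object without metadata.managedFields."""
    slim = copy.deepcopy(obj)  # leave the caller's object untouched
    slim.get("metadata", {}).pop("managedFields", None)
    return slim


# Hypothetical sample object, shaped like a Pod returned by the Kubernetes API.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "example",
        "managedFields": [
            {"manager": "kubectl", "operation": "Update",
             "fieldsType": "FieldsV1", "fieldsV1": {"f:spec": {}}},
        ],
    },
    "spec": {"containers": [{"name": "main", "image": "example:latest"}]},
}

slim = simplify(pod)
print(len(json.dumps(pod)), "->", len(json.dumps(slim)), "bytes")
```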

PinCompute SDK

We strategically invest in building an SDK for clients to standardize access to PinCompute. With the SDK, we are able to encapsulate best practices such as error handling, retry with backoff, logging, and metrics as reusable building blocks, and ensure these best practices are always applied to a client. We also publish and manage versioned SDKs with clear guidance on how to develop on top of the SDK. We closely work with our users to ensure the adoption of the latest and greatest versions of the SDK for optimized interactions with PinCompute.
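
Retry with backoff is a good example of a practice worth packaging once in an SDK. The Python sketch below shows one common form, capped exponential backoff with full jitter. It is not the PinCompute SDK: `TransientError`, `call_with_retry`, and the default values are assumptions for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def call_with_retry(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                    sleep=time.sleep):
    """Call fn, retrying transient failures with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))
```

Passing `sleep` as a parameter keeps the helper easy to test, and wrapping every API call in one helper like this is what lets an SDK guarantee the policy is applied consistently.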

Managing Resources in PinCompute

PinCompute supports three resource tiers: Reserved, OnDemand, and Preemptible. Users define the resource quota of their projects for each tier. Reserved tier quotas are backed by a fixed-size resource pool and a dedicated workload scheduling queue, which ensures scheduling throughput and capacity availability. OnDemand tier quotas leverage a globally shared, dynamically sized resource pool that serves workloads on a first-come, first-served basis. The Preemptible tier is being developed to make opportunistic use of unused Reserved tier and OnDemand tier capacity, which would be reclaimed when needed by the corresponding tier. PinCompute clusters are also provisioned with buffer space consisting of active but unused resources to accommodate workload bursts. The following diagram illustrates the resource model of PinCompute.

Figure 7: PinCompute resource model
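
A per-project, per-tier quota can be pictured as a small ledger. The Python sketch below is only an illustration of that idea: the `Tier` and `ProjectQuota` names, and the use of CPU cores as the single quota unit, are assumptions rather than PinCompute's data model.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    RESERVED = "Reserved"
    ON_DEMAND = "OnDemand"
    PREEMPTIBLE = "Preemptible"


@dataclass
class ProjectQuota:
    limits: dict                               # Tier -> CPU cores the project may use
    used: dict = field(default_factory=dict)   # Tier -> CPU cores currently charged

    def admit(self, tier: Tier, cpus: int) -> bool:
        """Charge the request to the tier's quota if it fits, else reject it."""
        in_use = self.used.get(tier, 0)
        if in_use + cpus > self.limits.get(tier, 0):
            return False
        self.used[tier] = in_use + cpus
        return True

    def release(self, tier: Tier, cpus: int) -> None:
        """Return capacity to the tier when a workload finishes."""
        self.used[tier] = max(0, self.used.get(tier, 0) - cpus)
```

A tier with no configured limit admits nothing, so a project only consumes capacity in the tiers for which it has explicitly defined quota.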

Scheduling Architecture

PinCompute uses two layers of scheduling mechanisms to ensure effective workload placement. Cluster level scheduling is performed in PinCompute’s regional federation control plane: it takes a workload and picks one or more member clusters for execution. During cluster level scheduling, the workload first passes through a group of filters that remove clusters that cannot fit it, and then a group of score calculators ranks the remaining candidate clusters. Cluster level scheduling ensures that the high-level placement strategy and resource requirements are satisfied, and also takes factors such as load distribution and cluster health into consideration to perform regional optimizations. Node level scheduling happens inside member clusters, where workloads are converted to Pods by the corresponding operators. After Pods are created, a Pod scheduler places them onto nodes for execution. PinCompute’s Pod scheduler leverages the Kubernetes scheduler framework with a combination of upstream and proprietary plugins, so that the scheduler supports all features available in open source Kubernetes while being optimized for PinCompute’s specific requirements.

Figure 8: PinCompute scheduling architecture
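
The filter-then-score pattern used for cluster level scheduling can be sketched as follows. This is a minimal Python illustration, not PinCompute's scheduler: the `Cluster` and `Workload` fields, the three example filters, and the single `spare_capacity` scorer are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Cluster:
    name: str
    zone: str
    free_cpus: int
    healthy: bool = True


@dataclass(frozen=True)
class Workload:
    cpus: int
    zone: Optional[str] = None  # None means any zone is acceptable


Filter = Callable[[Workload, Cluster], bool]
Scorer = Callable[[Workload, Cluster], float]


def is_healthy(w: Workload, c: Cluster) -> bool:
    return c.healthy


def has_capacity(w: Workload, c: Cluster) -> bool:
    return c.free_cpus >= w.cpus


def matches_zone(w: Workload, c: Cluster) -> bool:
    return w.zone is None or w.zone == c.zone


def spare_capacity(w: Workload, c: Cluster) -> float:
    # Prefer clusters with the most room left after placement.
    return float(c.free_cpus - w.cpus)


def rank_clusters(workload: Workload, clusters: list[Cluster],
                  filters: list[Filter], scorers: list[Scorer]) -> list[Cluster]:
    """Drop clusters that fail any filter, then order the rest by total score."""
    candidates = [c for c in clusters if all(f(workload, c) for f in filters)]
    return sorted(candidates,
                  key=lambda c: sum(s(workload, c) for s in scorers),
                  reverse=True)
```

Keeping filters and scorers as plain lists of functions mirrors the plugin style of the Kubernetes scheduler framework: a new placement rule is added by appending a function rather than by changing the ranking loop.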

PinCompute Cost Efficiency

Cost efficiency is critical to PinCompute. We have enacted various methods to successfully drive down PinCompute infrastructure cost without compromising on the user experience.

We promote multi-tenant usage by eliminating unnecessary resource reservations and migrating user workloads to the on-demand resource pool that is shared across the federated environment. We collaborated with major platform users to smooth their workload submission patterns and avoid resource oversubscription. We also started a platform-level initiative to switch GPU usage from P4 family instances to more cost-performant alternatives (i.e., the G5 family). The following diagram shows the trend of PinCompute GPU cost vs. capacity: we reduced cost while supporting the growing business.

Figure 9: PinCompute GPU cost vs. capacity

Moving forward, several ongoing projects in PinCompute will further enhance cost efficiency. (1) We will introduce preemptible workloads to encourage more flexible resource sharing. (2) We will enhance the platform’s resource tiering and workload queueing mechanisms to make smarter decisions, with a balanced tradeoff between fairness and efficiency, when scheduling user workloads.

PinCompute Node Runtime

Node architecture is a critical space where we have invested heavily to ensure applications can run in a containerized, multi-tenant environment securely, reliably, and efficiently.

Figure 10: High level architecture of PinCompute Node and infrastructure integrations

Pod in PinCompute

Pod is designed to isolate tenants on the node. When a Pod is launched, it is granted its own network identity, security principal, and resource isolation boundary atomically, which are immutable during a Pod’s lifecycle.

When defining containers inside a Pod, users can specify one of two lifecycle options: main container or sidecar container. Main containers honor the Pod level restart policy, while sidecar containers are guaranteed to be available for as long as main containers need to run. In addition, users can enable startup and termination ordering between sidecar and main containers. Pods in PinCompute also support per container update, with which containers can be restarted with a new spec without requiring the Pod to be terminated and launched again. Sidecar container lifecycle and per container update are critical features for batch job execution reliability and service deployment efficiency.

PinCompute has a proprietary networking plugin to support a variety of container networking requirements. Host network is reserved for system applications only. “Bridge Port” assigns a node-local, non-routable IP to Pods that do not need to serve traffic. For Pods that need to serve traffic, we provide a “Routable IP” allocated from a shared network interface, or a Pod can request a “Dedicated ENI” for full network segmentation. Network resources such as ENI and IP allocations are holistically managed through a cloud resource control plane, which ensures efficient management.

PinCompute supports a variety of volumes including EmptyDir, EBS, and EFS. Specifically, we have a proprietary volume plugin for logging, which integrates with in-house logging pipelines to ensure efficient and reliable log collections.

Integrating With Pinterest Infrastructure

PinCompute node contains critical integration points between user containers and Pinterest’s infrastructure ecosystem, namely, security, traffic, configuration, logging and observability. These capabilities have independent control planes that are orthogonal to PinCompute, and therefore are not limited to any “Kubernetes cluster” boundary.

Infrastructure capabilities are deployed in one of three ways: as a host-level daemon, as a sidecar container, or in a dual mode. Daemons are shared by all Pods running on the node. Logging, metrics, and configuration propagation are deployed as daemons, as they do not need to leverage the Pod’s tenancy or stay in the critical data paths of the applications running in the Pod. Sidecar containers operate within the Pod’s tenancy and are used by capabilities that rely on the Pod’s tenancy or need performance guarantees, such as traffic and security.

User containers interact with infrastructure capabilities such as logging, configuration, and service discovery through file system sharing, and with capabilities such as traffic and metrics through networking (localhost or Unix domain socket). The Pod, along with our tenancy definition, ensures that the various infrastructure capabilities can be integrated in a secure and effective manner.

Enhanced Operability

PinCompute node has a proprietary node management system that enhances visibility and operability of nodes. It contains node level probing mechanisms to deliver supplementary signals for node health which covers areas such as container runtime, DNS, devices, various daemons, etc. These signals serve as a node readiness gate to ensure new nodes are schedulable only after all capabilities are ready, and are also used during application runtime to assist automation and debugging. As part of node quality of service (QoS), when a node is marked for reserved tier workloads, it can provide enhanced QoS management such as configuration pre-downloading or container image cache refresh. Node also exposes runtime APIs such as container shells and live log streaming to help users troubleshoot their workloads.

Figure 11: PinCompute’s proprietary node management system
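
The readiness gate described above can be pictured as "run every probe, and admit the node only if all of them pass". The Python sketch below illustrates that rule only; the `evaluate_node` helper and the probe names in the usage note are hypothetical, not part of PinCompute's node management system.

```python
from typing import Callable


def evaluate_node(probes: dict[str, Callable[[], bool]]) -> tuple[bool, list[str]]:
    """Run every probe; the node is ready only if all of them pass."""
    failed = []
    for name, probe in probes.items():
        try:
            ok = probe()
        except Exception:
            ok = False  # a probe that crashes is treated as a failed probe
        if not ok:
            failed.append(name)
    return (not failed, sorted(failed))
```

For example, `evaluate_node({"container_runtime": check_runtime, "dns": check_dns})` would return a readiness flag plus the names of any failing probes, which is the kind of supplementary signal that automation and debugging tools can consume.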

Managing PinCompute Infrastructure

Automation has a large return on investment when it comes to minimizing human error and boosting productivity. PinCompute integrates a range of proprietary services aimed at streamlining daily operations.

Automatic Remediation

Operators are often troubled by trivial node health issues. PinCompute is equipped to self-remediate these issues with an automatic remediation service. Health probes running on the Node Manager detect node problems and mark them via specific signal annotations. These signals are monitored and translated into actions, and the remediation service then executes those actions, such as cordoning or terminating a node. The detection, monitoring, and remediation components are decoupled and extensible. Furthermore, deliberate rate limiting and circuit-breaking mechanisms are in place to provide a systematic approach to node health management.

Figure 12: PinCompute Automatic Remediation Architecture
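
The following Python sketch shows one way signals, actions, rate limiting, and circuit breaking can fit together. It is an illustration under stated assumptions, not the remediation service itself: the signal names, the `ACTIONS` mapping, and both thresholds are invented for the example.

```python
# Hypothetical mapping from a node health signal to a remediation action.
ACTIONS = {"runtime-unhealthy": "cordon", "disk-failure": "terminate"}


def plan_remediation(signals: dict[str, str], total_nodes: int,
                     max_actions: int,
                     max_unhealthy_fraction: float) -> dict[str, str]:
    """Turn node health signals into a bounded set of remediation actions."""
    if total_nodes <= 0:
        return {}
    # Circuit breaker: widespread failure suggests a systemic problem, so
    # automation backs off instead of cordoning or terminating many nodes.
    if len(signals) / total_nodes > max_unhealthy_fraction:
        return {}
    plan = {}
    for node in sorted(signals):
        action = ACTIONS.get(signals[node])
        if action is None:
            continue  # unknown signal: leave it for a human operator
        if len(plan) >= max_actions:
            break  # rate limit: defer the remaining nodes to the next pass
        plan[node] = action
    return plan
```

Separating the planning step from the execution step keeps detection, interpretation, and action decoupled, which is what makes it straightforward to add new signals or actions later.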

Application Aware Cluster Rotation

The primary function of the PinCompute Upgrade service is to rotate Kubernetes clusters in a secure, fully automated manner while adhering to both PinCompute platform SLOs and user agreements concerning rotation protocol and graceful termination. When processing a cluster rotation, concerns include the order in which different types of nodes are rotated, whether nodes are rotated in parallel or individually, and the specific timing of node rotations. These concerns arise from the diverse nature of the user workloads running on the PinCompute platform. Through the PinCompute Upgrade service, platform operators can explicitly specify how they would like cluster rotations to be conducted. This configuration allows for a carefully managed automatic progression.

Release PinCompute

The PinCompute release pipeline consists of four stages, each of which is an individual federated environment. Changes are deployed stage by stage and verified before being promoted. An end-to-end test framework runs continuously on PinCompute to verify platform correctness. This framework emulates a genuine user and functions as a constant canary for the platform.

Figure 13: PinCompute Release Procedure

Machine Image (AMI) Management

PinCompute offers a finite set of node types, chosen to balance user needs for hardware families, manageability, and cost-effectiveness. The AMIs responsible for bootstrapping these nodes fall into three categories: general-purpose AMIs, machine learning focused AMIs, and a customizable AMI. AMIs inherit from a parent AMI and configuration, which simplifies their management considerably. Each AMI is tagged according to type and version, and the Upgrade service is used to initiate automatic deployments.

Operation and User Facing Tools

In PinCompute, we provide a set of tools for platform users and administrators to easily operate the platform and the workloads running on it. We built a live-debugging system that provides end users with UI-based container shells to debug inside their Pods, and that streams console logs and file-based logs so they can follow the progress of their running applications. This tool leverages proprietary node level APIs to decouple user debugging from critical control paths such as the Kubernetes API and Kubelet, which ensures failure isolation and scalability. Self-service project management, along with step-by-step tutorials, has also reduced the overhead for users to onboard new projects or adjust properties of existing projects, such as resource quota. PinCompute’s cluster management system provides an interactive mechanism for editing cluster attributes, which makes it easy to iterate on new hardware or adjust capacity settings. These easy-to-use tool chains ensure efficient and scalable operations, and over time they have greatly improved the user experience of the platform.

Scalability and SLOs

PinCompute is designed to support compute requirements at Pinterest scale. Scalability is a complex goal to achieve, and for us, each of PinCompute’s Kubernetes clusters is optimized towards a sweet spot of 3,000 nodes, 120k pods, and 1,000 mutating pod operations per minute, with a 25-second P99 end-to-end workload launch latency. These scaling targets are defined by the requirements of most applications at Pinterest, and are the result of balancing multiple factors including cluster size, workload agility, operability, blast radius, and efficiency. These targets make each Kubernetes cluster a solid building block for overall compute, and PinCompute’s architecture can scale horizontally by adding more member clusters to keep up with the continuous growth of the PinCompute footprint.
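
The per-cluster targets translate into simple capacity arithmetic. The Python sketch below uses the node and pod targets quoted above; the `member_clusters_needed` helper and the fleet size in the example call are hypothetical.

```python
import math

NODES_PER_CLUSTER = 3000     # per-cluster node target quoted in the text
PODS_PER_CLUSTER = 120_000   # per-cluster pod target quoted in the text


def member_clusters_needed(nodes: int, pods: int) -> int:
    """Smallest cluster count that keeps every cluster within both targets."""
    by_nodes = math.ceil(nodes / NODES_PER_CLUSTER)
    by_pods = math.ceil(pods / PODS_PER_CLUSTER)
    return max(by_nodes, by_pods, 1)


# Average pod density implied by the targets: 120,000 / 3,000 pods per node.
print(PODS_PER_CLUSTER // NODES_PER_CLUSTER)

# A hypothetical fleet of 10,000 nodes running 300,000 pods.
print(member_clusters_needed(nodes=10_000, pods=300_000))
```

In this hypothetical case the node target is the binding constraint, so growth is absorbed by adding member clusters rather than by stretching any single cluster past its sweet spot.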

PinCompute defines its SLOs in two forms: API availability and platform responsiveness. PinCompute ensures 99.9% availability of its critical workload orchestration APIs. PinCompute also offers an SLO on control plane reconcile latency, which focuses on how long the system takes to act. This latency varies from seconds to tens of seconds, depending on workload complexity and the corresponding business requirements. For reserved tier quality of service, PinCompute provides an SLO on end-to-end workload launch speed, which covers not only how quickly the platform takes action but also how quickly those actions take effect. These SLOs are important signals of platform-level performance and availability, and they also set a high bar for platform developers to iterate on platform capabilities with high quality.
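
To put the availability figure in perspective, an availability target can be converted into a downtime budget. The Python helper below is a generic calculation; the 30-day window is an assumption, since the article does not state the measurement window for the SLO.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of unavailability allowed per window at a given availability."""
    total_minutes = days * 24 * 60
    return round(total_minutes * (100.0 - availability_pct) / 100.0, 2)


# 99.9% over an assumed 30-day window leaves roughly 43 minutes of budget.
print(downtime_budget_minutes(99.9))
```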

Learnings and Future Work

Over the past few years, we have matured the platform both in its architecture and in the set of capabilities Pinterest requires. Introducing compute as a Platform as a Service (PaaS) has been seen as the biggest win for Pinterest developers. Internal research showed that more than 90% of use cases, representing more than 60% of the infrastructure footprint, can benefit from leveraging a PaaS to iterate on their software. For platform users, PaaS abstracts away the undifferentiated heavy lifting of owning and managing infrastructure and Kubernetes, and enables them to focus on the unique aspects of their applications. For platform operators, PaaS enables holistic infrastructure management through standardization, which provides opportunities to enhance efficiency and reduce the cost of keeping infrastructure up to date. PinCompute embraces “API First,” which defines a crisp support contract and makes the platform programmable and extensible. Moreover, a solid definition of “tenancy” in the platform establishes clear boundaries across use cases and their interactions with infrastructure capabilities, which is critical to the success of a multi-tenant platform. Last but not least, by doubling down on automation, we were able to improve support response time and reduce the team’s KTLO (keep-the-lights-on) and on-call overhead.

There are a lot of exciting opportunities as PinCompute keeps growing its footprint at Pinterest. Resource management and efficiency is a big area we are working on; projects such as multi-tenant cost attribution, efficient bin packing, autoscaling, and capacity forecasting are critical to supporting an efficient and accountable infrastructure at Pinterest. Orchestrating stateful applications is both technically challenging and important to Pinterest’s business, and while PinPod and PinApp provide solid foundations for orchestrating applications, we are actively working with stakeholders of stateful systems on shareable solutions to improve operational efficiency and reduce maintenance costs. We also recognize the importance of use cases being able to access the Kubernetes API. As Kubernetes and its communities are actively evolving, there is a big benefit in following industry trends and adopting industry-standard practices, and therefore we are actively working with partner teams and vendors to enable more Pinterest developers to do so. Meanwhile, we are working on contributing back to the community, as we believe a widely trusted community is the best platform to build a shared understanding, contribute features and improvements, and share and absorb wins and learnings in production for the good of all. Finally, we are evaluating opportunities to leverage managed services to further offload infrastructure management to our cloud provider.

Acknowledgements

It has been a multi-year effort to evolve PinCompute to enable multiple use cases across Pinterest. We’d like to acknowledge the following teams and individuals who closely worked with us in building, iterating, productizing, and improving PinCompute:

ML Platform: Karthik Anantha Padmanabhan, Chia-Wei Chen
Workflow Platform: Evan Li, Dinghang Yu
Online Systems: Ping Jin, Zhihuang Chen
App Foundation: Yen-Wei Liu, Alice Yang
Ads Delivery Infra: Huiqing Zhou
Traffic Engineering: Scott Beardsley, James Fish, Tian Zhao
Observability: Nomy Abbas, Brian Overstreet, Wei Zhu, Kayla Lin
Continuous Delivery Platform: Naga Bharath Kumar Mayakuntla, Trent Robbins, Mitch Goodman
Platform Security: Cedric Staub, Jeremy Krach
TPM — Governance and Platforms: Anthony Suarez, Svetlana Vaz Menezes Pereira

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.