Dynamic Kubernetes Cluster Scaling at Airbnb
Authors: Evan Sheng, David Morrison
Introduction
An important part of running Airbnb’s infrastructure is ensuring our cloud spending automatically scales with demand, both up and down. Our traffic fluctuates heavily every day, and our cloud footprint should scale dynamically to support this.
To support this scaling, Airbnb utilizes Kubernetes, an open source container orchestration system. We also utilize OneTouch, a service configuration interface built on top of Kubernetes, which is described in more detail in a previous post.
In this post, we’ll talk about how we dynamically size our clusters using the Kubernetes Cluster Autoscaler, and highlight functionality we’ve contributed to the sig-autoscaling community. These improvements add customizability and flexibility to meet Airbnb’s unique business requirements.
Kubernetes Clusters at Airbnb
Over the past few years, Airbnb has shifted almost all online services from manually orchestrated EC2 instances to Kubernetes. Today, we run thousands of nodes across nearly a hundred clusters to accommodate these workloads. However, this change didn’t happen overnight. During this migration, our underlying Kubernetes cluster setup evolved and became more sophisticated as more workloads and traffic shifted to our new technology stack. This evolution can be split into three stages.
Stage 1: Homogenous Clusters, Manual Scaling
Stage 2: Multiple Cluster Types, Independently Autoscaled
Stage 3: Heterogeneous Clusters, Autoscaled
Stage 1: Homogenous Clusters, Manual Scaling
Before using Kubernetes, each instance of a service ran on its own machine and was manually scaled to have enough capacity to handle traffic increases. Capacity management varied per team, and capacity was rarely un-provisioned once load dropped.
Our initial Kubernetes cluster setup was relatively basic. We had a handful of clusters, each with a single underlying node type and configuration, which ran only stateless online services. As some of these services began shifting to Kubernetes, we started running containerized services in a multi-tenant environment (many pods on a node). This aggregation led to fewer wasted resources, and consolidated capacity management for these services into a single control point at the Kubernetes control plane. At this stage, we scaled our clusters manually, but this was still a marked improvement over the previous situation.
Figure 1: EC2 Nodes vs Kubernetes Nodes
Stage 2: Multiple Cluster Types, Independently Autoscaled
The second stage of our cluster configuration began when more diverse workload types, each with different requirements, sought to run on Kubernetes. To accommodate their needs, we created a cluster type abstraction. A “cluster type” defines the underlying configuration for a cluster, meaning that all clusters of a cluster type are identical, from node type to different cluster component settings.
More cluster types led to more clusters, and our initial strategy of manually managing the capacity of each cluster quickly fell apart. To remedy this, we added the Kubernetes Cluster Autoscaler to each of our clusters. This component automatically adjusts cluster size based on pod requests: if a cluster’s capacity is exhausted and a pending pod’s request could be filled by adding a new node, Cluster Autoscaler launches one. Similarly, if nodes in a cluster have been underutilized for an extended period of time, Cluster Autoscaler removes them from the cluster. Adding this component worked beautifully for our setup, saving us roughly 5% of our total cloud spend, along with the operational overhead of manually scaling clusters.
Figure 2: Kubernetes Cluster Types
Stage 3: Heterogeneous Clusters, Autoscaled
When nearly all online compute at Airbnb shifted to Kubernetes, the number of cluster types had grown to over 30, and the number of clusters to 100+. This expansion made Kubernetes cluster management tedious. For example, cluster upgrades had to be individually tested on each of our numerous cluster types.
In this third phase, we aimed to consolidate our cluster types by creating “heterogeneous” clusters that could accommodate many diverse workloads with a single Kubernetes control plane. First, this greatly reduces cluster management overhead, as having fewer, more general purpose clusters reduces the number of configurations to test. Second, with the majority of Airbnb now running on our Kubernetes clusters, efficiency in each cluster provides a big lever to reduce cost. Consolidating cluster types allows us to run varied workloads in each cluster. This aggregation of workload types (some big and some small) can lead to better bin packing and efficiency, and thus higher utilization. With this additional workload flexibility, we had more room to implement sophisticated scaling strategies outside of the default Cluster Autoscaler expansion logic. Specifically, we aimed to implement scaling logic tied to Airbnb-specific business logic.
Figure 3: A heterogeneous Kubernetes cluster
As we scaled and consolidated clusters so they were heterogeneous (multiple instance types per cluster), we began to implement specific business logic during expansion and realized some changes to the autoscaling behavior were necessary. The next section will describe some of the changes we’ve made to Cluster Autoscaler to make it more flexible.
Cluster Autoscaler Improvements
The most significant improvement we made to Cluster Autoscaler was a new method for determining which node groups to scale. If there are any Pending (unschedulable) pods, Cluster Autoscaler attempts to scale the cluster to accommodate them. Internally, Cluster Autoscaler maintains a list of node groups, which map to different candidates for scaling, and it filters out node groups that do not satisfy pod scheduling requirements by running a scheduling simulation against the current set of Pending pods. Any node groups that satisfy all pod requirements are passed to a component called the Expander.
Figure 4: Cluster Autoscaler and Expander
The Expander is responsible for further filtering the node groups based on operational requirements. Cluster Autoscaler has a number of different built-in Expander options, each with different logic. For example, the default is the random expander, which selects from the available options uniformly at random. Another option, and the one that Airbnb has historically used, is the priority expander, which chooses which node group to expand based on a user-specified tiered priority list.
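For illustration, the contract an Expander fulfills inside Cluster Autoscaler looks roughly like the sketch below. The type and field names are paraphrased from the upstream expander package rather than copied exactly, and the real interface also receives scheduler NodeInfo for each candidate node group.

// A simplified sketch of the Expander contract inside Cluster Autoscaler.
// Names are paraphrased, not the exact upstream definitions.
package expander

import apiv1 "k8s.io/api/core/v1"

// Option is one viable scale-up choice: a node group that could fit the
// pending pods, plus how many nodes it would need to add.
type Option struct {
    NodeGroupID string       // which node group to grow
    NodeCount   int          // how many nodes to add
    Pods        []*apiv1.Pod // pending pods this option would schedule
    Debug       string       // human-readable description for logging
}

// Strategy selects the option(s) to actually scale up from the candidates
// that passed the scheduling simulation.
type Strategy interface {
    BestOptions(options []Option) []Option
}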
As we moved toward our heterogeneous cluster logic, we found that the default expanders were not sophisticated enough to satisfy our more complex business requirements around cost and instance type selection.
As a contrived example, say we want to implement a weighted priority expander. Currently, the priority expander only lets users specify distinct tiers of node groups, meaning it will always expand tiers deterministically and in order. If there are multiple node groups in a tier, it will break ties randomly. A weighted priority strategy of placing two node groups in the same tier but expanding one 80% of the time and the other 20% of the time is not achievable with the default setup.
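To make the gap concrete, the selection behavior we wanted looks roughly like the sketch below. The node group names and the 80/20 split are purely illustrative, and this is only the core weighted pick, not a full Expander.

// A standalone sketch of weighted selection between node groups in the same
// priority tier; node group names and the 80/20 split are illustrative only.
package main

import (
    "fmt"
    "math/rand"
)

// weightedPick returns one key from weights, chosen proportionally to its weight.
func weightedPick(weights map[string]int, r *rand.Rand) string {
    total := 0
    for _, w := range weights {
        total += w
    }
    n := r.Intn(total)
    for group, w := range weights {
        if n < w {
            return group
        }
        n -= w
    }
    return "" // unreachable for non-empty weights
}

func main() {
    r := rand.New(rand.NewSource(1))
    weights := map[string]int{"asg-a": 80, "asg-b": 20} // two node groups in the same tier
    counts := map[string]int{}
    for i := 0; i < 1000; i++ {
        counts[weightedPick(weights, r)]++
    }
    fmt.Println(counts) // roughly an 800/200 split
}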
Outside of the limitations of the current supported expanders, there were a few operational concerns:
- Cluster Autoscaler’s release pipeline is rigorous, and changes take time to review before being merged upstream. However, our business logic and desired scaling strategy are continuously changing; an expander developed to meet our needs today may not meet our needs in the future
- Our business logic is specific to Airbnb and not necessarily relevant to other users. Any changes we implement specifically for our logic would not be useful to contribute back upstream
From these, we came up with a set of requirements for a new expander type in Cluster Autoscaler:
- We wanted something that was both extensible and usable by others. Others may run into similar limitations with the default Expanders at scale, and we would like to provide a generalized solution and contribute functionality back upstream
- Our solution should be deployable out of band with Cluster Autoscaler, and allow us to respond more rapidly to changing business needs
- Our solution should fit into the Kubernetes Cluster Autoscaler ecosystem, so that we do not have to maintain a fork of Cluster Autoscaler indefinitely
With these requirements, we came up with a design that breaks the expansion responsibility out of the Cluster Autoscaler core logic. We designed a pluggable “custom Expander,” which is implemented as a gRPC client (similar to the custom cloud provider). This custom expander is broken into two components.
The first component is a gRPC client built into Cluster Autoscaler. This Expander conforms to the same interface as the other Expanders in Cluster Autoscaler, and is responsible for transforming information about valid node groups from Cluster Autoscaler into the defined protobuf schema (shown below), and for transforming the output it receives from the gRPC server back into a final list of options for Cluster Autoscaler to scale up.
service Expander {
  rpc BestOptions (BestOptionsRequest) returns (BestOptionsResponse)
}

message BestOptionsRequest {
  repeated Option options;
  map<string, k8s.io.api.core.v1.Node> nodeInfoMap;
}

message BestOptionsResponse {
  repeated Option options;
}

message Option {
  // ID of node to uniquely identify the nodeGroup
  string nodeGroupId;
  int32 nodeCount;
  string debug;
  repeated k8s.io.api.core.v1.Pod pod;
}
The second component is the gRPC server, which is left up to the user to write. This server is intended to be run as a separate application or service, which can run arbitrarily complex expansion logic when selecting which node group to scale up, with the given information passed from the client. Currently, the protobuf messages passed over gRPC are slightly transformed versions of what is passed to the Expander in Cluster Autoscaler.
Returning to our earlier example, a weighted random priority expander can be implemented fairly easily by having the server read a priority tier list and weighted percentages from a configmap, and choose accordingly.
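A minimal sketch of such a server is shown below, assuming Go stubs generated from the schema above. The import path of the generated package, the node group IDs, and the tier/weight values are all illustrative, not the actual upstream artifacts or our production configuration; in practice the tiers and weights would be loaded from a configmap and refreshed on change rather than hard-coded.

// A minimal, hypothetical gRPC Expander server implementing a weighted
// priority strategy. It assumes Go stubs generated from the protobuf schema
// above; the import path and hard-coded configuration are illustrative only.
package main

import (
    "context"
    "math"
    "math/rand"
    "net"

    "google.golang.org/grpc"

    pb "example.com/expander/protos" // hypothetical path to the generated stubs
)

// tierConfig describes where a node group sits in the priority ladder and how
// often it should be picked relative to other groups in the same tier.
type tierConfig struct {
    Tier   int
    Weight int
}

type server struct {
    pb.UnimplementedExpanderServer
    config map[string]tierConfig
}

// BestOptions keeps only the options in the best (lowest-numbered) tier, then
// picks one of them at random, proportionally to its configured weight.
func (s *server) BestOptions(ctx context.Context, req *pb.BestOptionsRequest) (*pb.BestOptionsResponse, error) {
    bestTier := math.MaxInt
    var candidates []*pb.Option
    for _, opt := range req.Options {
        cfg, ok := s.config[opt.NodeGroupId]
        if !ok {
            continue // unknown node groups are never selected
        }
        switch {
        case cfg.Tier < bestTier:
            bestTier, candidates = cfg.Tier, []*pb.Option{opt}
        case cfg.Tier == bestTier:
            candidates = append(candidates, opt)
        }
    }
    if len(candidates) == 0 {
        return &pb.BestOptionsResponse{}, nil // let Cluster Autoscaler fall back
    }
    total := 0
    for _, opt := range candidates {
        total += s.config[opt.NodeGroupId].Weight
    }
    if total <= 0 {
        return &pb.BestOptionsResponse{Options: candidates[:1]}, nil
    }
    n := rand.Intn(total)
    for _, opt := range candidates {
        if n < s.config[opt.NodeGroupId].Weight {
            return &pb.BestOptionsResponse{Options: []*pb.Option{opt}}, nil
        }
        n -= s.config[opt.NodeGroupId].Weight
    }
    return &pb.BestOptionsResponse{Options: candidates[:1]}, nil
}

func main() {
    lis, err := net.Listen("tcp", ":8080")
    if err != nil {
        panic(err)
    }
    grpcServer := grpc.NewServer()
    pb.RegisterExpanderServer(grpcServer, &server{config: map[string]tierConfig{
        "asg-large-ondemand": {Tier: 0, Weight: 80}, // illustrative node group IDs
        "asg-large-spot":     {Tier: 0, Weight: 20},
        "asg-fallback":       {Tier: 1, Weight: 100},
    }})
    _ = grpcServer.Serve(lis)
}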
Figure 5: Cluster Autoscaler and Custom gRPC Expander
Our implementation includes a failsafe option: we recommend passing multiple expanders as arguments to Cluster Autoscaler, so that if the gRPC server fails, Cluster Autoscaler is still able to expand using a fallback Expander.
Since it runs as a separate application, expansion logic can be developed out of band with Cluster Autoscaler, and since the gRPC server is customizable by the user based on their needs, this solution is extensible and useful to the wider community as a whole.
Internally, Airbnb has been using this new solution to scale all of our clusters without issues since the beginning of 2022. It has allowed us to dynamically choose when to expand certain node groups to meet Airbnb’s business needs, thus achieving our initial goal of developing an extensible custom expander.
Our custom expander was accepted into the upstream Cluster Autoscaler earlier this year, and will be available in the next release (v1.24.0).
Other Autoscaler Improvements
Over the course of our migration to heterogeneous Kubernetes clusters, we identified a number of other bugs and improvements that could be made to Cluster Autoscaler. These are briefly described below:
- Early abort for AWS ASGs with no capacity: Short circuit the loop in which Cluster Autoscaler waits for nodes it has tried to spin up to become ready, by calling an AWS EC2 endpoint to check whether the ASG actually has capacity (a rough sketch of this capacity check follows the list). With this change enabled, users get much more rapid, yet still correct, scaling. Previously, users of a priority ladder would have to wait 15 minutes after each attempted ASG launch before trying an ASG of lower priority.
- Caching launch templates to reduce AWS API calls: Introduce a cache for AWS ASG Launch Templates. This change unlocks using large numbers of ASGs, which was critical for our generalized cluster strategy. Previously, for empty ASGs (no nodes present in the cluster), Cluster Autoscaler would repeatedly call an AWS endpoint to fetch launch templates, resulting in throttling from the AWS API.
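As a rough illustration of the capacity check behind the first improvement (not the actual upstream patch), the idea is to ask the AWS Auto Scaling API whether an ASG is still behind the capacity it was asked for; the ASG name below is a placeholder.

// A rough sketch of detecting an ASG that cannot fulfill its desired capacity,
// using the AWS SDK for Go; this illustrates the idea only, not the upstream fix.
package main

import (
    "fmt"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/autoscaling"
)

// asgBehindDesiredCapacity reports whether the ASG has launched fewer instances
// than its desired capacity, a hint that a scale-up may never complete (for
// example, no capacity for that instance type in the availability zone).
func asgBehindDesiredCapacity(svc *autoscaling.AutoScaling, name string) (bool, error) {
    out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
        AutoScalingGroupNames: aws.StringSlice([]string{name}),
    })
    if err != nil || len(out.AutoScalingGroups) == 0 {
        return false, err
    }
    group := out.AutoScalingGroups[0]
    return int(aws.Int64Value(group.DesiredCapacity)) > len(group.Instances), nil
}

func main() {
    svc := autoscaling.New(session.Must(session.NewSession()))
    behind, err := asgBehindDesiredCapacity(svc, "example-node-group-asg") // placeholder name
    if err != nil {
        panic(err)
    }
    fmt.Println("ASG behind its desired capacity:", behind)
}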
Conclusion
In the last four years, Airbnb has come a long way in our Kubernetes Cluster setup. Having the largest portion of compute at Airbnb on a single platform provided a strong, consolidated lever to improve efficiency, and we are now focused on generalizing our cluster setup (think “cattle, not pets”). By developing and using a more sophisticated expander in Cluster Autoscaler (as well as fixing a number of other minor issues with the Autoscaler), we have been able to achieve our goals of developing our complex, business specific scaling strategy around cost and mixed instance types, while also contributing some useful features back to the community.
For more details on our heterogeneous cluster migration, watch our KubeCon talk. We will also be at KubeCon EU this year, so come talk to us! If you’re interested in working on interesting problems like the ones we’ve described here, we’re hiring! Check out these open roles:
Engineering Manager - Infrastructure
Senior Engineer, Cloud Infrastructure
Software Engineer, Observability
Software Engineer, Developer Infrastructure
Acknowledgements
The evolution of our Kubernetes Cluster setup is the work of many different collaborators. Special thanks to Stephen Chan, Jian Cheung, Ben Hughes, Ramya Krishnan, David Morrison, Sunil Shah, Jon Tai and Long Zhang, as this work would not have been possible without them.