Apache Flink® on Kubernetes

Airbnb’s Use of A New Flink platform evolved from Apache Hadoop® Yarn

Introduction

At Airbnb, Apache Flink was introduced in 2018 as a supplementary solution for stream processing. It ran alongside Apache Spark™ Streaming for several years before transitioning to become the primary stream processing platform. In this blog post, we will delve into the evolution of Flink architecture at Airbnb and compare our prior Hadoop Yarn platform with the current Kubernetes-based architecture. Additionally, we will discuss the efforts undertaken throughout the migration process and explore the challenges that arose during this journey. In the end we will summarize the impact, learnings along the way and future plans.

Architecture Evolution

The evolution of Airbnb’s streaming processing architecture based on Apache Flink can be categorized into three distinct phases:

Phase One: Flink jobs operated on Hadoop Yarn with Apache Airflow serving as the job scheduler.

Around 2018, several teams at Airbnb adopted Flink as their streaming processing engine, mainly due to its superior low-latency capabilities compared to Spark Streaming. During this period, Flink jobs were running on Hadoop Yarn, and Airflow was employed as the workflow manager for task scheduling and dependency management.

The selection of Airflow as the workflow manager was largely influenced by its widespread use in addressing various job scheduling needs, as there were no other user-friendly open-source alternatives readily available at that time. Each team was responsible for handling their Airflow Directed Acyclic Graphs (DAGs), job source code, and the requisite dependency JARs. Typically, Flink JAR files were locally built before deployment to Amazon S3.

The architecture catered to our requirements during that period with a limited range of use cases.

From 2019 onwards, Apache Flink gained significant traction at Airbnb, replacing Spark Streaming as the primary stream processing platform. With the scaling in usage of Flink we encountered various challenges and limitations in this architecture. To begin with, Airflow’s batch-oriented design, relying on polling intervals, did not match Airbnb’s needs, and we experienced significant delays in job start and failure recovery, often causing SLA violations for low-latency use cases. Airflow also caused a singleton issue as duplicate job submissions occasionally occur due to race conditions among Airflow workers and user operations not following expected patterns. Besides, Airflow’s Directed Acyclic Graph (DAG) structure is complex and does not function well with some of Airbnb’s streaming use cases. We also encountered engineering context mismatch in this architecture: product engineers might find themselves unfamiliar with Apache Airflow and Hadoop, resulting in a steep learning curve when setting up new Apache Flink jobs.

To tackle the above technical and operational challenges, we started to explore new possibilities. Our initial step involved replacing Airflow with a customized lightweight streaming job scheduler, marking the inception of Phase Two.

Phase Two: Flink jobs operated on Hadoop Yarn, with a lightweight streaming job scheduler.

At a high level, Airflow was replaced by a lightweight streaming job scheduler operating on Kubernetes. The job scheduler contains a master node and a pool of worker nodes:

The master node is responsible for managing the metadata of all Flink jobs and ensuring the proper life cycle of each worker node. This includes tasks such as parsing user-provided job configurations, synchronizing metadata and job statuses with Apache Zookeeper™, and ensuring that worker nodes consistently maintain their expected states.

A worker node is responsible for handling the dependencies and life cycle of a single Flink job. Workers package the necessary dependencies, submit the Flink job to Hadoop Yarn, continuously monitor its status, and in the event of a failure, it triggers an immediate restart.

The Phase 2 design resulted in faster turnaround time and reduced downtime during job restarts. It also resolved single point of failure issues with Zookeeper.

As usage of Flink grew, we encountered new challenges in Phase Two:

Lack of CI/CD: Flink developers had to devise their own version control strategies.
Absence of native secrets management: There is no vanilla secrets management on Hadoop Yarn.
Limited resource and dependency isolation: Each supported Flink version had to be manually preinstalled on the Yarn cluster. While Yarn’s resource queues could provide some level of resource isolation, job-level isolation was absent.
Service Discovery complexity: As more use cases were onboarded, each potentially requiring access to various internal Airbnb services, configuring service access on Yarn proved to be cumbersome. It forced a binary choice between enabling service access for the entire cluster or none at all.
Monitoring and debugging challenges: Managing and maintaining the logging pipeline and SSH access became non-trivial tasks on a multi-tenant Yarn cluster.
Ongoing complexity and dependencies: Although the Flink job scheduler was lightweight compared to Airflow, it introduced additional complexities.

Phase Three (current state): Flink jobs run on Kubernetes, and the job scheduler is eliminated.

Deploying Flink on Kubernetes allows direct Flink deployment on a running Kubernetes cluster. With this integration we can explore enabling efficient autoscaling and the Kubernetes operator to simplify the management of Flink jobs and clusters.

Flink on Kubernetes offers several advantages over Hadoop Yarn addressing the above challenges:

Developer experience: Standardized by integrating with the existing CI/CD systems.
Secrets Management: With Flink on Kubernetes, each Flink job can securely store its own secrets within the pods. This provides a more secure way to manage sensitive information.
Isolated Environment: Jobs running on Flink on Kubernetes benefit from isolation at both the resource and dependency levels. Each job can run on its dedicated Flink version if supported by its image, allowing for better management of dependencies.
Enhanced Monitoring: Integration with Airbnb’s pre-defined logging and metric sidecars on Kubernetes simplifies setup and improves monitoring. This enables detailed insights into individual pods and rate limiting for logging per pod, making it easier to track and troubleshoot issues.
Service Discovery: Flink jobs now adhere to Airbnb’s standardized approach for service discovery, using the cluster mesh. This ensures consistent and reliable communication between services.
Simplified SSH access: Users with the appropriate permissions can now SSH into the Flink pod without the need for an SSH tunnel. This provides greater flexibility and control over SSH permissions per job.

Additionally, we’ve observed an increasing level of Kubernetes support and adoption within the Flink community, which increased our confidence in running Flink on Kubernetes.

It’s worth mentioning that Kubernetes brings its own risks and limitations. For instance, a single Flink task manager failover can lead to the pause of the entire job process. This can pose issues in scenarios with frequent node rotations within Kubernetes and large jobs deployed with hundreds of task managers. For context, node rotation on Kubernetes is performed to ensure the operability and stability of the cluster. It involves replacing existing nodes with new ones, typically with updated configurations or to perform maintenance tasks, with the goals of applying host configuration changes, maintaining node balance and enhancing operational efficiency. In comparison, node rotations on Yarn occur less frequently, so the impact on job availability is less significant. We will explore how we are mitigating these challenges in the Future Work section.

Components Deep Dive

Below is an overview of our current architecture:

To provide a better understanding of the system, below is a deep dive of the five primary components, as well as how users interact with them when setting up a new Flink job:

Job configurations: This serves as an abstraction layer over Kubernetes and CI/CD components, providing Flink users with a simplified interface for creating Flink application templates. It shields users from the complexities of the underlying Kubernetes infrastructure. Flink users define the core specifications of their Flink job via a configuration file. This includes critical information like the entrypoint class name, job parallelism, and the necessary ingress services and sinks.

Image management: This component involves the pre-construction of Flink base images, which are bundled with essential dependencies required to access Airbnb resources. These images are stored in Amazon Elastic Container Registry and can be readily deployed with user Jars or further customized to meet specific user needs.

CI/CD: By introducing a few customizations to support Flink’s stateful deployment, we’ve integrated Flink with our existing CI/CD system, providing a standardized version control and continuous delivery experience. Flink jobs are deployed within Kubernetes, each residing in its distinct namespace to ensure isolation and effective administration.

Flink portal: an API service that offers essential features for managing the states of Flink jobs. These features include stopping a Flink job with a savepoint and querying completed checkpoints on Amazon S3. Additionally, it provides a self-service UI portal, enabling users to monitor and check the status of their jobs. Users also gain access to critical job state management functionalities, empowering them to either initiate the job from a bootstrapped savepoint or resume it from a previous checkpoint.

Flink job runtime: Each Flink job is deployed as an independent application cluster on Kubernetes. To ensure fault tolerance and state storage, Zookeeper, ETCD, and Amazon S3 are utilized. Additionally, pre-configured sidecar containers accompany the Flink containers to provide support for critical functions such as logging, metrics, DNS, and more. A service mesh is employed to facilitate communication between Flink jobs and other microservices.

Impact

Onboarding Flink jobs is faster, where our developers noted that it takes hours instead of days, and developers can focus more on their application logic.

Improvement in Flink Job Availability and Latency

The architecture of Flink on Kubernetes improves job availability and scheduling latency by eliminating certain components of the Flink client and job scheduler found in Flink on Yarn.

Cost Savings in Infrastructure

The streamlining of Flink infrastructure complexity and the removal of certain components, such as the job scheduler, have resulted in cost savings in our infrastructure. Additionally, by running Flink jobs on a shared Kubernetes cluster at Airbnb, we could potentially improve the overall cost efficiency of our company’s infrastructure.

Future Work

In the Flink world, node rotations in Kubernetes can cause job restarts and result in downtime. While Flink itself can recover from job restarts without data loss, the potential downtime and availability impact may be unfavorable for highly latency-sensitive applications. To address this, there are a few approaches we are evaluating.

Reducing the number of node rotations to minimize job restarts.
Faster job recovery.

Enable Job Autoscaling

With the introduction of Reactive Mode in Flink 1.13, users can dynamically adjust the parallelism of their jobs without the need for a job restart. This job auto scaling feature can enhance job stability and cost efficiency. In the future we could enable autoscaling for Flink Kubernetes workloads by leveraging system metrics (such as CPU usage) and Flink metrics (such as backpressure), to determine the appropriate parallelism.

Flink Kubernetes Operator

The Flink Kubernetes Operator utilizes Custom Resources and functions as a controller to manage the entire production lifecycle of Flink applications. By leveraging the operator, we can streamline the operation and deployment processes for Flink jobs. It provides better control over deployment and lifecycle of jobs, and an out of box solution for autoscaling and auto tuning.

Conclusion

To summarize, the migration of Airbnb’s streaming processing architecture based on Apache Flink from Hadoop Yarn to Kubernetes has been a significant milestone in enhancing our streaming data processing capabilities. This transition has resulted in a more streamlined and user-friendly experience for Flink developers. By overcoming challenges that were complex to address on Yarn, we have laid the foundation for more efficient and effective streaming data processing.

As we look ahead, we are committed to further refining our approach and resolving any remaining challenges. We are enthusiastic about the ongoing growth and potential of Apache Flink within our company, and we anticipate continued innovation and improvement in the future.

If this kind of work sounds appealing to you, check out our open roles — we’re hiring!

Appreciations

The Flink on Kubernetes platform would not have been possible without cross-functional and cross-org collaborators as well as leadership support. They include, but are not limited to: Jingwei Lu, Long Zhang, Daniel Low, Weibo He, Zack Loebel-Begelman, Justin Cunningham, Adam Kocoloski, Liyin Tang and Nathan Towery.

Special thanks to the broader Airbnb data community members who provided input or aid to the implementation team throughout the design, development, and launch phases.

We also want to thank Wei Hou and Xu Zhang for their support in authoring this post during their time at Airbnb.

****************

Apache Spark™, Apache Airflow™, and Apache ZooKeeper™ are trademarks of The Apache Software Foundation.

Apache Flink® and Apache Hadoop® are registered trademarks of The Apache Software Foundation.

Kubernetes® is a registered trademark of The Linux Foundation.

Amazon S3 and AWS are trademarks of Amazon.com, Inc. or its affiliates.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.