Orchestrating Data Pipelines at Lyft: Comparing Flyte and Airflow
In a data-driven company like Lyft, data is the backbone of many application components. Data analytics gives us the insights to improve existing features and create new ones. Today, Lyft collects and processes about 9 trillion analytical events per month, running around 750K data pipelines and 400K Spark jobs using millions of containers.
With computation jobs running on engines like Spark, Hive, and Trino, plus large amounts of Python code for data processing and ML frameworks, workflow orchestration grows into a complex challenge. Orchestration is the mechanism that ties computation tasks together and executes them as a data pipeline, where the data pipeline usually takes the form of a directed graph.
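The "pipeline as a graph" idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the actual Flyte or Airflow API, and the task names are made up:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "clean_events": {"ingest_events"},
    "aggregate": {"clean_events"},
    "train_model": {"aggregate"},
    "publish_report": {"aggregate"},
}

def run(task_name):
    # A real orchestrator would submit the task to an external compute
    # engine (Spark, Hive, Trino, ...) and wait for it to finish.
    print(f"running {task_name}")

# Execute tasks in an order that respects the graph's edges.
order = list(TopologicalSorter(pipeline).static_order())
for name in order:
    run(name)
```

The orchestrator's value is in everything around this loop: scheduling, retries, passing outputs between tasks, and dispatching the work to external clusters rather than running it in-process.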
Example of a data pipeline
It is important to note that orchestration is not the computation itself. Typically, we orchestrate tasks that are performed on external compute clusters.
Historically, Lyft has used two orchestration engines: Apache Airflow and Flyte. Created and open-sourced by Lyft, Flyte is now a top-level Linux Foundation project.
At Lyft, we are using Airflow and Flyte: engineers may choose the engine that better fits their requirements and use case
Both Flyte and Airflow are essential pieces of the infrastructure at Lyft and have much in common:
- support Python for writing workflows
- run workflows on a scheduled basis or ad-hoc
- provide integrations with compute engines
- work well for batch processing but are not suited for stream processing
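The first shared trait, Python-first workflow authoring, can be illustrated with a minimal decorator-based sketch. The `task` decorator here is a plain-Python stand-in for the decorators both engines provide, and the three tasks are hypothetical:

```python
TASKS = {}

def task(fn):
    # Register a function as a task -- an illustrative stand-in for the
    # @task decorators offered by Flyte (flytekit) and modern Airflow.
    TASKS[fn.__name__] = fn
    return fn

@task
def extract():
    return [1, 2, 3]

@task
def transform(rows):
    return [r * 2 for r in rows]

@task
def load(rows):
    return sum(rows)

def workflow():
    # A workflow composes registered tasks; in a real engine each call
    # would run as a separate containerized step, not in-process.
    return load(transform(extract()))

print(workflow())  # 12
```

In both engines the decorated functions become schedulable units with their own retries, resources, and logs, while the workflow function only describes how they connect.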
We shared our experiences with Airflow and Flyte in previous posts. In this post, we will...