Where's My Data: A Unique Encounter with Flink Streaming's Kinesis Connector
For years now, Lyft has been not only a proponent of Apache Flink but also a contributor to it. Lyft’s pipelines have evolved drastically over the years, yet time and time again we run into unique cases that stretch Flink to its breaking point. This is one of those times.
Context
While Lyft runs many streaming applications, the one in question is a persistence job. Simply put, it streams data from Kinesis, performs some serialization and transformation, and writes to S3 every few minutes.
Flink pipeline for persisting data from Kinesis to S3.
In this case, it persists the large majority of events generated at Lyft, at an average rate of 80 gigabytes per minute and a parallelism of 1800, which makes it one of Lyft’s largest streaming jobs.
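To put those numbers in perspective, here is a rough back-of-envelope sketch of what they imply for a single parallel subtask. It assumes decimal gigabytes and a perfectly even spread of traffic across all 1800 subtasks. Neither assumption comes from the job itself, and the function name is purely illustrative.

```python
# Back-of-envelope only: assumes decimal gigabytes and a perfectly even
# spread of traffic across subtasks, neither of which is stated in the text.
GB = 1_000_000_000


def per_subtask_rate(gb_per_minute: float, parallelism: int) -> float:
    """Return the average bytes per second handled by one parallel subtask."""
    total_bytes_per_second = gb_per_minute * GB / 60
    return total_bytes_per_second / parallelism


rate = per_subtask_rate(80, 1800)
print(f"{rate / 1_000_000:.2f} MB/s per subtask")  # prints "0.74 MB/s per subtask"
```

Under those assumptions, each subtask averages well under 1 MB/s, so the scale of the job comes from its width rather than from any single hot subtask. Real traffic is unlikely to be this uniform.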
Chapter 1: The Outage
Let’s start at the end, shall we?
Data Engineer: “Alert! My reports aren’t being generated! The upstream data is not available to generate them on!”
Platform Engineer: “I’m on it! Looks like our streaming application to persist data is up and running, but I hardly see any data being written either!”
Like any good engineer would, we pulled out our runbooks and carefully performed the well-detailed steps:
Platform Engineer: “Let me roll back the seemingly innocuous change we just deployed.”
Platform Engineer: “No luck.”
Platform Engineer: “Ok, let me try turning it off and on again.”
Platform Engineer: “No luck.”
Platform Engineer: “Ok, let me try performing a hard reset and we’ll backfill later.”