Streamlining Membership Data Engineering at Netflix with Psyberg
By Abhinaya Shetty, Bharath Mummadisetty
At Netflix, our Membership and Finance Data Engineering team harnesses diverse data related to plans, pricing, membership life cycle, and revenue to fuel analytics, power various dashboards, and make data-informed decisions. Many metrics in Netflix’s financial reports are produced and reconciled through our team’s efforts! Given our role on this critical path, accuracy is paramount, and managing data, especially when it arrives late, can present a substantial challenge.
In this three-part blog post series, we introduce you to Psyberg, our incremental data processing framework designed to tackle such challenges! We’ll discuss batch data processing, the limitations we faced, and how Psyberg emerged as a solution. Furthermore, we’ll delve into the inner workings of Psyberg, its unique features, and how it integrates into our data pipelining workflows. By the end of this series, we hope you will gain an understanding of how Psyberg transformed our data processing, making our pipelines more efficient, accurate, and timely. Let’s dive in!
The Challenge: Incremental Data Processing with Late Arriving Data
Our team’s data processing model mainly comprises batch pipelines, which run at intervals ranging from hourly to multiple times a day (also known as intraday) to daily. We expect complete and accurate data at the end of each run. To meet such expectations, we generally run our pipelines with a lag of a few hours to leave room for late-arriving data.
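The lagged-window idea above can be sketched as follows. This is a minimal illustration, not Netflix’s actual implementation: the 3-hour lag, the `processing_window` helper, and the hourly interval are all illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# Illustrative value: leave a few hours of room for late-arriving data.
PROCESSING_LAG = timedelta(hours=3)

def processing_window(run_time: datetime, interval: timedelta):
    """Return the (start, end) window a batch run should process.

    The window ends PROCESSING_LAG before the run time, so events that
    arrive up to PROCESSING_LAG after their event time are still
    captured by the run that covers them.
    """
    end = run_time - PROCESSING_LAG
    start = end - interval
    return start, end

# An hourly run kicked off at 12:00 UTC processes the 08:00-09:00 hour.
run = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
start, end = processing_window(run, timedelta(hours=1))
print(start.isoformat(), end.isoformat())
```

The trade-off is built into the lag constant: a larger lag tolerates later data but delays every run's results, which is exactly the tension the rest of this series addresses.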