Data Lineage at Slack

Reinventing how the world does work inevitably creates a lot of data. Each year, Slack’s scale has increased and the volume of data ingested and stored has kept pace. To make it possible to understand relationships within our data, we’ve invested heavily in an automated data lineage framework. This framework facilitates producer/consumer coordination and improves risk mitigation, impact analysis, and the execution of data programs here at Slack.

Why invest in data lineage?

Data lineage refers to the ability to trace how and where data sources are used. In the first years of a company, data lineage is easy to fully understand: a company with only a handful of data pipelines doesn’t need to worry much about data lineage, since its engineers can count the tables on their fingers. However, as datasets become more complex and the number of contributors grows, it becomes more and more difficult to understand the relationships between different data sources.

Having a solid understanding of data lineage makes operational maintenance much easier. A common request of data engineering teams is to backfill a table after a bug is fixed in its source data. For tables that are never consumed by other data pipelines, this is a trivial task: just rerun all the affected dates for the requested table. For tables that are consumed by other jobs, the complexity of the backfill balloons: any downstream table might need to be rerun, depending on whether it consumed the columns impacted by the bug fix. Fortunately, massive backfills don’t need to run every day, so good data lineage was historically a “nice to have” feature at many companies. That was true, at least, until the advent of the General Data Protection Regulation (GDPR).
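
To make the backfill problem concrete, here is a minimal sketch of how a table-level lineage graph could be used to work out which tables to rerun, and in what order. It is an illustration only, not Slack's framework: the table names, the `DOWNSTREAM` mapping, and the `tables_to_backfill` function are all hypothetical, and the graph is assumed to be acyclic.

```python
from collections import deque

# Hypothetical lineage graph: each table maps to the tables that read from it.
DOWNSTREAM = {
    "raw_events": ["daily_sessions", "daily_messages"],
    "daily_sessions": ["weekly_active_users"],
    "daily_messages": ["weekly_active_users"],
    "weekly_active_users": [],
}


def tables_to_backfill(graph, fixed_table):
    """Return fixed_table plus everything downstream of it,
    ordered so that each table comes after all of its affected parents.
    Assumes the lineage graph has no cycles."""
    # Step 1: breadth-first search to collect every affected table.
    affected = set()
    queue = deque([fixed_table])
    while queue:
        table = queue.popleft()
        if table in affected:
            continue
        affected.add(table)
        queue.extend(graph.get(table, []))

    # Step 2: topological sort (Kahn's algorithm) over the affected tables.
    indegree = {table: 0 for table in affected}
    for table in affected:
        for child in graph.get(table, []):
            indegree[child] += 1

    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        table = ready.popleft()
        order.append(table)
        for child in sorted(graph.get(table, [])):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return order


if __name__ == "__main__":
    for name in tables_to_backfill(DOWNSTREAM, "raw_events"):
        print(name)
```

This sketch works at table granularity, so it reruns every downstream table. With column-level lineage, tables that never read the columns touched by the bug fix could be pruned from the affected set before ordering.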
