Balancing quality and coverage with our data validation framework

Summary

At Dropbox, we store data about how people use our products and services in a Hadoop-based data lake. Various teams rely on the information in this data lake for all kinds of business purposes (for example, analytics, billing, and developing new features), and our job is to make sure that only good-quality data reaches the lake.

Our data lake is over 55 petabytes in size, and quality is always a big concern when working with data at this scale. The features we build, the decisions we make, and the financial results we report all hinge on our data being accurate. But with so much data to sift through, quality problems can be incredibly hard to find, if we even know they exist in the first place. It's the data engineering equivalent of looking for a black cat in a dark room.
