Data Quality Score: The Next Chapter of Data Quality at Airbnb
By: Clark Wright
Introduction
These days, as the volume of data collected by companies grows exponentially, we’re all realizing that more data is not always better. In fact, more data, especially if you can’t rely on its quality, can hinder a company by slowing down decision-making or causing poor decisions.
With 1.4 billion cumulative guest arrivals as of year-end 2022, Airbnb’s growth pushed us to an inflection point where diminishing data quality began to hinder our data practitioners. Weekly metric reports were difficult to land on time. Seemingly basic metrics like “Active Listings” relied on a web of upstream dependencies. Conducting meaningful data work required significant institutional knowledge to overcome hidden caveats in our data.
To meet this challenge, we introduced the “Midas” process to certify our data. Starting in 2020, the Midas process, along with the work to re-architect our most critical data models, has brought a dramatic increase in data quality and timeliness to Airbnb’s most critical data. However, achieving the full data quality criteria required by Midas demands significant cross-functional investment to design, develop, validate, and maintain the necessary data assets and documentation.
While this made sense for our most critical data, pursuing such rigorous standards at scale presented challenges. We were approaching a point of diminishing returns on our data quality investments. We had certified our most critical assets, restoring their trustworthiness. However, for all of our uncertified data, which remained the majority of our offline data, we lacked visibility into its quality and didn’t have c...