How Uber Achieves Operational Excellence in the Data Quality Experience
Uber delivers efficient and reliable transportation across the global marketplace, which is powered by hundreds of services, machine learning models, and tens of thousands of datasets. While growing rapidly, we are also committed to maintaining data quality, because it can greatly affect business operations and decisions. Without data quality guarantees, downstream service computation and machine learning model performance quickly degrade, and investigating and backfilling the bad data takes laborious manual effort. In the worst cases, degradations go unnoticed and silently produce inconsistent behavior.
This led us to build a consolidated data quality platform (UDQ) for monitoring, automatically detecting, and handling data quality issues. With the goal of establishing and meeting data quality standards across Uber, we have onboarded over 2,000 critical datasets to this platform and detected around 90% of data quality incidents. In this blog post, we describe how we created data quality standards at Uber and built the integrated workflow to achieve operational excellence.
As a data-driven company, Uber makes every business decision based on large-scale data collected from the marketplace. For example, the surge multiplier for a trip is calculated by real-time machine learning models based on a number of factors: region, weather, events, etc. The two most decisive factors, however, are the demand and supply data in the current area. If supply is higher than demand, the surge multiplier approaches its minimum value (1.0); if demand is higher than supply, the multiplier rises.
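To make the dependence on data quality concrete, here is a minimal sketch of how an incomplete supply feed could distort a surge value. It is only an illustration: `toy_surge_multiplier`, the demand-to-supply ratio, and the `cap` of 3.0 are assumptions made for this example and are not Uber's model, which is a real-time machine learning system.

```python
def toy_surge_multiplier(demand: int, supply: int, cap: float = 3.0) -> float:
    """Hypothetical stand-in for a surge model.

    Returns the demand/supply ratio clamped to [1.0, cap]. The formula and
    the cap are illustrative assumptions, not Uber's actual pricing logic.
    """
    if demand < 0 or supply < 0:
        raise ValueError("demand and supply must be non-negative")
    if supply == 0:
        # No visible supply: saturate at the cap if anyone is requesting a trip.
        return cap if demand > 0 else 1.0
    return max(1.0, min(cap, demand / supply))


# Healthy data: supply exceeds demand, so the multiplier sits at its minimum.
healthy = toy_surge_multiplier(demand=120, supply=150)

# Degraded data: the same area, but the supply feed silently lost 90 of its
# 150 records, so the model sees only 60 available drivers.
degraded = toy_surge_multiplier(demand=120, supply=60)

print(f"healthy={healthy}, degraded={degraded}")
```

In this toy setting nothing about the real marketplace changed between the two calls; only the completeness of the supply data did, yet the computed multiplier moved from its minimum to twice that value.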