Data Quality Score: The next chapter of data quality at Airbnb

Introduction

These days, as the volume of data collected by companies grows exponentially, we’re all realizing that more data is not always better. In fact, more data, especially if you can’t rely on its quality, can hinder a company by slowing down decision-making or causing poor decisions.

With 1.4 billion cumulative guest arrivals as of year-end 2022, Airbnb’s growth pushed us to an inflection point where diminishing data quality began to hinder our data practitioners. Weekly metric reports were difficult to land on time. Seemingly basic metrics like “Active Listings” relied on a web of upstream dependencies. Conducting meaningful data work required significant institutional knowledge to overcome hidden caveats in our data.

To meet this challenge, we introduced the “Midas” process to certify our data. Starting in 2020, the Midas process, along with the work to re-architect our most critical data models, has brought a dramatic increase in data quality and timeliness to Airbnb’s most critical data. However, achieving the full data quality criteria required by Midas demands significant cross-functional investment to design, develop, validate, and maintain the necessary data assets and documentation.

While this made sense for our most critical data, pursuing such rigorous standards at scale presented challenges. We were approaching a point of diminishing returns on our data quality investments. We had certified our most critical assets, restoring their trustworthiness. However, for all of our uncertified data, which remained the majority of our offline data, we lacked visibility into its quality and didn’t have clear mechanisms for up-leveling it.

How could we scale the hard-fought wins and best practices of Midas across our entire data warehouse?

In this blog post, we share our innovative approach to scoring data quality, Airbnb’s Data Quality Score (“DQ Score”). We’ll cover how we developed the DQ Score, how it’s being used today, and how it will power the next chapter of data quality at Airbnb.

Scaling Data Quality

In 2022, we began exploring ideas for scaling data quality beyond Midas certification. Data producers were requesting a lighter-weight process that could provide some of the quality guardrails of Midas, but with less rigor and time investment. Meanwhile, data consumers continued to fly blind on all data that wasn’t Midas-certified. The brand around Midas-certified data was so strong that consumers started to question whether they should trust any uncertified data. Hesitant to dilute the Midas branding, we wanted to avoid introducing a lightweight version of certification that further stratified our data without truly unlocking long-term scalability.

Considering these challenges, we decided to shift to a data quality strategy that pushed the incentives around data quality directly to data producers and consumers. We could no longer rely on enforcement to scale data quality at Airbnb; instead, we needed to incentivize both the data producer and the data consumer.

To fully enable this incentivization approach, we believed it would be paramount to introduce the concept of a data quality score directly tied to data assets.

We identified the following objectives for the score:

Evolve our understanding of data quality beyond a simple binary definition (certified vs uncertified).
Align on the input components for assessing data quality.
Enable full visibility into the quality of our offline data warehouse and individual data assets. This visibility should 1) create natural incentives for producers to improve the quality of the data they own, and 2) drive demand for high-quality data from data consumers and enable consumers to decide if the quality is appropriate for their needs.

Composing the Score

Before diving into the nuances of measuring data quality, we drove alignment on the vision by defining our DQ Score guiding principles. With the input of a cross-functional group of data practitioners, we aligned on these guiding principles:

Full coverage — score can be applied to any in-scope data warehouse data asset
Automated — collection of inputs that determine the score is 100% automated
Actionable — score is easy to discover and actionable for both producers and consumers
Multi-dimensional — score can be decomposed into pillars of data quality
Evolvable — scoring criteria and their definitions can change over time

While they may seem simple or obvious, establishing these principles was critical as they guided each decision made in developing the score. Questions that otherwise would have derailed progress were mapped back to our principles.

For example, our principles were critical in determining which items from our wishlist of scoring criteria should be considered. There were several inputs that certainly could help us measure quality, but if they could not be automatically measured (Automated), or if they were so convoluted that data practitioners wouldn’t understand what the criterion meant or how it could be improved upon (Actionable), then they were discarded.

Some of our input signals measured quality directly (Midas certification, data validation, bugs, SLAs, automated DQ checks, etc.), whereas others were more like proxies for quality (e.g., valid ownership, good governance hygiene, the use of paved path tooling). Were the more explicit and direct measurements of quality more valuable than the proxies?

Guided by our principles, we eventually settled on four dimensions of data quality: Accuracy, Reliability (Timeliness), Stewardship, and Usability. We considered several other possible dimensions, but these four were the most meaningful and useful to our data practitioners, and they made sense as axes of improvement: areas we care about and are willing to invest in.

Each dimension could mix implicit and explicit quality indicators. The key point is that not every data consumer needs to fully understand every individual scoring component; they only need to understand that a dataset that scores poorly on Reliability and Usability struggles to land on time consistently and is difficult to use.

We could also weight each dimension according to our perception of its importance in determining quality. To allocate 100 total points across the dimensions, we considered 1) how many scoring components belonged to each dimension, 2) how to enable quick mental math, and 3) which elements our practitioners care about most:

The “Dimensions of Data Quality” and their weights
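As a rough illustration of how a 100-point allocation across dimensions can roll up into a single score, here is a minimal Python sketch. The point values in `DIMENSION_POINTS` are placeholders chosen only so that they sum to 100; they are an assumption for this example, not the weights from the figure. The function name `total_score` is likewise hypothetical.

```python
# Illustrative point allocation per dimension. These values are assumptions
# for the example (they only need to sum to 100), not Airbnb's actual weights.
DIMENSION_POINTS = {
    "accuracy": 30,
    "reliability": 30,
    "stewardship": 20,
    "usability": 20,
}


def total_score(earned: dict[str, float]) -> float:
    """Add up the points an asset earned in each dimension.

    `earned` maps a dimension name to the points earned in it. A dimension
    missing from `earned` counts as zero points.
    """
    if sum(DIMENSION_POINTS.values()) != 100:
        raise ValueError("dimension allocations must sum to 100")

    total = 0.0
    for dimension, max_points in DIMENSION_POINTS.items():
        points = earned.get(dimension, 0.0)
        if not 0 <= points <= max_points:
            raise ValueError(
                f"{dimension}: {points} is outside the range 0..{max_points}"
            )
        total += points
    return total


# An asset that is fully accurate but weak elsewhere.
example = {"accuracy": 30, "reliability": 10, "stewardship": 0, "usability": 0}
print(total_score(example))
```

Keeping the allocation in one small table is what makes the score easy to reason about: each dimension contributes at most its share of the 100 points, so a reader can tell at a glance how much a weak dimension can cost.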

Meanwhile, if desired, the dimensions could be unpacked to get to a more detailed view of data quality issues. For example, the Stewardship dimension scores an asset for quality indicators like whether it’s built on our paved path data engineering tools, its governance hygiene, and whether it meets valid data ownership standards.

Unpacking the Data Stewardship Dimension
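To make the idea of unpacking a dimension concrete, the sketch below models a dimension as a list of pass/fail scoring components, each with a remediation hint. The component names follow the Stewardship indicators mentioned above, but the point values, the hint wording, and the `Component` and `score_dimension` names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One scoring component inside a dimension (hypothetical structure)."""
    name: str
    max_points: int
    passed: bool
    how_to_improve: str


def score_dimension(components: list[Component]) -> tuple[int, int, list[str]]:
    """Return (points earned, points possible, steps to improve)."""
    earned = sum(c.max_points for c in components if c.passed)
    possible = sum(c.max_points for c in components)
    steps = [c.how_to_improve for c in components if not c.passed]
    return earned, possible, steps


# Illustrative Stewardship components; the point values are made up.
stewardship = [
    Component("paved_path_tooling", 8, True,
              "Build the asset with paved path data engineering tools."),
    Component("governance_hygiene", 6, False,
              "Resolve outstanding governance issues on the asset."),
    Component("valid_ownership", 6, True,
              "Assign an owner that meets data ownership standards."),
]

earned, possible, steps = score_dimension(stewardship)
print(f"Stewardship: {earned}/{possible}")
for step in steps:
    print("To improve:", step)
```

Attaching the remediation hint to the component itself means the same structure can serve both audiences: consumers read the score, and producers read the list of failed components.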

Presenting the Score to Practitioners

We knew surfacing the DQ Score in an explorable, actionable format was critical to its adoption and success. Furthermore, we had to surface data quality information directly in the venue where data users already discovered and explored data.

Luckily, we had two existing tools that would make this much easier: Dataportal (Airbnb’s data catalog and exploration UI), and the Unified Metadata Service (UMS). The score itself is computed in a daily offline data pipeline that collects and transforms various metadata elements from our data systems. The final task of the pipeline uploads the score for each data asset into UMS. By ingesting the DQ Score into UMS, we can surface the score and its components alongside every data asset in Dataportal, the starting point for all data discovery and exploration at Airbnb. All that remained was designing its presentation.
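The final upload step of such a pipeline might look something like the following sketch. Everything here is hypothetical: `AssetScore`, `MetadataStore`, `InMemoryStore`, and `publish_scores` are names invented for illustration and do not describe the real UMS interface. An in-memory store stands in for the metadata service so the example runs on its own.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Protocol


@dataclass(frozen=True)
class AssetScore:
    """Score record for one data asset on one day (hypothetical shape)."""
    asset: str
    score_date: date
    total: int
    by_dimension: dict[str, int]


class MetadataStore(Protocol):
    """Minimal interface assumed for a metadata service client."""
    def put(self, asset: str, payload: dict) -> None: ...


class InMemoryStore:
    """Stand-in for a real metadata service, used only for this example."""
    def __init__(self) -> None:
        self.records: dict[str, dict] = {}

    def put(self, asset: str, payload: dict) -> None:
        self.records[asset] = payload


def publish_scores(scores: list[AssetScore], store: MetadataStore) -> int:
    """Upload one payload per asset and return how many were uploaded."""
    uploaded = 0
    for score in scores:
        payload = asdict(score)
        # Dates are serialized as ISO strings so the payload is JSON-friendly.
        payload["score_date"] = score.score_date.isoformat()
        store.put(score.asset, payload)
        uploaded += 1
    return uploaded


store = InMemoryStore()
daily_scores = [
    AssetScore("db.table_a", date(2023, 1, 2), 40,
               {"accuracy": 30, "reliability": 10}),
]
print(publish_scores(daily_scores, store), "score(s) published")
```

Writing the upload against a small interface rather than a concrete client keeps the scoring logic testable without access to the metadata service.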

One of our goals was to surface the concept of quality to data practitioners with varying expertise and needs. Our user base had fully adopted the certified vs uncertified dynamic, but this was the first time we would be presenting the concept of a spectrum of quality, as well as the criteria used to define quality.

What would be the most interpretable version of a DQ Score? We needed to present a single data quality score that held meaning at a quick glance, while also making it possible to explore the score in more detail.

Our final design presents data quality in three ways, each with a different use case in mind:

A single, high-level score from 0–100. We assigned categorical thresholds of “Poor”, “Okay”, “Good”, and “Great” based on a profiling analysis of our data warehouse that examined the existing distribution of our DQ score. Best for quick, high-level assessment of a dataset’s overall quality.
Dimensional scores, where an asset can score perfectly on Accuracy but low on Reliability. Useful when a particular area of deficiency is not problematic (e.g., the consumer wants the data to be very accurate but is not worried about it landing quickly every day).
Full score detail + Steps to improve, where data consumers can see exactly where an asset falls short and data producers can take action to improve an asset’s quality.
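Mapping a 0–100 score to the four categorical labels can be sketched as a simple threshold lookup. The cut points below are assumptions for illustration only; the actual thresholds came from a profiling analysis of the warehouse and are not stated here. The names `CUT_POINTS` and `quality_label` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical lower bounds for "Okay", "Good", and "Great".
# Any score below the first cut point is labeled "Poor".
CUT_POINTS = [50, 70, 85]
LABELS = ["Poor", "Okay", "Good", "Great"]


def quality_label(score: float) -> str:
    """Return the categorical label for a score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score {score} is outside the range 0..100")
    # bisect_right counts how many cut points are <= score, which is
    # exactly the index of the matching label.
    return LABELS[bisect_right(CUT_POINTS, score)]


for score in (40, 50, 72, 100):
    print(score, quality_label(score))
```

Because the cut points live in one list, they can be re-derived from a fresh profiling run and swapped in without touching the lookup logic.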

All three of these presentations are shown in the screenshots below. The default presentation shows the dimensional scores (“Scores per category”), the categorical descriptor (“Poor” in this example) alongside the overall score (40 points), and the steps to improve.

Full data quality score page in Dataportal

If a user explores the full score details, they can examine the exact quality shortcomings and view informative tooltips providing more detail on the scoring component’s definition and merit.

Full score detail presentation

How the Score Is Being Used Today

For data producers, the score provides:

Clear, actionable steps to improve the DQ of their assets
Quantified DQ, measuring their work
Clear expectations around DQ
Targets for tech debt clean-up

For data consumers, the DQ Score:

Improves data discoverability
Serves as a signal of trustworthiness for data (just like how the review system works for Airbnb Guests and Hosts)
Informs consumers of the exact quality shortcomings so they can be comfortable with how they’re using the data
Enables consumers to seek out and demand data quality

From a data strategy perspective, we are leveraging internal query data combined with the DQ Score to drive DQ efforts across our data warehouse. By considering both the volume and the type of consumption (e.g., whether a particular metric is surfaced in our Executive reporting), we are able to direct data teams to the most impactful data quality improvements. This visibility has been very enlightening for teams who were unaware of their long tail of low-quality assets, and has enabled us to double down on quality investments for heavy-lift data models that power a significant share of our data consumption.
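One way to picture combining consumption with the DQ Score is a simple ranking heuristic: the larger the quality gap and the heavier the usage, the higher the priority. The formula, the `EXEC_MULTIPLIER` value, and the `AssetUsage`, `impact`, and `prioritize` names below are all assumptions for illustration, not the method used at Airbnb.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssetUsage:
    """Quality and consumption facts for one asset (hypothetical shape)."""
    name: str
    dq_score: int            # 0..100
    weekly_queries: int
    in_exec_reporting: bool


# Hypothetical boost for assets that feed executive reporting.
EXEC_MULTIPLIER = 5


def impact(asset: AssetUsage) -> int:
    """Estimate the payoff of improving an asset: quality gap x usage."""
    gap = 100 - asset.dq_score
    usage_weight = EXEC_MULTIPLIER if asset.in_exec_reporting else 1
    return gap * asset.weekly_queries * usage_weight


def prioritize(assets: list[AssetUsage], top_n: int = 10) -> list[AssetUsage]:
    """Return the assets with the highest estimated impact first."""
    return sorted(assets, key=impact, reverse=True)[:top_n]


assets = [
    AssetUsage("heavily_used_good_table", 90, 5000, False),
    AssetUsage("popular_low_quality_table", 40, 1000, False),
    AssetUsage("exec_metric_source", 80, 100, True),
]
for asset in prioritize(assets):
    print(asset.name, impact(asset))
```

A ranking like this surfaces two different kinds of targets: low-quality assets with meaningful usage, and already-decent assets whose consumption is so heavy that small quality gains still matter.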

Finally, by developing the DQ Score, we were able to provide uniform guidance to our data producers on producing high-quality, albeit uncertified, assets. The DQ Score has not replaced certification (e.g., only Midas-certified data can achieve a DQ Score > 90). We continue to certify our most critical subset of data, and believe the use cases for these assets will always merit the manual validation, rigor, and upkeep of certification. But for everything else, the DQ Score reinforces and scales the principles of Midas across our warehouse.
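The rule that only certified data can score above 90 could be expressed in several ways; the post does not say how it is enforced. One possibility, shown purely as an assumption, is a ceiling applied to uncertified assets. The names `UNCERTIFIED_CEILING` and `effective_score` are hypothetical.

```python
# Highest score an uncertified asset can reach, per the rule described above.
UNCERTIFIED_CEILING = 90


def effective_score(raw_score: int, midas_certified: bool) -> int:
    """Apply the certification ceiling to a raw 0..100 score.

    This is one possible way to encode the rule, not necessarily how the
    real scoring criteria produce it.
    """
    if not 0 <= raw_score <= 100:
        raise ValueError(f"score {raw_score} is outside the range 0..100")
    if midas_certified:
        return raw_score
    return min(raw_score, UNCERTIFIED_CEILING)


print(effective_score(97, midas_certified=False))
print(effective_score(97, midas_certified=True))
```

An equivalent design would reserve the last ten points for a certification component within a dimension, which would keep the rule visible in the score breakdown instead of applying it as a separate cap.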

What’s Next

We’re excited about now being able to measure and observe quantified improvements to our data quality, but we’re just getting started. We recently expanded on the original DQ Score to score our Minerva metrics and dimensions. Similarly, we plan to bring the same concept of a DQ Score to other data assets like our event logs and ML features.

As the requirements and demands against our data continue to evolve, so will our quality expectations. We’ll continue to evolve how we define and measure quality, and with rapid improvement in areas like metadata management and data classification, we anticipate further efficiency and productivity gains for all data practitioners at Airbnb.

Appreciations

The DQ Score would not have been possible without several cross-functional and cross-org collaborators. They include, but are not limited to: Mark Steinbrick, Chitta Shirolkar, Jonathan Parks, Sylvia Tomiyama, Felix Ouk, Jason Flittner, Ying Pan, Logan George, Woody Zhou, Michelle Thomas, and Erik Ritter.

Special thanks to the broader Airbnb data community members who provided input or aid to the implementation team throughout the design, development, and launch phases.

If this type of work interests you, check out some of our related positions.

****************

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.