How Meta discovers data flows via lineage at scale
- Data lineage is an instrumental part of Meta’s Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta’s systems. This allows us to verify that the data from our users’ everyday interactions is protected across our family of apps, such as the religious views people share in the Facebook Dating app, the example we’ll walk through in this post.
- To build high-quality data lineage, we developed techniques to collect data flow signals across different technology stacks, including static code analysis for multiple languages, runtime instrumentation, and input/output data matching. We then built an intuitive UX into our tooling that enables developers to consume all of this lineage data systematically, saving significant engineering time when building privacy controls.
- As we expanded PAI across Meta, we gained valuable insights about the data lineage space. Our understanding of the privacy space evolved, revealing the need for an early focus on data lineage, consumption tooling, a cohesive ecosystem of libraries, and more. These investments accelerated the development of data lineage and let us implement purpose limitation controls more quickly and efficiently.
At Meta, we believe that privacy enables product innovation. This belief has led us to develop Privacy Aware Infrastructure (PAI), which offers efficient and reliable first-class privacy constructs embedded in Meta infrastructure to address different privacy requirements, such as purpose limitation, which restricts the purposes for which data can be processed and used.
In this blog, we will delve into an early stage in PAI implementation: data lineage. Data lineage refers to the process of tracing the journey of data as it moves through various systems, illustrating how data transitions from one data asset, such as a database table (the source asset), to another (the sink asset). We’ll also walk through how we track the lineage of users’ “religion” information in our Facebook Dating app.
Millions of data assets are vital for supporting our product ecosystem, ensuring the functionality our users anticipate, maintaining high product quality, and safeguarding user safety and integrity. Data lineage enables us to efficiently navigate these assets and protect user data. It enhances the traceability of data flows within systems, ultimately empowering developers to swiftly implement privacy controls and create innovative products.
Note that data lineage is dependent on having already completed important and complex preliminary steps to inventory, schematize, and annotate data assets into a unified asset catalog. This took Meta multiple years to complete across our millions of disparate data assets, and we’ll cover each of these more deeply in future blog posts:
- Inventorying involves collecting various code and data assets (e.g., web endpoints, data tables, AI models) used across Meta.
- Schematization expresses data assets in structural detail (e.g., indicating that a data asset has a field called “religion”).
- Annotation labels data to describe its content (e.g., specifying that the identity column contains religion data); a toy catalog entry after this list shows how the three steps fit together.
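To make this concrete, here is a minimal sketch of what a single catalog entry could look like once all three steps are done. The asset ID, schema, and annotation keys are hypothetical, invented purely for illustration:

```python
# Hypothetical asset-catalog entry; all names and keys are invented.
catalog_entry = {
    "asset": "asset://hive.table/dating_profiles",          # inventoried
    "schema": {"user_id": "bigint", "religion": "string"},  # schematized
    "annotations": {"religion": "RELIGION"},                # annotated
}
```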
Understanding data lineage at Meta
To establish robust privacy controls, an essential part of our PAI initiative is to understand how data flows across different systems. Data lineage is part of this discovery step in the PAI workflow, as shown in the following diagram:
Data lineage is a key precursor to implementing Policy Zones, our information flow control technology, because it answers the question, “Where does my data come from and where does it go?” – helping inform the right places to apply privacy controls. In conjunction with Policy Zones, data lineage provides the following key benefits to thousands of developers at Meta:
- Scalable data flow discovery: Data lineage answers the question above by providing an end-to-end, scalable graph of relevant data flows. We can leverage the lineage graphs to visualize and explain the flow of relevant data from the point where it is collected to all the places where it is processed.
- Efficient rollout of privacy controls: By leveraging data lineage to track data flows, we can easily pinpoint the optimal integration points for privacy controls like Policy Zones within the codebase, streamlining the rollout process. To this end, we developed Policy Zone Manager (PZM), a powerful flow-discovery tool in our PAI tool suite built on data lineage. PZM enables developers to rapidly identify multiple downstream assets from a set of sources simultaneously, thereby accelerating the rollout of privacy controls.
- Continuous compliance verification: Once a privacy requirement has been fully implemented, data lineage plays a vital role in continuously monitoring and validating data flows, complementing enforcement mechanisms such as Policy Zones.
Traditionally, data lineage has been collected via code inspection using manually authored data flow diagrams and spreadsheets. However, this approach does not scale in large and dynamic environments, such as Meta, with billions of lines of continuously evolving code. To tackle this challenge, we’ve developed a robust and scalable lineage solution that uses static code analysis signals as well as runtime signals.
Walkthrough: Implementing data lineage for religion data
We’ll share how we have automated lineage tracking to identify religion data flows through our core systems, ultimately creating a precise, end-to-end view of the downstream assets that carry religion data, via the following two key stages:
- Collecting data flow signals: a process to capture data flow signals from many processing activities across different systems, not only for religion, but for all other types of data, to create an end-to-end lineage graph.
- Identifying relevant data flows: a process to identify the specific subset of data flows (“subgraph”) within the lineage graph that pertains to religion.
These stages span various systems, including function-based systems (e.g., web systems and backend services) that load, process, and propagate data through stacks of function calls in different programming languages (e.g., Hack, C++, Python), and batch-processing systems (e.g., data warehouse and AI systems) that process rows of data in batch, mainly via SQL.
For simplicity, we will demonstrate these for the web, the data warehouse, and AI, per the diagram below.
Collecting data flow signals for the web system
When setting up a profile on the Facebook Dating app, people can populate their religious views. This information is then used to identify relevant matches with other people who have specified matching values in their dating preferences. On Dating, religious views are subject to purpose limitation requirements; for example, they will not be used to personalize experiences on other Facebook Products.
We start with someone entering their religion information into their Dating profile on their mobile device, which is then transmitted to a web endpoint. The web endpoint subsequently logs the data into a logging table and stores it in a database, as depicted in the following code snippet:
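The production code is not shown here (Meta’s web endpoints are typically written in Hack); the following is a minimal Python sketch of the same shape, in which every name (update_dating_profile, profile_logger, profile_db) is hypothetical:

```python
# Hypothetical sketch of the endpoint: one source (the request parameter)
# flowing into two sinks (a logging table and a profile database).
profile_logger = []  # stands in for a logging-table client
profile_db = {}      # stands in for a profile-database client

def update_dating_profile(user_id: int, params: dict) -> None:
    religion = params["religion"]  # source: the user-entered value
    # Sink 1: log the value to a logging table for downstream processing.
    profile_logger.append({"user_id": user_id, "religion": religion})
    # Sink 2: store the value in the profile database.
    profile_db[user_id] = {"religion": religion}

update_dating_profile(42, {"religion": "Buddhist"})
```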
Now let’s see how we collect lineage signals. To do this, we employ both static and runtime analysis tools to effectively discover data flows, particularly focusing on where religion data is logged and stored. By combining static and runtime analysis, we enhance our ability to accurately track and manage data flows.
Static analysis tools simulate code execution to map out data flows within our systems. They also emit quality signals to indicate the confidence of whether a data flow signal is a true positive. However, these tools are limited by their lack of access to runtime data, which can lead to false positives from unexecuted code.
To address this limitation, we utilize Privacy Probes, a key component of our PAI lineage technologies. Privacy Probes automate data flow discovery by collecting runtime signals. These signals are gathered in real time during the execution of requests, allowing us to trace the flow of data into loggers, databases, and other services.
We have instrumented Meta’s core data frameworks and libraries at both the data origin points (sources) and their eventual outputs (sinks), such as the logging framework, which allows for comprehensive data flow tracking. This approach is exemplified in the following code snippet:
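Meta’s actual instrumentation is internal; as a rough illustration, a sampled probe at a logging-framework sink could look like the following Python sketch, where all names are hypothetical:

```python
# Hypothetical sketch of sink instrumentation: record sampled payloads plus
# metadata (timestamp, asset ID, stack trace) as evidence of a data flow.
import random
import time
import traceback

PROBE_EVENTS = []  # in practice, events are shipped off-box for matching

def probe(kind: str, asset_id: str, payload, sample_rate: float = 0.01) -> None:
    """Capture a source or sink payload on a sampled basis."""
    if random.random() > sample_rate:
        return
    PROBE_EVENTS.append({
        "kind": kind,                       # "source" or "sink"
        "asset": asset_id,
        "payload": payload,
        "timestamp": time.time(),
        "stack": traceback.format_stack(),  # evidence for human review
    })

def log_event(logger_name: str, payload: dict) -> None:
    """Logging-framework entry point, instrumented as a sink."""
    probe("sink", f"asset://logger/{logger_name}", payload, sample_rate=1.0)
    # ... the framework would write the actual log row here ...
```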
During runtime execution, Privacy Probes does the following:
- Capturing payloads: It captures source and sink payloads in memory on a sampled basis, along with supplementary metadata such as event timestamps, asset identifiers, and stack traces as evidence for the data flow.
- Comparing payloads: It then compares the source and sink payloads within a request to identify data matches, which helps in understanding how data flows through the system.
- Categorizing results: It categorizes results into two sets. The match-set includes pairs of source and sink assets whose data matches exactly or where one value is contained in the other, providing high-confidence evidence of data flow between the assets. The full-set includes all source and sink pairs within a request, regardless of whether the sink is tainted by the source. The full-set is a superset of the match-set: it contains some noise, but it is still important to surface to human reviewers because it may capture transformed data flows.
The above procedure is depicted in the diagram below:
Let’s look at the following examples, where various religion values are received at an endpoint and corresponding values (copied or transformed) are logged by three different loggers:
| Input Value (source) | Output Value (sink) | Data Operation | Match Result | Flow Confidence |
|---|---|---|---|---|
| "Atheist" | "Atheist" | Data Copy | EXACT_MATCH | HIGH |
| "Buddhist" | {metadata: {religion: Buddhist}} | Substring | CONTAINS | HIGH |
| {religions: ["Catholic", "Christian"]} | {count: 2} | Transformed | NO_MATCH | LOW |
In the examples above, the first two rows show a precise match of religion values between the source and the sink, so they belong to the high-confidence match-set. The third row depicts a transformed data flow, where the input values are reduced to a count before being logged; it belongs only to the full-set.
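A minimal sketch of the payload-matching rule behind these categories (illustrative only; real payloads are structured, so values are flattened before comparison):

```python
# Illustrative matcher for the categories in the table above.
import json

def flatten(value) -> str:
    """Serialize structured payloads so they can be compared as strings."""
    return value if isinstance(value, str) else json.dumps(value)

def match(source, sink) -> tuple[str, str]:
    src, snk = flatten(source), flatten(sink)
    if src == snk:
        return "EXACT_MATCH", "HIGH"  # match-set: copied verbatim
    if src in snk or snk in src:
        return "CONTAINS", "HIGH"     # match-set: one payload embeds the other
    return "NO_MATCH", "LOW"          # full-set only: possibly transformed

print(match("Atheist", "Atheist"))                                    # EXACT_MATCH
print(match("Buddhist", {"metadata": {"religion": "Buddhist"}}))      # CONTAINS
print(match({"religions": ["Catholic", "Christian"]}, {"count": 2}))  # NO_MATCH
```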
These signals together are used to construct a lineage graph to understand the flow of data through our web system as shown in the following diagram:
Collecting data flow signals for the data warehouse system
With the user’s religion logged in our web system, it can propagate to the data warehouse for offline processing. To gather data flow signals here, we combine runtime instrumentation and static code analysis differently than in the web system: compute engines such as Presto and Spark log the SQL queries they execute for data processing activities, and static analysis is then performed on those logged queries and job configs to extract data flow signals.
Let’s examine a simple SQL query example that processes data for the data warehouse, like the following:
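The query is not reproduced in this post; a plausible reconstruction, using only the table and column names from the lineage discussion below, might be:

```python
# A hypothetical reconstruction of the logged query; the table and column
# names are taken from the post, while the query shape is assumed.
religion_training_query = """
    INSERT INTO safety_training_tbl
    SELECT
        user_id  AS target_user_id,
        religion AS target_religion
    FROM safety_log_tbl
"""
```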
We’ve developed a SQL analyzer to extract data flow signals between the input table, "safety_log_tbl", and the output table, "safety_training_tbl", as shown in the following diagram. In practice, we also collect lineage at a more granular level, such as column level (e.g., "user_id" -> "target_user_id", "religion" -> "target_religion").
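Meta’s SQL analyzer is internal; as a rough sketch of the idea, table-level signals can be pulled out of a query with an open-source parser such as sqlglot (an assumption for illustration, not the analyzer we use):

```python
# Table-level lineage sketch using the open-source sqlglot parser.
import sqlglot
from sqlglot import exp

sql = """
    INSERT INTO safety_training_tbl
    SELECT user_id AS target_user_id, religion AS target_religion
    FROM safety_log_tbl
"""
tree = sqlglot.parse_one(sql)
sink = tree.this.name  # the INSERT target: "safety_training_tbl"
sources = sorted({t.name for t in tree.find_all(exp.Table)} - {sink})
print(sources, "->", sink)  # ['safety_log_tbl'] -> safety_training_tbl
```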
There are instances where data is not fully processed by SQL queries, resulting in logs that contain data flow signals for either reads or writes, but not both. To ensure we have complete lineage data, we leverage contextual information (such as execution environment and job or trace IDs) collected at runtime to connect these reads and writes together.
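As a toy illustration of that stitching, a read-only signal and a write-only signal sharing a trace ID can be joined into a single flow (the event shapes here are hypothetical):

```python
# Join partial read/write signals on their shared trace ID.
reads = [{"trace_id": "t1", "table": "safety_log_tbl"}]
writes = [{"trace_id": "t1", "table": "safety_training_tbl"}]

flows = [
    (r["table"], w["table"])
    for r in reads
    for w in writes
    if r["trace_id"] == w["trace_id"]
]
print(flows)  # [('safety_log_tbl', 'safety_training_tbl')]
```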
The following diagram illustrates how the lineage graph has expanded:
Collecting data flow signals for the AI system
For our AI systems, we collect lineage signals by tracking relationships between various assets, such as input datasets, features, models, workflows, and inferences. A common approach is to extract data flows from job configurations used for different AI activities such as model training.
For instance, in order to improve the relevance of dating matches, we use an AI model to recommend potential matches based on users’ shared religious views. Let’s take a look at the following training config example for this model that uses religion data:
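The config below is a hypothetical reconstruction: the three asset IDs match those discussed next, while the keys and trainer settings are invented for illustration:

```python
# Hypothetical training config; only the asset IDs are taken from the post.
training_config = {
    "model": "asset://ai.model/dating_ranking_model",
    "input_dataset": "asset://hive.table/dating_training_tbl",
    "features": [
        "asset://ai.feature/DATING_USER_RELIGION_SCORE",
    ],
    "trainer": {"epochs": 10, "objective": "ranking"},
}
```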
By parsing this config obtained from the model training service, we can track the data flow from the input dataset (with asset ID asset://hive.table/dating_training_tbl) and feature (with asset ID asset://ai.feature/DATING_USER_RELIGION_SCORE) to the model (with asset ID asset://ai.model/dating_ranking_model).
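Continuing the hypothetical config above, extracting lineage edges amounts to pointing every input asset at the model it trains:

```python
# Derive one lineage edge per input asset, all flowing into the model
# (uses the hypothetical training_config sketched above).
def lineage_edges(config: dict) -> list[tuple[str, str]]:
    sink = config["model"]
    sources = [config["input_dataset"], *config["features"]]
    return [(source, sink) for source in sources]

for source, sink in lineage_edges(training_config):
    print(source, "->", sink)
# asset://hive.table/dating_training_tbl -> asset://ai.model/dating_ranking_model
# asset://ai.feature/DATING_USER_RELIGION_SCORE -> asset://ai.model/dating_ranking_model
```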
Our AI systems are also instrumented so that asset relationships and data flow signals are captured at various points at runtime, including data-loading layers (e.g., DPP) and libraries (e.g., PyTorch), workflow engines (e.g., FBLearner Flow), training frameworks, inference systems (as backend services), etc. Lineage collection for backend services utilizes the approach for function-based systems described above. By matching the source and sink assets for different data flow signals, we are able to capture a holistic lineage graph at the desired granularities:
Identifying relevant data flows from a lineage graph
Now that we have the lineage graph at our disposal, how can we effectively distill the subset of data flows pertinent to a specific privacy requirement for religion data? To address this question, we have developed an iterative analysis tool that enables developers to pinpoint precise data flows and systematically filter out irrelevant ones. The tool runs an iterative discovery process over the lineage graph, aided by privacy controls from Policy Zones, to narrow down the most relevant flows. This refined view lets developers make a final determination about which flows matter, producing an optimal path for traversing the lineage graph. The following are the major steps involved, captured holistically in the diagram below and sketched in code after the list:
- Discover data flows: Identify data flows starting from the source assets, stopping at downstream assets whose flows are low-confidence (yellow nodes).
- Exclude and include candidates: Developers or automated heuristics exclude candidates that don’t carry religion data (red nodes) and include the remaining ones (green nodes). Excluding a red node early also prunes everything downstream of it, which saves significant developer effort. As an additional safeguard, developers also implement privacy controls via Policy Zones so that all relevant data flows are captured.
- Repeat discovery cycle: Use the green nodes as new sources and repeat the cycle until no more green nodes are confirmed.
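A minimal sketch of this discovery loop, assuming a simple adjacency-list graph, per-edge confidence labels, and a developer/heuristic relevance callback (all hypothetical):

```python
from collections import deque

def discover_flows(graph, sources, confidence, is_relevant):
    """Iteratively expand from `sources` through high-confidence edges.

    graph[node]        -> iterable of downstream nodes
    confidence[(u, v)] -> "HIGH" or "LOW" flow confidence
    is_relevant(node)  -> developer / heuristic decision (green vs. red)
    """
    confirmed = set(sources)
    frontier = deque(sources)
    while frontier:
        node = frontier.popleft()
        for child in graph.get(node, ()):
            if child in confirmed:
                continue
            if confidence.get((node, child)) != "HIGH":
                continue  # stop at low-confidence edges (yellow) for review
            if is_relevant(child):  # green: include and keep expanding
                confirmed.add(child)
                frontier.append(child)
            # red: excluded, so its entire downstream is pruned
    return confirmed

# Example: religion flows from the logging table into two candidate assets.
graph = {"safety_log_tbl": ["safety_training_tbl", "unrelated_tbl"]}
confidence = {("safety_log_tbl", "safety_training_tbl"): "HIGH",
              ("safety_log_tbl", "unrelated_tbl"): "HIGH"}
relevant = discover_flows(graph, ["safety_log_tbl"], confidence,
                          is_relevant=lambda n: n != "unrelated_tbl")
print(relevant)  # {'safety_log_tbl', 'safety_training_tbl'}
```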
With the collection and data flow identification steps complete, developers can locate the granular data flows that carry religion data across Meta’s complex systems and move forward in the PAI workflow to apply the privacy controls needed to safeguard the data. What was once an intimidating task can now be completed efficiently.
Our data lineage technology has provided developers with an unprecedented ability to quickly understand and protect religion and similar sensitive data flows. It enables Meta to scalably and efficiently implement privacy controls via PAI to protect our users’ privacy and deliver products safely.
Learnings and challenges
As we’ve worked to develop and implement lineage as a core PAI technology, we’ve gained valuable insights and overcome significant challenges, yielding some important lessons:
- Focus on lineage early and reap the rewards: As we developed privacy technologies like Policy Zones, it became clear that gaining a deep understanding of data flows across various systems is essential for scaling the implementation of privacy controls. By investing in lineage, we not only accelerated the adoption of Policy Zones but also uncovered new opportunities for applying the technology. Lineage can also be extended to other use cases such as security and integrity.
- Build lineage consumption tools to gain engineering efficiency: We initially focused on building a lineage solution but didn’t give sufficient attention to consumption tools for developers. As a result, asset owners had to work with raw lineage signals to discover relevant data flows, which was overwhelmingly complex. We addressed this by developing the iterative discovery tooling that guides engineers through relevant data flows, reducing engineering effort by orders of magnitude.
- Integrate lineage with systems to scale the coverage: Collecting lineage from Meta’s diverse systems was a significant challenge. Initially, we asked each system to collect lineage signals and ingest them into a centralized lineage service, but progress was slow. We overcame this by developing reliable, computationally efficient, and widely applicable PAI libraries with built-in lineage-collection logic in various programming languages (Hack, C++, Python, etc.). This enabled much smoother integration with a broad range of Meta’s systems.
- Measurement improves our outcomes: By incorporating the measurement of coverage, we’ve been able to evolve our data lineage so that we stay ahead of the ever-changing landscape of data and code at Meta. By enhancing our signals and adapting to new technologies, we can maintain a strong focus on privacy outcomes and drive ongoing improvements in lineage coverage across our tech stacks.
The future of data lineage
Data lineage is a vital component of Meta’s PAI initiative, providing a comprehensive view of how data flows across different systems. While we’ve made significant progress in establishing a strong foundation, our journey is ongoing. We’re committed to:
- Expanding coverage: continuously enhance the coverage of our data lineage capabilities to ensure a comprehensive understanding of data flows.
- Improving consumption experience: streamline the consumption experience to make it easier for developers and stakeholders to access and utilize data lineage information.
- Exploring new frontiers: investigate new applications and use cases for data lineage, driving innovation and collaboration across the industry.
By advancing data lineage, we aim to foster a culture of privacy awareness and drive progress in the broader field. Together, we can create a more transparent and accountable data ecosystem.
Acknowledgements
The authors would like to acknowledge the contributions of many current and former Meta employees who have played a crucial role in developing data lineage technologies over the years. In particular, we would like to extend special thanks to (in alphabetical order) Amit Jain, Aygun Aydin, Ben Zhang, Brian Romanko, Brian Spanton, Daniel Ramagem, David Molnar, Dzmitry Charnahalau, Gayathri Aiyer, George Stasa, Guoqiang Jerry Chen, Graham Bleaney, Haiyang Han, Howard Cheng, Ian Carmichael, Ibrahim Mohamed, Jerry Pan, Jiang Wu, Jonathan Bergeron, Joanna Jiang, Jun Fang, Kiran Badam, Komal Mangtani, Kyle Huang, Maharshi Jha, Manuel Fahndrich, Marc Celani, Lei Zhang, Mark Vismonte, Perry Stoll, Pritesh Shah, Qi Zhou, Rajesh Nishtala, Rituraj Kirti, Seth Silverman, Shelton Jiang, Sushaant Mujoo, Vlad Fedorov, Yi Huang, Xinbo Gao, and Zhaohui Zhang. We would also like to express our gratitude to all reviewers of this post, including (in alphabetical order) Aleksandar Ilic, Avtar Brar, Benjamin Renard, Bogdan Shubravyi, Brianna O’Steen, Chris Wiltz, Daniel Chamberlain, Hannes Roth, Imogen Barnes, Jason Hendrickson, Koosh Orandi, Rituraj Kirti, and Xenia Habekoss. We would like to especially thank Jonathan Bergeron for overseeing the effort and providing all of the guidance and valuable feedback, Supriya Anand for leading the editorial effort to shape the blog content, and Katherine Bates for pulling all required support together to make this blog post happen.