Redesigning Pinterest’s Ad Serving Systems with Zero Downtime (part 2)

Ning Zhang; Principal Engineer | Ang Xu; Principal Machine Learning Engineer | Claire Liu; Staff Software Engineer | Haichen Liu; Staff Software Engineer | Yiran Zhao; Staff Software Engineer | Haoyu He; Sr. Software Engineer | Sergei Radutnuy; Sr. Machine Learning Engineer | Di An; Sr. Software Engineer | Danyal Raza; Sr. Software Engineer | Xuan Chen; Sr. Software Engineer | Chi Zhang; Sr. Software Engineer | Adam Winstanley; Staff Software Engineer | Johnny Xie; Sr. Staff Software Engineer | Simeng Qu; Software Engineer II | Nishant Roy; Manager II, Engineering | Chengcheng Hu; Sr. Director, Engineering |

In the first part of this post, we introduced our motivations for rewriting the ads serving system and described the desired final state. We also outlined our design principles and the high-level design decisions on how to get there. In part 2, we concentrate on the detailed design, implementation, and validation process leading up to the final launch.

Design Principles

To recap, below is the list of design principles and goals we wanted to achieve:

  1. Easily extensible: The framework and APIs need to be flexible enough to support extensions to new functionalities as well as deprecation of old ones. Design-for-deprecation is often an omitted feature, which is why technical systems become bloated over time.
  2. Separation of concerns: Separate the infra framework from business logic by defining high-level abstractions that business logic can build on. Business logic owned by different teams needs to be modularized and isolated from each other.
  3. Safe-by-design: Our framework should support the safe use of concurrency and the enforcement of data integrity rules by default. For example, we want to enable developers to leverage concurrency for performant code, while ensuring there are no race conditions that may cause ML feature discrepancy across serving and logging.
  4. Development velocity: The framework should provide well-supported development environments and easy-to-use tools for debugging and analyses.

Design Decisions

With these principles in mind, designing a complex software system requires us to answer these two key questions:

  • How do we organize the code so that one team’s change does not break another team’s code?
  • How do we manage data to guarantee correctness and desired properties throughout the candidate scoring and trimming funnel?

To respond to the above questions, we made two major design decisions: 1) to use an in-house graph execution framework called Apex to organize the code, and 2) to build a write-once data model to be passed around the execution graph to guarantee safe execution and data integrity.

Apex: Graph Execution Framework

Apex is a framework extended from Twitter nodes that represents request processing as a directed acyclic graph (DAG). Apex encapsulates I/O and data processing as nodes and represents execution dependencies as edges. Each node can be considered a module owned by one team. The contract between two nodes is established by the node interfaces and their unit tests. To add new functionality to AdMixer, developers typically add a new node to the graph with proper dependencies on its upstream and downstream nodes. Conversely, deprecating functionality is as simple as removing a node from the graph. This design addresses the first two principles: extensibility and separation of concerns.

To address the other two design principles, we need to extend the original Twitter nodes implementation by:

  1. Enhancing Apex node definition to be type-safe
  2. Validating the data integrity constraints on the execution graph without executing it

Strongly-Typed Node

The original Twitter nodes interface is weakly typed in that a node is parameterized only by its output type. All inputs of a node are typed as Object, and the node implementer is responsible for casting those Objects to the expected input types. We found this can easily result in runtime exceptions from failed type casts. This type-safety issue violates our third design principle (Principle #3: safe-by-design). To address this, we extended Apex with type-safe Node and Graph constructs. An example is shown in Figure 1.

Figure 1. Type-safe node and graph constructions

In this example, each node is parameterized with both its input and output types. A graph object connects nodes together with dependencies. If a type mismatch occurs between connected nodes, the Java compiler throws a compile-time error. This lets engineers fix the issue at compile time rather than handling runtime exceptions in production. Catching errors as early as possible supports our design principle of development velocity (Principle #4).
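Pinterest's Apex API is internal, but the idea of compile-time-checked wiring can be sketched with plain Java generics. In this hypothetical sketch (Node, then, and TypedGraphDemo are illustrative names, not Apex's real API), connecting a node whose output type does not match the next node's input type simply fails to compile:

```java
import java.util.function.Function;

// Hypothetical sketch: a node parameterized by both its input and output types.
final class Node<I, O> {
    private final Function<I, O> fn;
    Node(Function<I, O> fn) { this.fn = fn; }
    O run(I input) { return fn.apply(input); }

    // Wiring only compiles when the output type of `this`
    // matches the input type of `next`.
    <R> Node<I, R> then(Node<O, R> next) {
        return new Node<>(in -> next.run(this.run(in)));
    }
}

public class TypedGraphDemo {
    public static int demo(String adRequest) {
        Node<String, Integer> retrieve = new Node<>(String::length);
        Node<Integer, Integer> score = new Node<>(n -> n * 10);
        // retrieve.then(score) compiles; retrieve.then(retrieve) would not,
        // because Integer (output) does not match String (input).
        return retrieve.then(score).run(adRequest);
    }
    public static void main(String[] args) {
        System.out.println(demo("req-123")); // prints 70
    }
}
```

The key property is that a mis-wired graph is rejected by javac, so the error never reaches a production environment.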

Integrity Constraints on Graphs

Another point worth noting is that, with the graph object, we can validate the correctness of the graph at construction time without executing it. This allows developers to catch errors on their local dev boxes without running the graph, which may not be possible outside a production environment. The correctness rules we check include acyclicity, thread safety, and the write-once property. In the following section, we introduce the thread-safety and write-once rules and how we enforce safety by design.
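The acyclicity check can be illustrated with a standard depth-first search over the node-dependency map, run at construction time before any node executes. This is a minimal sketch with hypothetical names, not Apex's actual validator:

```java
import java.util.*;

// Hypothetical sketch: validating that a node-dependency graph is acyclic
// at construction time, before any node is executed.
public class GraphValidator {
    public static boolean isAcyclic(Map<String, List<String>> edges) {
        Set<String> done = new HashSet<>();
        Set<String> inProgress = new HashSet<>();
        for (String node : edges.keySet()) {
            if (hasCycle(node, edges, done, inProgress)) return false;
        }
        return true;
    }

    private static boolean hasCycle(String node, Map<String, List<String>> edges,
                                    Set<String> done, Set<String> inProgress) {
        if (done.contains(node)) return false;          // already fully explored
        if (!inProgress.add(node)) return true;         // revisited on the current path: cycle
        for (String dep : edges.getOrDefault(node, List.of())) {
            if (hasCycle(dep, edges, done, inProgress)) return true;
        }
        inProgress.remove(node);
        done.add(node);
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dag = Map.of(
            "features", List.of(),
            "retrieval", List.of("features"),
            "ranking", List.of("retrieval", "features"));
        System.out.println(isAcyclic(dag)); // prints true: a valid DAG
    }
}
```

The same construction-time traversal can carry additional per-node checks (such as the write-once rules discussed below) without ever running node bodies.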

Data Model

A data model is simply a set of data structures and the corresponding operations they allow, such as getters and setters.

Mohawk primarily relied on Thrift data structures, which allowed the mutation of individual fields anywhere in the code. This violates our second and third design principles. With the Apex graph execution framework, two nodes that do not have ancestor-descendant relationships (i.e., no data dependencies) can be executed concurrently. When the same piece of data is passed to these nodes, there is a thread-safety issue, where the same piece of data can be updated by two threads concurrently.

A typical solution to this problem is to guard data access with locks, which solves the low-level data race. However, locks do not prevent accidental or unintentional concurrent updates. For example, the bid of an ad can be retrieved from a key-value store by a node A, while another node B multiplies the bid by a multiplier. Without proper data dependency management, these two nodes can execute in either order and produce different results. We add thread-safety rules that check whether such race conditions exist for any piece of data.

Another requirement is to satisfy certain data integrity constraints throughout the ad delivery path. One example is that we need to log the same ML features as the ones used for prediction. In a typical ad request, ML features are enriched first and then used by candidate generation, ranking, and finally the auction. Then, the features are logged for ML training purposes. In between these steps, there can be other nodes that filter candidates or update features due to business logic. Supporting data integrity rules allows us to ensure the logged ML features match what we used for serving.

To achieve these goals, we adopted a data-oriented approach when designing the whole graph. We decompose the graph into subgraphs, where each subgraph passes the same data model between its nodes. This makes each subgraph easily extensible (Principle #1). To achieve the other goals, we designed the data model passed between nodes to satisfy the write-once property: each piece of data in the data model can be written at most once. To let the compiler enforce this write-once rule, we introduced immutable data types for these pieces of data, eliminating human errors as early as possible (Principle #3: safe-by-design). In particular, it is easy to show that a write-once model is sufficient to satisfy both the thread-safety and ML feature consistency requirements. With this property, two other design goals are achieved:

  1. Separation of concerns: each node owned by different teams will be updating different pieces of data. This makes it possible for teams to concentrate on their business logic rather than worrying about data mutations by other teams.
  2. Development velocity: the compile-time checks detect issues early in the development process, minimizing the time spent on debugging data inconsistencies in production, which is significantly more complex.

More formally, an immutable data type can be defined recursively as follows:

  • A primitive data type or an immutable Java type such as Integer, String, or an enum.
  • An immutable collection type (list, map, set etc) where each element is of an immutable type. Once an immutable collection object is created, no elements can be inserted or deleted.
  • A Java class or record where each field is of an immutable type. Furthermore, the class exposes only getters and no setters, so all field values are initialized at construction time. To make construction easier for users, these classes usually expose a builder class.
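The third bullet can be sketched as a Java record plus a builder. This is an illustrative example only (AdCandidate and its fields are hypothetical, not the production data model):

```java
import java.util.List;

// Hypothetical sketch: an immutable ad-candidate type. All fields are set at
// construction time; the record exposes getters only, plus a builder for ergonomics.
public record AdCandidate(long adId, double bid, List<String> keywords) {
    public AdCandidate {
        // Defensive copy so the collection field is itself immutable.
        keywords = List.copyOf(keywords);
    }

    public static class Builder {
        private long adId;
        private double bid;
        private List<String> keywords = List.of();
        public Builder adId(long v) { this.adId = v; return this; }
        public Builder bid(double v) { this.bid = v; return this; }
        public Builder keywords(List<String> v) { this.keywords = v; return this; }
        public AdCandidate build() { return new AdCandidate(adId, bid, keywords); }
    }

    public static void main(String[] args) {
        AdCandidate ad = new Builder()
            .adId(42L).bid(1.25).keywords(List.of("shoes")).build();
        System.out.println(ad.bid()); // prints 1.25
    }
}
```

Once built, no code path can mutate the object, so it is safe to share across concurrently executing nodes.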

The write-once data type can be defined as follows:

  • An immutable data type is a write-once data type
  • For each field in a write-once data type, if there are setter methods, they must check the write-once property (and throw an exception if it is violated) in a thread-safe manner
  • A class is a write-once class if and only if all its field types are of write-once types
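The second rule above, "check the write-once property in a thread-safe manner," can be made concrete with a small cell built on compare-and-set. This is a minimal sketch under assumed names (WriteOnce is hypothetical, not the production type):

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: a write-once cell. The first set wins; any later set
// throws, and the check is thread-safe via an atomic compare-and-set.
public class WriteOnce<T> {
    private final AtomicReference<T> ref = new AtomicReference<>();

    public void set(T value) {
        if (!ref.compareAndSet(null, value)) {
            throw new IllegalStateException("write-once violation: value already set");
        }
    }

    public T get() { return ref.get(); }

    public static void main(String[] args) {
        WriteOnce<Double> bid = new WriteOnce<>();
        bid.set(0.75);
        System.out.println(bid.get()); // prints 0.75
        try {
            bid.set(0.90); // second write: rejected
        } catch (IllegalStateException expected) {
            System.out.println("second write rejected");
        }
    }
}
```

Because the second write fails loudly rather than silently overwriting, a race like the bid-multiplier example above surfaces as an exception in testing instead of an ML feature discrepancy in production.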

One of the difficulties in implementing write-once data models is to be able to identify which pieces of data can be immutable. This requires us to analyze the whole codebase and abstract their usage patterns. With nested complex thrift structures passed around the whole funnel, it is not an easy task, particularly when the code keeps changing during the migration. In the following sections, we introduce two case studies on how we design write-once data models.

Case Study I: Request Level Data

A simple case of a write-once data structure is request-level data, which maintains contextual information passed from the client along with data accumulated throughout the request-handling lifecycle. This includes the request ID, session ID, timestamp, experiments, ML features, etc. All of these can be read-only after initialization, so we can define a write-once data structure by converting all of its fields into immutable data types.

A special case is converting Thrift types, which are used extensively throughout the funnel, into immutable types. The open-source Thrift compiler generates Java classes with both setter and getter methods, making them mutable by default. To produce immutable Java classes, we developed a custom script that generates classes with getters only, returning immutable Java objects.

Case Study II: Candidates Table

Another common write-once data structure is the candidate table. This is a columnar table that stores a list of candidates and their properties. Each row represents a candidate, and each column is a candidate property of immutable type. Nodes that run in parallel can read/write different columns as long as they satisfy the read-after-write and thread-safety rules. Below is an illustration of concurrent access of different columns by a read iterator from one node and a write iterator from another node.

Figure 2. Write-once columnar table
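A minimal sketch of such a columnar table with per-cell write-once enforcement follows. The class and column names are hypothetical, and the real table supports arbitrary typed columns rather than two fixed ones:

```java
// Hypothetical sketch: a tiny columnar candidate table. Each row is a candidate;
// each cell in a derived column is write-once, so a node reading one column and
// a node writing another can run concurrently without stepping on each other.
public class CandidateTable {
    private final double[] bids;          // input column, fixed at construction
    private final double[] adjustedBids;  // derived column, write-once per cell
    private final boolean[] adjustedSet;

    public CandidateTable(double[] bids) {
        this.bids = bids.clone();
        this.adjustedBids = new double[bids.length];
        this.adjustedSet = new boolean[bids.length];
    }

    public double bid(int row) { return bids[row]; }

    public void writeAdjustedBid(int row, double value) {
        if (adjustedSet[row]) {
            throw new IllegalStateException("cell already written: row " + row);
        }
        adjustedBids[row] = value;
        adjustedSet[row] = true;
    }

    public double adjustedBid(int row) { return adjustedBids[row]; }

    public static void main(String[] args) {
        CandidateTable t = new CandidateTable(new double[]{1.0, 2.0});
        // One node's write iterator fills adjustedBids from the bids column.
        for (int r = 0; r < 2; r++) t.writeAdjustedBid(r, t.bid(r) * 1.5);
        System.out.println(t.adjustedBid(1)); // prints 3.0
    }
}
```

Column-level ownership is what makes the read-after-write check tractable: the graph validator only needs to know which columns each node reads and writes, not what the node does internally.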

Migrations

The migration from Mohawk to AdMixer is separated into three milestones:

  1. Abstract feature expansion to another microservice to ensure data consistency across both Mohawk and AdMixer. This allowed us to ensure that our ML models and ad trimmers were relying on the exact same data, eliminating one confounding variable.
  2. Rewrite all the logic in Java and run AdMixer in parallel with Mohawk.
  3. Verify correctness of the new service through value-based validation, metric-based validation, and live A/B experiment on production traffic.

In the rest of this post, we will focus on the validation process because it is challenging to guarantee correctness without interfering with normal development by 100+ engineers.

Validations

Since AdMixer is meant to be a complete rewrite of Mohawk, the ideal situation is that given one input, both systems produce identical outputs. However, this is not the case due to the following reasons:

  1. Real-time data dependencies: Certain parts of the ad serving path rely on stateful systems. For example, we rely on budget pacing for all campaigns so that their spending curve can be smoothed over a period of time (e.g., one day or one week). This means that even for exactly the same request processed at the same time, the output of candidates from retrieval can be different, since the pacing behavior may change.
  2. Live backends: The ads serving path involves requests to many external services to gather data for downstream nodes. These external services may have some randomness or inconsistencies, and if any of them return different results, Mohawk and AdMixer may produce different outputs.
  3. Data transformations for writes: In addition to the final ad recommendations, we also need to verify that the data written to PubSub topics, caches, etc., have a high match rate. Since Mohawk and AdMixer use different data models internally, they need different transformations to produce the same outputs to these data sinks.

To tackle all these problems, we built a component-wise validation framework and set up a cluster to run validations against real-time traffic. The idea of the component-wise validation framework is illustrated in Figure 3.

Figure 3. Component-wise validation

For each component, we need to first identify the input and output we want to verify. After that, we log both the input and output into a Kafka stream as one validation test case. A validation service driver consumes from the Kafka stream and applies the input to the corresponding AdMixer component, compares AdMixer’s output with the logged Mohawk output, and outputs any detected discrepancies.

There are three possible cases to validate:

  1. If the Mohawk component doesn’t have any external data dependencies, we expect a 100% match rate since the output is a pure function of its input.
  2. If the Mohawk component has external data dependencies, but the external service is mostly consistent (e.g., 99.9% of the time it returns the same result), then we expect the replay of the same input to the AdMixer component should have around 99.9% match rate.
  3. If the Mohawk component’s output is random even with the same input (e.g., the case of pacing), we need to split the verification into two parts: the request generation logic and the response processing. Taking the pacing example, we would first validate that for a given ad request, Mohawk and AdMixer generate identical requests to the retrieval system. Then, we would validate that, given the same retrieval response, the output of the retrieval-subgraph within Mohawk and AdMixer match.
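The replay-and-compare idea behind cases 1 and 2 can be sketched as a driver that runs logged (input, expected output) pairs through a component and reports the match rate. Everything here is illustrative: the real framework consumes test cases from Kafka and compares much richer payloads than strings:

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: replaying logged (input, Mohawk-output) pairs through
// an AdMixer component and computing the fraction that match.
public class ComponentValidator {
    public record TestCase(String input, String expectedOutput) {}

    public static double matchRate(List<TestCase> cases,
                                   Function<String, String> component) {
        if (cases.isEmpty()) return 1.0;
        long matches = cases.stream()
            .filter(tc -> component.apply(tc.input()).equals(tc.expectedOutput()))
            .count();
        return (double) matches / cases.size();
    }

    public static void main(String[] args) {
        List<TestCase> logged = List.of(
            new TestCase("req1", "REQ1"),
            new TestCase("req2", "REQ2"),
            new TestCase("req3", "oops"));
        // Two of the three logged cases match this stand-in component.
        System.out.println(matchRate(logged, String::toUpperCase));
    }
}
```

For a pure component the expected match rate is 100%; for a component with mostly consistent external dependencies, the threshold is relaxed to roughly the backend's own consistency rate.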

Online Metrics and Experiments

After the value-based validation, we are 99% confident that the input and output of each component is correct. We still need to run end-to-end tests (dark traffic) to ensure that system metrics (e.g., success rate to external services, latencies, etc.) are close to production, and side effects such as logging and caching are not breaking contracts between the ads serving platform and downstream teams (such as ads reporting).

Finally, after achieving the desired validation rates for each component, we ran a live A/B experiment between Mohawk and AdMixer to ensure that various top-line metrics (revenue, ad performance, Pinner metrics, latency, etc.) were not negatively impacted. The whole process is illustrated in Figure 4.

Figure 4. The end-to-end validation process

Wins and Learnings

As mentioned in Part 1 of the blog post, we have achieved all of our goals, including supporting product launches more easily and safely, improving our developer satisfaction by 100%, and even saving significantly on infrastructure costs. We would also like to share some learnings from this journey to help other teams and companies when they are taking on similar challenges.

Double Implementations

One of the biggest challenges in the migration of such a large and actively developed project is how to keep both implementations in sync. The double implementation cost is unavoidable. However, we minimized the overhead by:

  1. Delaying the requirement to double-implement new features until the code-complete phase, when the AdMixer implementation was roughly on par with Mohawk. Before the code-complete phase, all projects on the old Mohawk codebase could carry on in parallel.
  2. Implementing a validation service that continuously monitored Mohawk-AdMixer discrepancies in real time. This allowed us to swiftly triage any change that caused discrepancies and ensured a timely double implementation.

One fun fact about the real-time validation framework is that it was not thrown away after the migration. Instead, we found it extremely useful for detecting unintentional output changes from submitted code changes. We expanded it into a regression-test framework that detects data anomalies from all submitted code changes before they are merged into the codebase. This prevents many business-metric regressions that would otherwise go unnoticed over time.

Experiment Result Parity

Another critical part of the final launch was achieving parity in our A/B experiment results. All Pinterest ads experiments must go through a comprehensive study of their key metrics to understand what caused any movement. Even a small difference (e.g., 0.5%) in key metrics such as revenue and engagement rate may block the final launch.

To be able to reason about the final metric differences, we monitor all their upstream metrics and create real-time alerts. These metrics include success rates to external services, cache hit rates, and candidate trim rates. We double implemented all required metrics in AdMixer and relied on these to ensure both services were operating in a healthy, consistent manner. These metrics turned out to be very useful in debugging the final experiment metrics that are hard to capture with value-based validations.

Alignment and Execution

During such a long and arduous project with no intermediate returns, it was extremely important for us to have buy-in from all our critical stakeholders. Thanks to our detailed upfront evaluation, the Monetization leadership team was aligned on the importance of this investment for future business growth and health, which allowed us to push through all the complications, technical and otherwise, that arose during the two-year long project.

Last but not least, persistence in execution was just as important. There were many ups and downs in the two-year journey, and the team successfully delivered under a tight timeline. A huge shoutout is due to all the engineers who worked long days and nights to push this project across the finish line. Their collaborative spirit to Act As One during hard times was invaluable in enabling a successful launch for the new AdMixer service.

Acknowledgements

We would like to thank the following people who had significant contributions to this project:

Miao Wang, Alex Polissky, Humsheen Geo, Anneliese Lu, Balaji Muthazhagan Thirugnana Muthuvelan, Hugo Milhomens, Lili Yu, Alessandro Gastaldi, Tao Yang, Crystiane Meira, Huiqing Zhou, Sreshta Vijayaraghavan, Jen-An Lien, Nathan Fong, David Wu, Tristan Nee, Haoyang Li, Kuo-Kai Hsieh, Queena Zhang, Kartik Kapur, Harshal Dahake, Joey Wang, Naehee Kim, Insu Lee, Sanchay Javeria, Filip Jaros, Weihong Wang, Keyi Chen, Mahmoud Eariby, Michael Qi, Zack Drach, Xiaofang Chen, Robert Gordan, Yicheng Ren, Luman Huang, Soo Hyung Park, Shanshan Li, Zicong Zhou, Fei Feng, Anna Luo, Galina Malovichko, Ziyu Fan, Jiahui Ding, Andrei Curelea, Aayush Mudgal, Han Sun, Matt Meng, Ke Xu, Runze Su, Meng Mei, Hongda Shen, Jinfeng Zhuang, Qifei Shen, Yulin Lei, Randy Carlson, Ke Zeng, Harry Wang, Sharare Zehtabian, Mohit Jain, Dylan Liao, Jiabin Wang, Helen Xu, Kehan Jiang, Gunjan Patil, Abe Engle, Ziwei Guo, Xiao Yang, Supeng Ge, Lei Yao, Qingmengting Wang, Jay Ma, Ashwin Jadhav, Peifeng Yin, Richard Huang, Jacob Gao, Lumpy Lum, Lakshmi Manoharan, Adriaan ten Kate, Jason Shu, Bahar Bazargan, Tiona Francisco, Ken Tian, Cindy Lai, Dipa Maulik, Faisal Gedi, Maya Reddy, Yen-Han Chen, Shanshan Wu, Joyce Wang, Saloni Chacha, Cindy Chen, Qingxian Lai, Se Won Jang, Ambud Sharma, Vahid Hashemian, Jeff Xiang, Shardul Jewalikar, Suman Shil, Colin Probasco, Tianyu Geng, James Fish

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.
