Redesigning Pinterest’s Ad Serving Systems with Zero Downtime

Ning Zhang; Principal Engineer | Ang Xu; Principal Machine Learning Engineer | Claire Liu; Staff Software Engineer | Haichen Liu; Staff Software Engineer | Yiran Zhao; Staff Software Engineer | Haoyu He; Sr. Software Engineer | Sergei Radutnuy; Sr. Machine Learning Engineer | Di An; Sr. Software Engineer | Danyal Raza; Sr. Software Engineer | Xuan Chen; Sr. Software Engineer | Chi Zhang; Sr. Software Engineer | Adam Winstanley; Staff Software Engineer | Johnny Xie; Sr. Staff Software Engineer | Simeng Qu; Software Engineer II | Nishant Roy; Manager II, Engineering | Chengcheng Hu; Sr. Director, Engineering |

Introduction

The ads-serving platform is the highest-scale recommendation system at Pinterest, responsible for delivering >$3B in yearly revenue and one of the most business-critical systems at the company! From late 2021 to mid-2023, the Ads Infra team, along with several key collaborators, redesigned and rewrote this system entirely from scratch to address years of tech debt and lay the foundations for the next 5+ years of audacious business goals. In this blog post, we will describe the motivations and challenges of this rewrite, along with our wins and learnings from this two-year journey.

Overview of the Pinterest Ads Serving System

The ad serving service sits at the center of Pinterest’s ad delivery funnel. Figure 1 (below) depicts a high-level overview of “Mohawk,” the first version of Pinterest’s ads serving system. It took a request from the organic side and returned the top-k ad candidates, which were blended into organic results before being sent to users for rendering. Internally, it acted as middleware connecting other services, such as the feature expander, retrieval, and ranking, before finally returning the top-k ads.

Figure 1. Overview of the Pinterest ad serving system

Motivations

Rewriting the service at the heart of the business is an expensive and risky endeavor. This section describes how we arrived at this decision.

Mohawk, implemented in 2014, was Pinterest’s first ad serving system. During its eight-year lifespan, Mohawk became one of the most complex systems at Pinterest. As of 2022, Mohawk:

  • Served more than 2 billion ad impressions per day and generated $2.8 billion in ad revenue
  • Handled ad requests from a dozen user-facing surfaces, serving hundreds of millions of Pinners in over 30 countries
  • Relied on 70+ backends for feature/data fetching, predictions, candidate generation, bidding/pacing/budget management, etc.
  • Had more than 380K lines of code and 200+ experiments, modified by more than 100 engineers across different teams

As our ad business and engineering team grew rapidly, Mohawk accumulated significant complexity and tech debt. These complexities made the system increasingly brittle, costing us several eng-weeks to resolve outages.

Many of the incidents were not caused by obvious code bugs, which made them hard to catch with unit tests or even integration tests. Instead, they stemmed from fundamental design flaws in the platform, such as:

  1. Tight coupling of infra frameworks and business logic: Simple application logic changes required deep knowledge of the infra frameworks.
  2. Lack of proper modularization and ownership: Features or functionality that should have lived in individual modules were collocated in the same directories/files/methods, making it hard to define a good code ownership structure. It also resulted in conflicting changes and code bugs.
  3. No guarantees of data integrity: The Mohawk framework did not support the enforcement of data integrity constraints, e.g., ensuring that ML features are consistent between serving and logging.
  4. Unsafe multi-threading: All developers could freely add multi-threaded code to the system without any proper frameworks for error handling or race conditions, resulting in latent software bugs that were hard to detect.

In Q3 2021, we formed a working group to decide whether a complete rewrite or a major refactor was warranted.

Decision Making

It took us three months to research, survey, prototype, and scrutinize different options before finally making a decision to rewrite Mohawk into a Java-based service. The final decision was mainly based on two points:

  1. A major refactor in place could take more time than rewriting from scratch. One reason is that refactoring an online service must be broken down into many small code changes, each of which needs to go through rigorous experiments to make sure it does not cause any regressions or outages. This can take days to weeks per experiment. A complete rewrite, on the other hand, can achieve higher throughput up until the final A/B experiment phase.
  2. Pinterest organic mixers are all built on a Java-based framework. Rewriting the AdMixer service using the same framework would open the door to unifying organic and ads blending for deeper optimization.

With agreement from all Monetization stakeholders, the AdMixer Rewrite project was kicked off at the end of 2021.

Design Principles

The goal of the AdMixer Rewrite project was to build an ads platform that enabled hundreds of developers to build new products and algorithms for rapid business growth while minimizing the risk to production health. We identified the following Engineering Design principles to help us build a system that would achieve this goal:

  1. Easily extensible: The framework and APIs need to be flexible enough to support extension with new functionality as well as deprecation of old functionality. Design-for-deprecation is an often-omitted property, which is why technical systems become bloated over time.
  2. Separation of concerns: Separate the infra framework from business logic by defining high-level abstractions that business logic can build on. Business logic owned by different teams needs to be modularized and isolated from other teams’ logic.
  3. Safe-by-design: Our framework should support the safe use of concurrency and the enforcement of data integrity rules by default. For example, we want to enable developers to leverage concurrency for performant code while ensuring there are no race conditions that may cause ML feature discrepancy across serving and logging.
  4. Development velocity: The framework should provide well-supported development environments and easy-to-use tools for debugging and analyses.

Design Decisions

With these principles in mind, designing such a complex software system required us to answer two key questions:

  1. How do we organize the code so that one team’s change does not break another team’s code?
  2. How do we manage data to guarantee correctness and desired properties throughout the service?

To answer these questions, we needed to fully understand the current business logic and how data is manipulated, and then build a high-level abstraction on top of it. Figure 2 depicts such a high-level example of code organization. Code is represented as a directed acyclic graph (DAG): each node is a logically coherent piece of business logic, and each edge is a data dependency between nodes, with data passed from upstream to downstream nodes. The graph structure makes extensibility and development velocity achievable through better modularity. To be safe-by-design, we also need to guarantee that the data passed through the graph is thread-safe.
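To make the idea concrete, here is a minimal sketch of this kind of DAG organization. The names (`Node`, `execute`, the “retrieval” and “ranking” stages) are illustrative assumptions, not the actual Apex API; a real framework would also schedule independent nodes concurrently.

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical sketch of DAG-based code organization: each node wraps one
// logically coherent piece of business logic; edges are data dependencies.
class DagSketch {
    static final class Node {
        final String name;
        final List<Node> upstream;                     // data dependencies
        final Function<List<Object>, Object> logic;    // the business logic
        Node(String name, List<Node> upstream, Function<List<Object>, Object> logic) {
            this.name = name; this.upstream = upstream; this.logic = logic;
        }
    }

    // Runs each node after its upstream nodes have produced their outputs
    // (for simplicity, nodes are assumed to be given in topological order).
    static Map<String, Object> execute(List<Node> nodes) {
        Map<String, Object> results = new LinkedHashMap<>();
        for (Node n : nodes) {
            List<Object> inputs = new ArrayList<>();
            for (Node up : n.upstream) inputs.add(results.get(up.name));
            results.put(n.name, n.logic.apply(inputs));
        }
        return results;
    }

    // A toy two-stage graph: retrieval feeds ranking, which keeps the top-k.
    static List<?> runExample() {
        Node retrieval = new Node("retrieval", List.of(),
                in -> List.of("ad1", "ad2", "ad3"));
        Node ranking = new Node("ranking", List.of(retrieval),
                in -> ((List<?>) in.get(0)).subList(0, 2)); // top-k = 2
        return (List<?>) execute(List.of(retrieval, ranking)).get("ranking");
    }

    public static void main(String[] args) {
        System.out.println(runExample()); // prints [ad1, ad2]
    }
}
```

Because each node only sees the outputs of its declared upstream nodes, one team’s change stays inside its own node and cannot silently reach into another team’s logic.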

Based on the above desired end state, we made two major design decisions:

  1. use an in-house graph execution framework called Apex to organize the code into DAGs, and
  2. build an innovative data model that is passed through the graph to guarantee safe execution.
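One simple way to make data passed through the graph safe by construction is a write-once container: each value can be set exactly once, so the features a node reads for serving are guaranteed to match what is later logged, even under concurrency. This is a hypothetical sketch of the idea, not the actual data model used in AdMixer.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical write-once data container: keys can be written exactly once,
// eliminating races where serving and logging could observe different values.
final class WriteOnceData {
    private final ConcurrentHashMap<String, Object> values = new ConcurrentHashMap<>();

    // putIfAbsent is atomic, so concurrent writers cannot both succeed.
    public void put(String key, Object value) {
        if (values.putIfAbsent(key, value) != null) {
            throw new IllegalStateException("Key already written: " + key);
        }
    }

    public Object get(String key) {
        return values.get(key);
    }
}
```

Failing loudly on a second write turns a silent data-integrity bug (the class of incident described above) into an immediate, debuggable error.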

Due to space constraints, we only summarize the final results here. We encourage interested readers to refer to the second part of this blog post for the detailed design, implementation, and migration verification.

Summary

We are proud to report that the AdMixer service has been running live in production for almost three full quarters, with no significant outages as part of the migration. This was a huge achievement for the team, since we launched right before the 2023 holiday season, which is traditionally the most critical part of the year for our ads business.

Looking back at the goal we set at the beginning, to speed up product innovation safely with a large team, we are happy to report that we have achieved it. The Monetization team has already launched several new product features in the new system (e.g., our third-party ads partnership with Google was developed entirely on AdMixer). We have grown to more than 280 engineers contributing to the new codebase. Our developer satisfaction (NPS) survey score has nearly doubled, from 46 to 90! Finally, our new service runs on more efficient hardware (AWS Graviton instances), which resulted in several million dollars of infra cost savings.

In the second part of this blog post, we will discuss the detailed design decisions and the challenges we encountered during the migration. We hope some of these learnings are helpful to similar projects in the future.

Acknowledgements

We would like to thank the following people for their significant contributions to this project:

Miao Wang, Alex Polissky, Humsheen Geo, Anneliese Lu, Balaji Muthazhagan Thirugnana Muthuvelan, Hugo Milhomens, Lili Yu, Alessandro Gastaldi, Tao Yang, Crystiane Meira, Huiqing Zhou, Sreshta Vijayaraghavan, Jen-An Lien, Nathan Fong, David Wu, Tristan Nee, Haoyang Li, Kuo-Kai Hsieh, Queena Zhang, Kartik Kapur, Harshal Dahake, Joey Wang, Naehee Kim, Insu Lee, Sanchay Javeria, Filip Jaros, Weihong Wang, Keyi Chen, Mahmoud Eariby, Michael Qi, Zack Drach, Xiaofang Chen, Robert Gordan, Yicheng Ren, Luman Huang, Soo Hyung Park, Shanshan Li, Zicong Zhou, Fei Feng, Anna Luo, Galina Malovichko, Ziyu Fan, Jiahui Ding, Andrei Curelea, Aayush Mudgal, Han Sun, Matt Meng, Ke Xu, Runze Su, Meng Mei, Hongda Shen, Jinfeng Zhuang, Qifei Shen, Yulin Lei, Randy Carlson, Ke Zeng, Harry Wang, Sharare Zehtabian, Mohit Jain, Dylan Liao, Jiabin Wang, Helen Xu, Kehan Jiang, Gunjan Patil, Abe Engle, Ziwei Guo, Xiao Yang, Supeng Ge, Lei Yao, Qingmengting Wang, Jay Ma, Ashwin Jadhav, Peifeng Yin, Richard Huang, Jacob Gao, Lumpy Lum, Lakshmi Manoharan, Adriaan ten Kate, Jason Shu, Bahar Bazargan, Tiona Francisco, Ken Tian, Cindy Lai, Dipa Maulik, Faisal Gedi, Maya Reddy, Yen-Han Chen, Shanshan Wu, Joyce Wang, Saloni Chacha, Cindy Chen, Qingxian Lai, Se Won Jang, Ambud Sharma, Vahid Hashemian, Jeff Xiang, Shardul Jewalikar, Suman Shil, Colin Probasco, Tianyu Geng, James Fish

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.
