Pre-Submit Integration Tests For Ads-Serving

Pinterest Engineering
Pinterest Engineering Blog
Sep 19, 2020


Nishant Roy | Ads Serving

Introduction & Background

The ads-serving platform is the highest-scale, highest-complexity, and highest-velocity recommendation service at Pinterest. Our ads business is growing and expanding, and the ads engineering team is iterating quickly to keep improving the system. It is therefore vital to keep the system healthy in order to protect the Pinner experience, preserve business health, and maintain high developer velocity.

Ever since the first ad was shown on Pinterest, all of our ads-serving backend services have been deployed automatically and continuously. Changes are first rolled out to a single machine, which we monitor for very obvious problems like service crashes or large increases in error logs. Next, we roll out to 1% of the production fleet, where we let the changes soak for two hours, which allows us to detect more nuanced issues like a drop in a certain type of ad or a large variation in ads from a certain candidate source. It is very hard to write unit tests to catch these bugs, because the symptoms typically only show up at scale, when the system is serving thousands of requests per second. When there is such a variation in metrics, an alert is sent to the on-call engineer, who is responsible for pausing the deploy before it rolls out to the full production fleet.
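
To make the stages concrete, here is a small sketch of that rollout expressed as data in Go. The stage scopes and signals follow the description above, while the struct, field names, and the single-host bake time are illustrative assumptions rather than Pinterest's actual deploy configuration.

```go
// A sketch of the staged rollout expressed as data; names, the single-host
// bake time, and the watch lists are assumptions for illustration.
package main

import (
	"fmt"
	"time"
)

// rolloutStage captures one step of the canary process: how much of the
// fleet receives the change, how long it soaks, and what we watch for.
type rolloutStage struct {
	scope string
	bake  time.Duration
	watch []string
}

var rollout = []rolloutStage{
	{"single host", 30 * time.Minute, []string{"service crashes", "error-log spikes"}}, // bake time assumed
	{"1% of fleet", 2 * time.Hour, []string{"drop in a certain ad type", "variation from a candidate source"}},
	{"full fleet", 0, []string{"production dashboards"}},
}

func main() {
	for _, s := range rollout {
		fmt.Printf("%-12s soak %-8s watch %v\n", s.scope, s.bake, s.watch)
	}
}
```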

This system worked extremely well early on, but as the ads-serving team grew to nearly 100 developers, it became much harder for a single on-call engineer to find the bug among so many code changes. We have also had cases where the deploy was not paused in time and the problematic code rolled out to all production hosts, impacting our business. With an increase in the number of code-related incidents and a higher burden on the on-call engineers, we decided to invest in improving our deployment process. We realized that we needed to make developers accountable for fixing their bugs by stress-testing their code in a production-like environment to identify regressions before they could merge their code.

Our criteria for success were for this tool to:

  • Capture more nuanced problems that may not be caught by unit tests or local testing
  • Shift the onus of debugging problematic code from the on-call engineer to the code author

Design

We already have a set of key metrics, such as success rate, latency, and ad insertions, that we monitor to determine whether a code change is problematic. The driving question was: how do we capture variation in these metrics before the code is merged?

We built a pipeline that deploys two versions of the service: the user’s local branch and the latest committed version. We assume the latest commit is good and does not cause any metrics regression, since it has already been tested by our framework. The pipeline then runs a load test with a copy of production traffic and compares critical business metrics such as latency, success rate, and the number of ad insertions. We also built a web interface so engineers can see the metric values and the exact reason their code change failed the test.

Figure 1: End-to-end workflow for the pre-submit test
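
The exact pipeline is internal to Pinterest, but its control flow can be sketched in a few lines of Go. The helpers below (deployHost, replayTraffic, compareMetrics) are hypothetical stand-ins for the real deployment, replay, and comparison steps, stubbed out so the skeleton compiles:

```go
// A minimal sketch of the pre-submit pipeline's control flow under stated
// assumptions; every helper here is a hypothetical placeholder.
package main

import "log"

type host struct{ addr string }

// deployHost would build the given revision and start it on an isolated
// test machine; stubbed for illustration.
func deployHost(revision string) host { return host{addr: revision + ":9090"} }

// replayTraffic would send an identical copy of sampled production
// traffic to both hosts; stubbed for illustration.
func replayTraffic(test, golden host) error { return nil }

// compareMetrics would pull metrics from both hosts and return the names
// of any that regressed beyond their thresholds; stubbed for illustration.
func compareMetrics(test, golden host) []string { return nil }

func main() {
	testHost := deployHost("pr-branch")     // the author's local branch
	goldenHost := deployHost("latest-main") // the latest committed version

	if err := replayTraffic(testHost, goldenHost); err != nil {
		log.Fatalf("traffic replay failed: %v", err)
	}
	if regressed := compareMetrics(testHost, goldenHost); len(regressed) > 0 {
		log.Fatalf("blocking the PR, regressed metrics: %v", regressed)
	}
	log.Println("pre-submit test passed")
}
```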

Now, the onus of debugging bad code shifts from the on-call engineer to the code author. Every time a developer creates a new PR, this test is automatically triggered, and will block the PR from being merged if the test fails. This is a healthier distribution of work, because:

  1. We catch problems at the PR level, rather than at the deploy level, where a single deploy may contain multiple PRs
  2. Developers have more context on their own change than the on-call engineer, making it easier to debug
  3. The on-call engineer can now focus their attention on more pressing production issues

Some interesting aspects of the framework are described below:

Test environment isolation

We had three main requirements for the test environment:

  • Do not serve live Pinner-facing traffic
    Solution: Do not allow these hosts to register to the production serverset
  • Do not interact with systems serving live traffic, if possible
    Solution: Adjust the configuration so we send requests to “playground” or test environments
  • Do not pollute production metrics
    Solution: Disable the background thread that publishes metrics to OpenTSDB (a minimal sketch of this gating follows the list)
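
The details of our configuration are internal, but the gist is that a single mode switch keeps a test host out of the production serverset, points it at test dependencies, and silences the metrics publisher. A minimal Go sketch of that gating, with a hypothetical flag and function names:

```go
// A sketch of gating production side effects behind one flag. The flag
// name, helper functions, and publisher loop are illustrative assumptions,
// not the actual production configuration.
package main

import (
	"flag"
	"log"
	"net/http"
	"time"
)

var preSubmitMode = flag.Bool("presubmit_mode", false,
	"run in isolated pre-submit test mode")

// registerWithServerset would announce this host to service discovery so
// it receives live Pinner traffic; production-only.
func registerWithServerset() {}

// publishMetricsLoop is the background thread that flushes counters to
// OpenTSDB; production-only.
func publishMetricsLoop() {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		// flush counters to OpenTSDB (omitted)
	}
}

func main() {
	flag.Parse()

	if !*preSubmitMode {
		// Production only: take live traffic and report metrics.
		registerWithServerset()
		go publishMetricsLoop()
	}
	// In pre-submit mode the host never joins the serverset and never
	// emits metrics; downstream clients would also be pointed at
	// "playground"/test environments via configuration (omitted here).

	log.Printf("starting ads server, presubmit_mode=%v", *preSubmitMode)
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```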

Test traffic

Since we aren’t serving live Pinner-facing traffic, how do we load test the system to generate metrics?

We log a small sample of all production requests to a Kafka topic. When running the test, we tail this topic and send these requests to our test and golden hosts. This way, we know our test traffic is a fair approximation of production requests, and both code versions receive identical requests, removing potential bias from the test. We can also control the traffic load (requests per second) and filter out requests that don’t meet certain criteria.
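
As a rough illustration, the replay loop might look like the sketch below. The broker, topic, host addresses, and the segmentio/kafka-go client are assumptions for the example; the actual logging format and client may differ.

```go
// A traffic-replay sketch: tail a Kafka topic of sampled production
// requests and mirror each payload to the test and golden hosts.
package main

import (
	"bytes"
	"context"
	"log"
	"net/http"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka-broker:9092"},   // assumed broker address
		Topic:   "ads-request-sample",            // assumed topic of sampled requests
		GroupID: "presubmit-replay",
	})
	defer reader.Close()

	hosts := []string{"http://test-host:9090/ads", "http://golden-host:9090/ads"}
	limiter := time.NewTicker(10 * time.Millisecond) // ~100 requests/second
	defer limiter.Stop()

	for {
		msg, err := reader.ReadMessage(context.Background())
		if err != nil {
			log.Fatalf("reading request sample: %v", err)
		}
		<-limiter.C // control the load (requests/second)

		// Send the identical payload to both versions so neither side of
		// the comparison sees different traffic.
		for _, h := range hosts {
			resp, err := http.Post(h, "application/octet-stream", bytes.NewReader(msg.Value))
			if err != nil {
				log.Printf("replay to %s failed: %v", h, err)
				continue
			}
			resp.Body.Close()
		}
	}
}
```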

Metrics Comparison

If we don’t publish metrics to OpenTSDB, how do we compare the test and golden metrics?

Our ads-serving platform uses the Go expvar package to compute metrics. This package exposes all the metric variables in JSON format via HTTP at the /debug/vars endpoint. Once the load test is complete, we fetch the metrics from this endpoint on both the test and golden hosts and compare them to see if there is a significant variation (based on predefined thresholds for each metric).
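
Since /debug/vars is plain JSON, the comparison itself can be quite simple. The sketch below fetches the top-level numeric expvars from both hosts and flags any metric whose relative change exceeds a per-metric threshold; the metric names, thresholds, and host addresses are examples only, not our production configuration.

```go
// A sketch of the metrics comparison step against the standard expvar
// /debug/vars endpoint; metric names and thresholds are illustrative.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"math"
	"net/http"
)

// thresholds maps a metric name to the maximum allowed relative change
// between the golden and test hosts.
var thresholds = map[string]float64{
	"ads_insertions": 0.05, // fail on more than 5% drift
	"success_rate":   0.01,
	"latency_ms_p99": 0.10,
}

// fetchVars pulls the JSON exposed by the expvar package and keeps only
// top-level numeric values; nested objects like memstats are skipped here.
func fetchVars(host string) (map[string]float64, error) {
	resp, err := http.Get(host + "/debug/vars")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var raw map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&raw); err != nil {
		return nil, err
	}
	vars := make(map[string]float64)
	for name, v := range raw {
		if f, ok := v.(float64); ok {
			vars[name] = f
		}
	}
	return vars, nil
}

// compare returns a human-readable failure for every metric whose
// relative change exceeds its threshold.
func compare(test, golden map[string]float64) []string {
	var failures []string
	for name, limit := range thresholds {
		g, t := golden[name], test[name]
		if g == 0 {
			continue // avoid division by zero in this sketch
		}
		if math.Abs(t-g)/g > limit {
			failures = append(failures, fmt.Sprintf(
				"%s: golden=%.2f test=%.2f (allowed drift %.0f%%)", name, g, t, limit*100))
		}
	}
	return failures
}

func main() {
	testVars, err := fetchVars("http://test-host:9090")
	if err != nil {
		log.Fatal(err)
	}
	goldenVars, err := fetchVars("http://golden-host:9090")
	if err != nil {
		log.Fatal(err)
	}
	if failures := compare(testVars, goldenVars); len(failures) > 0 {
		log.Fatalf("pre-submit test failed:\n%v", failures)
	}
	log.Println("no significant metric variation")
}
```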

Web Interface

How do we make this new testing framework as simple and self-serve as possible?

We provided a web interface for the developer to view all the information they need to understand why their code change failed the test. They can see details about their test, including which metrics failed, as well as service logs to understand system behavior. Developers can add log messages to their PR, and they will be able to see them in the pre-submit test web UI, which is really helpful when debugging problems.

Figure 2: Web interface showing test details, metrics, and logs to help developers understand their results

Impact

This testing framework makes it easy to catch problems and bugs that are hard to detect through unit tests and local testing. After deploying and stabilizing the framework, we reduced the number of rollbacks by 30% from Q4 2018 to Q1 2019. For new feature launches and incident post-mortems, one of the top recommendations is now to add metrics that the pre-submit test can monitor, to ensure stability and prevent the same issue from recurring.

Today, our pre-submit test for the ads-serving platform monitors around 60 key service metrics. We started out with top-level indicators such as overall success rate and latency, and we now have more granular metrics for the various stages of an ads-serving request. This allows us to catch more nuanced bugs: spikes or drops in ad insertions or logging volume, expensive RPC calls, goroutine leaks, error logs, process crashes, and more.

Because of our success on ads-serving, we adapted this tool for some of our peer teams, improving production stability across the ads stack. The ads-indexing team leverages this framework to validate an index of several million ads for each code change, preventing corrupt indexes from being generated, which could have a severe impact and take several hours to resolve. Similarly, the ads-marketplace team uses this framework to ensure the allocation and pricing logic is stable, since bugs there could have a significant impact on revenue.

Future Improvements

One limitation of the current pre-submit test system is that it can only be used to test code changes. If someone’s change includes starting a new experiment, that experiment is likely either off or running at an extremely low percentage of traffic, so problems caused by these changes could slip through. It is possible for the developer to change their code to always trigger the experiment, run it through the pre-submit test, and then change it back before merging their code. However, this is error-prone and hard to enforce.

We would also like to leverage the powerful tools provided by Go to profile our system’s performance through this testing framework. This would allow us to capture and visualize performance degradations over time, making it easier to pinpoint the root cause.
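
Go’s net/http/pprof package makes this straightforward to wire up: importing it registers the /debug/pprof handlers on the same default mux that already serves /debug/vars. A minimal sketch, with the admin port chosen arbitrarily:

```go
// A sketch of exposing Go's built-in profiler alongside expvar so the
// framework could capture CPU and heap profiles during the load test.
// The admin port is an assumption; the handlers are standard library.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func main() {
	// expvar's /debug/vars is also served from the default mux once the
	// expvar package is imported elsewhere in the binary.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

A test run could then capture a CPU profile with go tool pprof http://test-host:6060/debug/pprof/profile?seconds=30 and compare it against the golden host’s profile to pinpoint performance regressions.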

Acknowledgements

This project would not have been possible without the help of several members on the ads-infrastructure and release-eng teams. I would like to thank Zack Drach, Liang He, Shu Zhang, Chengcheng Hu, Caijie Zhang, Raymond Xiang, and Wei Zhao for their guidance in the design and execution of this project!
