Debugging Deadlock in PininfoService Ubuntu18 Upgrade: Part 1 of 2
Author: Kangnan Li | Software Engineer, Key Value Systems
Unblock deadlock for PininfoService
This is part 1 of a two-part blog series.
Reading both parts of this series will give you insight into some debugging techniques we are using in the Pinterest Engineering Key Value Systems team (a team derived from the previous Serving System). Related projects owned by this team can be seen in blogs and presentations on Terrapin, Rocksplicator (1 and 2), Aperture and Realpin.
In this series, we divide the contents as below:
Part 1
- How to set up a test/canary environment
- How to design tests with control variables
- How to design tests for root/leaf (i.e. routing/persistence layer) separation
Part 2
- How to generate a heap dump
- How to use GDB for debugging a running process
Pinterest is a platform with more than 400M monthly active users, which means engineers have a chance to work on big data problems at scale. PininfoService (PIS) is one of Pinterest’s many backend services.
PininfoService is a rootless read-only cache service serving >10M batch requests per second per cluster to support Pinterest Homefeed, Related Pins, and others. It fetches customer-defined signals from various source data services (i.e. upstreams) and holds the data with customized time-to-live (TTL).
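To make the "customized TTL" idea concrete, here is a minimal sketch of a read-through cache where each signal type carries its own time-to-live. It is not the PIS implementation: the class name `TTLCache`, the `fetch` callback standing in for an upstream call, and the per-signal TTL table are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Read-through cache where each signal type has its own time-to-live."""

    def __init__(
        self,
        fetch: Callable[[str, str], Any],
        ttl_seconds: Dict[str, float],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch        # hypothetical upstream call: (signal, key) -> value
        self._ttl = ttl_seconds    # hypothetical per-signal TTL table, in seconds
        self._clock = clock        # injectable clock so expiry can be exercised in tests
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get(self, signal: str, key: str) -> Any:
        now = self._clock()
        entry = self._store.get((signal, key))
        if entry is not None and entry[0] > now:
            return entry[1]        # entry is still fresh: serve it from the cache
        # Miss or expired entry: go to the upstream and remember the result.
        value = self._fetch(signal, key)
        self._store[(signal, key)] = (now + self._ttl[signal], value)
        return value
```

A short TTL keeps a fast-changing signal fresh at the cost of more upstream calls, while a long TTL does the opposite, which is why a per-signal setting is useful.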
Note: In this article, we use the terms root, leaf, and rootless to classify system architectures. The root, also called the routing layer, is responsible for routing requests and aggregating responses. The leaf (i.e. the persistence layer) is responsible for serializing and deserializing Thrift struct data and for writing to and reading from the local cache storage. In a rootless architecture, each node plays both roles. As a root, a node receives client requests (root requests) and routes them to the nodes hosting the data, which may be other nodes or itself, based on the data distribution map (i.e. the shard map). As a leaf, the same node receives and processes requests routed from root nodes (leaf requests), interacting with the cache and retrieving source data.
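The two roles described in the note can be sketched as a toy in-process model. This is an illustration only: the class `RootlessNode`, the CRC32-based `shard_of` function, and the direct method calls standing in for RPCs are assumptions, not how PIS is actually built.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


class RootlessNode:
    """A node that acts as both root (router) and leaf (cache owner)."""

    def __init__(self, name: str, shard_map: Dict[int, str],
                 cluster: Dict[str, "RootlessNode"]) -> None:
        self.name = name
        self.shard_map = shard_map   # shard id (0..n-1) -> name of the owning node
        self.cluster = cluster       # node name -> node; stands in for an RPC client
        self.cache: Dict[str, str] = {}

    def shard_of(self, key: str) -> int:
        # Hypothetical hashing scheme; any stable hash would do for the sketch.
        return zlib.crc32(key.encode()) % len(self.shard_map)

    def handle_root_request(self, keys: List[str]) -> Dict[str, str]:
        # Root role: group keys by owning node, fan out, aggregate the responses.
        by_owner: Dict[str, List[str]] = defaultdict(list)
        for key in keys:
            by_owner[self.shard_map[self.shard_of(key)]].append(key)
        result: Dict[str, str] = {}
        for owner, owned_keys in by_owner.items():
            # The owner may be another node or this node itself.
            result.update(self.cluster[owner].handle_leaf_request(owned_keys))
        return result

    def handle_leaf_request(self, keys: List[str]) -> Dict[str, str]:
        # Leaf role: serve from the local cache, filling it on a miss.
        out: Dict[str, str] = {}
        for key in keys:
            if key not in self.cache:
                # Placeholder for fetching the signal from an upstream service.
                self.cache[key] = f"value-of-{key}@{self.name}"
            out[key] = self.cache[key]
        return out
```

Because every node holds the same shard map, a client can send a batch to any node and get the same answer; only the owning node ever caches a given key.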
Background
As Ubuntu14 reached its end of life in 2019, we were tasked with upgrading the operating system for PIS from Ubuntu14 (U14) to Ubuntu18 (U18). What made this difficult is that many of our systems are stateful, so a simple cluster rotation is not a feasible option. Therefore, we determined that an in-place upgrade would be the least disruptive to our availability.
To ensure a seamless upgrade, we split the upgrade process into three steps:
- change the service code to make it U18 compatible;
- deploy the U18 service build with dark traffic to ensure comparable results; and
- in-place upgrade the U14 instances to U18.
We have successfully adopted this three-step process to in-place upgrade more than 10,000 instances of our stateful services.
Since we don’t yet have the architecture to support a complete clone of the production cluster’s data and traffic for testing (e.g. Envoy shadow mirroring), we designed two substeps (test + canary environments) to verify the U18 builds:
Test environment: a test cluster with the desired build and runtime configuration. A few hosts from the production environment are put into a separate pool to forward a controllable percentage of traffic to the test cluster in a “fire and forget” manner.
Canary environment: a few hosts are put into a “canary” environment. This canary environment continues to serve the same amount of production traffic, while enabling us to deploy with different build and runtime configurations.
Scheme 1. Test/Canary environment setup.
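The “fire and forget” forwarding used by the test environment can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the class `DarkTrafficForwarder`, the `send_to_test` callback, and the bounded in-memory queue are hypothetical and do not describe the actual forwarding pool.

```python
import queue
import random
import threading
from typing import Any, Callable, Optional


class DarkTrafficForwarder:
    """Copies a percentage of requests to a test cluster without blocking the caller."""

    def __init__(self, send_to_test: Callable[[Any], None], percent: float,
                 max_pending: int = 1000, rng: Optional[random.Random] = None) -> None:
        self._send = send_to_test        # hypothetical client call to the test cluster
        self._percent = percent          # share of traffic to forward, 0 to 100
        self._rng = rng or random.Random()
        self._pending: "queue.Queue[Any]" = queue.Queue(maxsize=max_pending)
        threading.Thread(target=self._drain, daemon=True).start()

    def maybe_forward(self, request: Any) -> bool:
        """Called on the production path; never blocks and never raises."""
        if self._rng.random() * 100.0 >= self._percent:
            return False                 # request not sampled
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            return False                 # drop the copy rather than slow production

    def _drain(self) -> None:
        while True:
            request = self._pending.get()
            try:
                self._send(request)
            except Exception:
                pass                     # fire and forget: test failures are ignored
            finally:
                self._pending.task_done()

    def wait_idle(self) -> None:
        """Block until every queued copy has been sent (useful in tests)."""
        self._pending.join()
```

The production response never depends on the test cluster: the forwarded copy is sent on a background thread, its result is discarded, and a full queue simply drops copies.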
Observation & Hypothesis
During the test with the test setup (Scheme 1), we observed the following two issues from the test cluster:
- QPS drop to 0
- Inconsistent memory usage. A PIS U14 production host usually uses 160–200GB of memory, depending on configuration. On the test cluster, however, some hosts were stuck at <50GB while others increased to ~500GB, causing OOM.
First we needed to determine: Are these two independent issues, or are they caused by one underlying issue? The timeline and stats of the QPS drop and memory usage on the test hosts were not conclusive on this question.
Graph 1. Stats of two U18 hosts with test environment setup in Scheme 1.
Hypothesis
With the above mentioned issues in mind, we investigated the following questions to help probe the issues:
- Is there a code bug due to a U18 syntax change? This was ruled out by double-checking the code changes. For a sample U18 upgrade code change, refer to rocksplicator/pull request 336.
- Does runtime configuration need re-tuning? It is not uncommon that cache services need tuning to best fit the production traffic based on hardware supplied.
- Root or leaf, which layer is the problem?
- Is there a memory leak? Considering that memory usage sometimes increased to 500GB, causing OOM on the U18 host, it is possible that a memory leak was introduced, or that an existing memory bug in U14 was not triggered until U18 (refer to part 2 of this series).
Experiments & Analysis
Our next step was to design experiments to test and verify the above hypotheses. After ruling out a code bug, we focused on narrowing down the problematic part by obtaining hints via re-tuning runtime parameters and separating root/leaf layers.
While conducting experiments, we try to control variables. If the hypothesis is that factor A causes the issue, we design the test to keep non-A factors the same and only vary factor A. Then this test could indicate whether factor A is positively/negatively correlated or not related at all.
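The control-variable approach can be sketched as a small helper that derives a test plan from a baseline configuration, changing exactly one parameter per run. The function name and any parameter names passed to it (for example `worker_threads`) are hypothetical placeholders, not the actual PIS runtime parameters.

```python
from typing import Any, Dict, List, Tuple


def one_factor_at_a_time(
    baseline: Dict[str, Any],
    candidates: Dict[str, Any],
) -> List[Tuple[str, Dict[str, Any]]]:
    """Build a test plan: one control run plus one run per varied parameter."""
    plan: List[Tuple[str, Dict[str, Any]]] = [("control", dict(baseline))]
    for name, value in candidates.items():
        if name not in baseline:
            raise KeyError(f"{name!r} is not a baseline parameter")
        variant = dict(baseline)   # keep every other factor identical to the control
        variant[name] = value      # vary only this one factor
        plan.append((f"vary:{name}", variant))
    return plan
```

For example, `one_factor_at_a_time({"worker_threads": 8, "io_threads": 4}, {"worker_threads": 16, "io_threads": 8})` yields a control run and two variants, each differing from the control in a single key, so any change in behavior can be attributed to that key.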
Re-tune Runtime Configurations
Based on our understanding of the service, we tuned six runtime parameters across eight tests with the test environment setup in Scheme 1. Each test run was performed on three hosts with adjusted runtime configurations. The test run with the old U14 configurations (i.e. Test0–0 in Scheme 2) served as the control group:
Scheme 2. Tuning runtime parameters for U18 test environment.
In Test0–1 to Test0–6, runtime configurations were tuned one at a time, and all of these tests produced better results, which indicated (as hint1) some correlation between thread pools and the two issues. Eventually, we finalized the tuning with Test0–7, which delivered the best performance so far: compared to Test0–0, QPS drops were less frequent and memory usage remained stable at 200GB. However, the issues were not completely resolved, so we decided to continue the remaining tests with the Test0–7 configs as the optimized runtime configurations.
Root/Leaf Isolation
PIS is rootless, which means each node is both root and leaf. To narrow down the source of the issue, we performed tests against root-only or leaf-only nodes.
Scheme 3. Root / leaf isolation test.
In Test1–1 and Test1–2 above, the experimental group is the “root-only” or “leaf-only” host, while the control group is the two remaining U18 rootless hosts in the same test environment. The test results show that the issues remained on the leaf-only host, while the root-only host no longer experienced them. We therefore concluded that the leaf logic is more likely responsible for the process crash and inconsistent memory usage. The following tests were conducted with the Test1–2 setup (1 leaf-only + 2 rootless hosts).
Note: During Test1–2, the two issues occurred nondeterministically on either the experimental (i.e. leaf-only) or control (i.e. rootless) group. Thus, several runs were performed to gather stable test results.
In part 1, we have addressed the first three questions raised in the Hypothesis section. Next, we will focus on memory debugging with the help of a few tools, which will be summarized in part 2 of this series.
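One way to picture the root-only and leaf-only roles is as two independent membership lists: the hosts that accept client (root) requests, and the hosts that own shards and serve leaf requests. The sketch below is an assumption about how such an isolation could be expressed, not a description of the actual test setup; the function `plan_isolation`, the role strings, and the round-robin shard assignment are all hypothetical.

```python
from typing import Dict, List, Tuple


def plan_isolation(roles: Dict[str, str],
                   num_shards: int) -> Tuple[List[str], Dict[int, str]]:
    """Return (hosts receiving client traffic, shard id -> owning host)."""
    # Root role: only rootless and root-only hosts accept client requests.
    client_pool = sorted(h for h, r in roles.items() if r in ("rootless", "root-only"))
    # Leaf role: only rootless and leaf-only hosts own shards.
    shard_owners = sorted(h for h, r in roles.items() if r in ("rootless", "leaf-only"))
    if not client_pool or not shard_owners:
        raise ValueError("need at least one root-capable and one leaf-capable host")
    # Hypothetical round-robin assignment of shards to leaf-capable hosts.
    shard_map = {s: shard_owners[s % len(shard_owners)] for s in range(num_shards)}
    return client_pool, shard_map
```

With one leaf-only host and two rootless hosts, the leaf-only host appears in the shard map but not in the client pool, so it exercises only the persistence-layer code path while the rootless hosts continue to exercise both.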
Summary of Part 1
In this article, we shared how we set up test and canary environments to debug issues encountered during the U18 upgrade of the service. We used control variables to design tests (Test0–0 to Test0–7) that re-tuned runtime configurations, which gave us hints that the thread pools might be related to the two observed issues: QPS dropping to 0 and inconsistent memory usage (either increasing or remaining low). Then, Test1–1 and Test1–2 were designed to narrow the issue down to the leaf layer rather than the root layer.
In part 2, we will discuss how we used tools to probe the issues, rule out the memory leak, and eventually determine the root cause of the issue based on the “thread pool hints (hint1)” described in this article.
Acknowledgement
Key Value Systems’ U14 to U18 migration was a great effort over several months by engineers Kangnan Li, Rakesh Kalidindi, Carlos Castellanos, Madeline Nguyen and Harold Cabalic, who in total upgraded >12K stateful instances of >7 services. Special thanks to Bo Liu, Alberto Ordonez Pereira, Saurabh Joshi, Prem Kumar and Rakesh Kalidindi for their knowledgeable input and help with this debugging process. Thanks to Key Value team manager Jessica Chan and tech lead Rajath Prasad for their support of this work.
To learn more about engineering at Pinterest, check out the rest of our Engineering Blog, and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.