Debugging Deadlock in PininfoService Ubuntu18 Upgrade: Part 2 of 2

Pinterest Engineering

Published in

Pinterest Engineering Blog

7 min readMar 9, 2022

Solving Engineering Problems as Doing Research

Kangnan Li | Software Engineer, Key Value Systems

This is part 2 of a two-part blog series on deep systems debugging techniques in a real-world scenario to upgrade our stateful systems to U18.

In part 1, we narrowed down that the two issues observed — QPS drop and inconsistent memory usage — are from the PininfoService leaf layer. In this article, we narrow down the issue further to GlobalCPUExecutor (GCPU) and eventually the root cause of the issue: a deadlock.

To better understand how requests flow in and out of PininfoService, here is a brief summary of threads (or pools) in order used in PininfoService (also refer to Thrift intervals to learn how fbthrift server works):

Thrift Acceptor Thread: accept connection from clients
ThriftIOPool: process data in/out via established connections between PininfoService and clients who send requests to PininfoService
ThriftWorkerPool: the thread manager provided in the PininfoService logic to process aync_tm_<API> function calls
GlobalCPUExecutor: a global CPU pool to delegate the heavy lifting work, such as processing the response from upstream data stores
ThriftClientPool: pool of clients to talk to upstream data stores

We will now dive deeper into how we utilize tools to debug the two issues observed (QPS drop and inconsistent memory usage), with particular focus on the memory issue.

Digging into the Why: QPS Drop and Inconsistent Memory Usage

To figure out why the QPS dropped to 0, we used the “tcpdump” tool to obtain following traces

from outgoing packets that show load shedding from thrift servers (tcpdump).

$ sudo tcpdump -nn -i eth0 -A port 9090 -c 400 18:58:06.607828 IP 10.3.42.154.9090 > 10.3.19.45.60262: Flags [P.], seq 1055:1117, ack 7460, win 686, options [nop,nop,TS val 2486825918 ecr 507682922], length 62 E..r^.@.@… .*. ..-#..fK..Y&..&….R1….. .9…B.j…:……………ex.1….b..getMany..loadshedding request…

18:58:06.607828 IP 10.3.42.154.9090 > 10.3.19.45.60262: Flags [P.], seq 1055:1117, ack 7460, win 686, options [nop,nop,TS val 2486825918 ecr 507682922], length 62
E..r^.@.@…
.*.
..-#..fK..Y&..&….R1…..
.9…B.j…:……………ex.1….b..getMany..loadshedding request…

Loadshedding Analysis with Five Whys

At Pinterest, we use “the five whys” as a crucial root cause analysis framework. Here is how we have applied it to this scenario:

The PininfoService outgoing packets contain “load shedding”

Why? This is likely due to hosts not able to process the amount of thrift requestWhy are hosts not able to process? Thrift server has active_req (130K) > max request (~65K), thus load sheddingWhy is there such a high active_req queue? High active_req could either be due to: 1) traffic increase causing the thrift server overloading, or 2) deadlock which cause requests to be queued up and not passed on to the ThriftWorkerPool threads which aim to execute the async_tm_<API> function calls, otherwise QPS stats would have been reported from these API logics.- We are able to rule out 1) since all the system metrics such as CPU, memory are still very low.- Thus, it’s likely a deadlock. But how does deadlock form? We need to investigate and answer this “why”. We turned to the inconsistent memory usage to try to gather more hints into the problem.

Memory Usage Debugging

We have also used the BPF tool to verify that no memory leak was detected. Next, heap dumps were obtained from a host with increasing memory usage via two tools: the jemalloc/jeprof (github) and tcmalloc/gperftools (Gperftool wiki). Both tools provide similar functionalities for heap dumping and generate similar results.

Data from Gperftool is shown below:

Graph 2. Heap profiling snippet of U18 leaf-only host with Gperftool (tcmalloc)

Compared to a healthy U14 host, this extra high percentage of heap consumption from GCPU along with the above “five why analysis” indicates (as hint2) that there might be “something wrong” with the GCPU. Combining this with our discovery described in Part 1 of this series that thread pool runtime parameter tuning has a big impact mitigating the two issues, we are persuaded to believe that thread pools might be related to the issues, particularly GCPU.

Debugging with GDB

During one Test1–2 run, “top” is used to inspect the thread status of a U18 leaf-only host experiencing “QPS drop to 0” and a healthy U14 host. While the healthy U14 host has ThriftWorkerPool, ThriftClientPool, and GCPU threads actively executing tasks and consuming CPU/memory, most threads from the U18 test host are fairly idle.

Then, GDB is used to attach the “stuck” PininfoService process to probe the threads as below.

There are four types of threads that need to be probed:

“pininfo-thrift”, the ThriftIOPool threads. Stack trace shows idle.
“work-pri-3”, the ThriftWorkerPool worker threads managed by ThreadManger. Stack track shows idle.
“g-cpu”, the thread from a GCPU. Stack trace as below.
“io-<upstream>”, the thread from ThriftClientPool. Stack trace as below.

Check one “g-cpu” thread

#7 folly::EventBase::runInEventBaseThreadAndWait(folly::Function<void ()>) (this=this@entry=0x7fdd58737b80, fn=…) at /opt/folly/folly/io/async/EventBase.cpp:618 #8 0x00007fdd91d855a8 in folly::EventBase::runImmediatelyOrRunInEventBaseThreadAndWait(folly::Function<void ()>) (this=0x7fdd58737b80, fn=…) at /opt/folly/folly/io/async/EventBase.cpp:625 #9 0x0000564765af81f6 in common::ThriftClientPool<rockstore::RockstoreAsyncClient, false>::getClient(folly::SocketAddress const&, unsigned int, std::at

The g-cpu thread is executing the code from ThriftRouter, which is blocking waiting to remove a ClientStatus (ie. a thrift upstream client) which requires to enqueue a Deleter task from thrift_client_pool.h to the ThriftClientPool. Apparently, the ThriftClientPool for “io-RockstoreAsync” is not available, thus g-cpu thread is waiting on this.

Check a ThriftClientPool thread “io-RockstoreAsy”

#2 0x00007fdd91c7f559 in folly::ThreadPoolExecutor::joinStoppedThreads (this=0x5647672c73a0 <common::getGlobalCPUExecutor()::g_executor>, n=2) at /usr/include/c++/8/bits/shared_ptr_base.h:1018 #3 0x00007fdd91c80435 in folly::ThreadPoolExecutor::ensureJoined (this=this@entry=0x5647672c73a0 <common::getGlobalCPUExecutor()::g_executor>) at /opt/folly/folly/executors/ThreadPoolExecutor.cpp:393 #4 0x00007fdd91c804f6 in folly::ThreadPoolExecutor::ensureActiveThreads (this=this@entry=0x5647672c73a0 <co

The ThriftClientPool thread is executing a network call for MultiGetRequest, and it tries to enqueue the “process_response” task on the GCPU queue. It executes the GCPU code to check if it needs to join stopped GCPU threads before enqueuing this task. Then it continues to join the stopped GCPU threads and waits on the std::thread to join (blocking!). But, the “stopped’’ GCPU threads can’t be joined successfully because they are waiting on the ThriftClientPool to execute the Deleter to deallocate the thrift client inside the GCPU thread. Thus, GCPU and ThriftClientPool are blocking at waiting on each other, forming a deadlock as shown below.

From left to right are ThreadWorkerPool, ThriftClientPool and GlobalCPUExecutor(Pool). The deadlocks form when threads from ThriftClientPool are blocking waiting on the availability of GCPU; while GCPU is blocking on the ThriftClientPool. — ***Scheme 4***. Deadlock due to dynamic CPUThreadPoolExecutor

Through checking the git blames for ThreadPoolExecutor.cpp, the suspicious “ThreadPoolExecutor::ensureJoined” was added about four years ago by commit (68a6b5), which is between our U14 and U18 folly versions. This pull request added “dynamic CPUThreadPoolExecutor”, which lazily created CPUthreads and joined threads after a timeout and is very likely what happened in our case. Since this dynamic feature could be disabled by “ — dynamic_cputhreadpoolexecutor=false”, we re-run the test (reusing Test1–2 setup in part1) by adding this flag. QPS remained stable, and memory usage increased to ~200GB and remained stable, thus the issues (QPS drop to 0 or inconsistent memory usage) were resolved! By removing this “dynamic CPUThreadPoolExecutor” change, when ThriftClientPool tries to enqueue an “process_response” task and the GCPU queue is full, it won’t block on GCPU’s fullness. The enqueue operation will failover to the ThritClientPool thread itself and complete the enqueue operation, thus preventing the deadlock.

Rollout the U18 Build to One Host of the U14 Production Environment

After obtaining optimistic results from the test environment, the U18 build with the following optimized runtime configurations is rolling out to the canary environment with one production host (Scheme 1 in part1):

— block_on_global_cpu_pool_full=false — global_cpu_pool_queue_size=4194304 — task_expire_time_millis=200 — global_worker_threads=36 — num_server_io_threads=18 — num_worker_threads=36 — dynamic_cputhreadpoolexecutor=false

This single host rollout is meaningful because this host will receive the same traffic as every other U14 production host in the current production environment setup. After successfully running for over two days with comparable performance between U18 to U14 hosts, we have applied in-place upgrade to the whole service and successfully upgraded to U18 without any client impact.

Summary of Part 2

In this article, we adopted tools to take heap dumps with jemalloc or tcmalloc, which further pointed out that the GCPU thread pool is consuming extra heap usage. Finally, we used GDB to pinpoint the deadlock, which was due to a new feature (ie. dynamic CPUThreadPoolExecutor) added in a later Folly version that was used by our U18 docker image. After disabling this feature, we were able resolve the issues and successfully complete the U18 upgrades.

Lessons and Final Result

Looking back, we realized that either results from “re-tune runtime parameters” tests (as hint1) or heap dumps from “memory usage debugging” (as hint2) pointed to issues with the GCPU, while looking forward it takes us lots of effort to pinpoint the GCPU and eventually use GDB to find the deadlock.

By disabling the dynamic CPUThreadPoolExecutor, we unblocked the U18 upgrade. But work remains to be done to resolve following questions:

What causes the GCPU queue to be full in the first place?
What benefits do we lose by disabling the dynamic GCPU queue feature?

Debugging keeps us on a continuous cycle of learning. We’ll be tackling these and other questions in an ongoing effort to continue to strengthen our skills and deepen our understanding of our systems.

To wrap up this case study, with the last service (PininfoService) unblocked and successfully upgraded to U18, combined with other efforts to migrate another ~12K stateful instances serving read-only data, we were able to upgrade a fleet of >24K stateful instances from U14 to U18 without any production-impacting incidents.

Acknowledgement

Key Value Systems team’s U14 to U18 migration is a great effort over several months among engineers: Kangnan Li,Rakesh Kalidindi, Carlos Castellanos, Madeline Nguyen and Harold Cabalic, which in total completes upgrading >12K stateful instances. Special thanks for Bo Liu, Alberto Ordonez Pereira, Saurabh Joshi, Prem Kumar, Rakesh Kalidindi for your knowledgeable inputs and help on this debugging process. Thanks Key Value team manager Jessica Chan, tech lead Rajath Prasadfor your support on this work.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog, and visit our Pinterest Labs site. To view and apply to open opportunities, visit our Careers page.