Debugging a PininfoService Deadlock in the Ubuntu18 Upgrade: Part 2 of 2
Solving Engineering Problems as Doing Research
Kangnan Li | Software Engineer, Key Value Systems
unlock deadlock for PininfoService
This is part 2 of a two-part blog series on deep systems debugging techniques, applied to a real-world scenario: upgrading our stateful systems to U18.
In part 1, we traced the two observed issues (QPS drop and inconsistent memory usage) to the PininfoService leaf layer. In this article, we narrow the issue down further to GlobalCPUExecutor (GCPU) and eventually to its root cause: a deadlock.
To better understand how requests flow in and out of PininfoService, here is a brief summary of the threads (or pools) in PininfoService, in the order they are used (also refer to Thrift internals to learn how the fbthrift server works):
- Thrift Acceptor Thread: accepts connections from clients
- ThriftIOPool: processes data in and out over the established connections between PininfoService and the clients that send it requests
- ThriftWorkerPool: the thread manager provided in the PininfoService logic to process async_tm_ function calls
- GlobalCPUExecutor: a global CPU pool to delegate the heavy lifting work, such as processing the response from upstream data stores
- ThriftClientPool: pool of clients to talk to upstream data stores
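To make the hand-offs between these pools concrete, here is a minimal Python sketch of a request moving from a worker pool to a client pool and then to a global CPU pool. It is an illustration only, not the PininfoService or fbthrift implementation: the pool sizes, the function names (`fetch_from_upstream`, `process_response`, `handle_request`, `serve`), and the payload shape are all hypothetical, and the acceptor thread and IO pool are left out. The handler also blocks on `.result()` for simplicity, which a real asynchronous server would likely avoid.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for three of the pools listed above; sizes are arbitrary.
worker_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="ThriftWorkerPool")
client_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="ThriftClientPool")
global_cpu_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="GlobalCPUExecutor")


def fetch_from_upstream(pin_id):
    """Pretend call to an upstream data store; runs on a client-pool thread."""
    return {"pin_id": pin_id, "raw": f"payload-{pin_id}"}


def process_response(raw):
    """Heavy-lifting step; runs on a global-CPU-pool thread."""
    return {
        "pin_id": raw["pin_id"],
        "info": raw["raw"].upper(),
        # Record which thread did the processing so the hand-off is visible.
        "processed_on": threading.current_thread().name,
    }


def handle_request(pin_id):
    """Runs on a worker-pool thread, in the role of a request handler."""
    raw = client_pool.submit(fetch_from_upstream, pin_id).result()
    return global_cpu_pool.submit(process_response, raw).result()


def serve(pin_ids):
    """Dispatch each request to the worker pool and collect results in order."""
    futures = [worker_pool.submit(handle_request, p) for p in pin_ids]
    return [f.result() for f in futures]
```

Calling `serve([1, 2, 3])` returns one result per request, and the `processed_on` field shows that the response processing ran on a `GlobalCPUExecutor` thread rather than on the worker thread that received the request.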
We will now dive deeper into how we utilize tools to debug the two issues observed (QPS drop and inconsistent memory usage), with particular focus on the memory issue.