Node.js and the Tale of Worker Threads
I do not usually read code when dealing with production incidents, as it is one of the slower ways to understand and mitigate what is happening. But on that Friday night, I was glad I did.
I was about to start another session of Elden Ring (a video game in which pretty much everything is trying to kill the player) when I was paged with the following: "campaign service is consuming all resources we throw at it". I joined a call and was told that the observed impact was caused by one of the campaign service's dependencies: the translation service, for which my on-call rotation was responsible. The translation service was indeed very slow to respond (its p99 latency had increased from 100ms to 500ms) and its error rate had gone from 0 to 4%. Still, this did not really explain why the service calling us (the campaign service) was on a cloud resource consumption spree.
We started with distributed tracing, but the campaign service was not instrumented, so we could not get much out of our tracing tooling. We did see some "context cancelled" error messages on our request spans, which usually means that the connection was unexpectedly closed from the client side. We quickly moved on to logging and, sure enough, found the same evidence in the translation service logs: java.lang.IllegalStateException: Response is closed
We are relatively well instrumented at Zalando in terms of operations, especially with built-in Kubernetes dashboards. Using our Kubernetes API Monitoring Clients dashboard we confirmed that the calling service (the campaign service) was misbehaving an...