The Query Strikes Again

On Thursday, 12 Oct. 2022, the EMEA part of the Datastores team (the team responsible for Slack’s database clusters) was having an onsite day in Amsterdam, the Netherlands. We were sitting together for the first time since new engineers had joined the team when suddenly a few of us were paged: the number of failed database queries had increased. We stopped what we were doing and stepped in to solve the problem. After investigating the issue with other teams, we discovered that a long-running asynchronous job was purging a large number of database records, which overloaded the database cluster. The JobQueue team, responsible for asynchronous jobs, realized that we couldn’t stop the job, but we could disable it completely (this operation is called shimming). This meant that the running jobs wouldn’t stop, but that no new jobs would be processed. The JobQueue team installed the shim, and the number of failed database queries dropped off. Luckily, this incident didn’t have an impact on our customers.
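To make the shimming behaviour concrete, here is a minimal sketch of the idea, not Slack's actual JobQueue implementation, which the text does not describe. The class, method names, and the choice to drop (rather than requeue) a shimmed job are all assumptions made for illustration. The key point it shows is that the shim is checked only when a job is picked up, so a job that is already running is never interrupted.

```python
import queue
import threading


class JobQueue:
    """Toy in-memory job queue with per-job-type shims (hypothetical design)."""

    def __init__(self):
        self._pending = queue.Queue()
        self._shimmed = set()
        self._lock = threading.Lock()

    def enqueue(self, job_type, payload):
        self._pending.put((job_type, payload))

    def install_shim(self, job_type):
        # After this call, no new job of this type will be started.
        with self._lock:
            self._shimmed.add(job_type)

    def is_shimmed(self, job_type):
        with self._lock:
            return job_type in self._shimmed

    def run_next(self, handlers):
        """Pick up one job; return 'ran', 'skipped', or 'empty'."""
        try:
            job_type, payload = self._pending.get_nowait()
        except queue.Empty:
            return "empty"
        # The shim is consulted only here, at pickup time. A handler that
        # is already executing is not cancelled by install_shim().
        if self.is_shimmed(job_type):
            return "skipped"  # assumption: shimmed jobs are dropped
        handlers[job_type](payload)
        return "ran"
```

In this sketch, a worker loop would call `run_next` repeatedly; installing a shim for the purge job type lets in-flight work finish while preventing further purge jobs from adding load.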


The very next day, the Datastores EMEA team got the same page. After looking into it, the team discovered that the problem was similar to the one from the day before, but worse. We took similar actions to keep the cluster in working condition, but an edge-case bug in the Datastores automation caused a failure to handle a flood of requests. Unfortunately, this incident did impact some customers, who weren’t able to load Slack. We disabled specific features to reduce the load on the cluster, which gave it room to recover. After a while, the job finished, and the database cluster operated normally again.
