We're All Just Looking for Connection

We’ve been working to bring components of Quip’s technology into Slack with the canvas feature, while also maintaining the stand-alone Quip product. Quip’s backend, which powers both Quip and canvas, is written in Python. This is the story of a tricky bug we encountered last July and the lessons we learned along the way about being careful with TCP state. We hope that showing you how we tackled our bug helps you avoid — or find — similar bugs in the future!

Our adventure began with a spike in EOFError errors during SQL queries. The errors were distributed across multiple services and multiple database hosts:

Figure: A timeseries graph showing the distribution of errors across DB group types.

Investigation

The stacktrace showed an asyncio.IncompleteReadError, which we translate to EOFError, when reading the response from the database:

File "core/mysql.py", line 299, in __read_result_set
    header = await self.__read_packet(timeout=timeout)
File "core/mysql.py", line 532, in __read_packet
    header = await self.conn.read_exactly(4, timeout=timeout)
File "core/runtime_asyncio.py", line 1125, in read_exactly
    raise EOFError() from None
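
To make the last frame concrete, here is a minimal sketch of how an asyncio.IncompleteReadError can be translated into an EOFError. It is not Quip's actual read_exactly: the function signature, the use of asyncio.wait_for for the timeout, and the demo helper are all assumptions made for illustration. Only asyncio.StreamReader.readexactly and asyncio.IncompleteReadError come from the standard library.

```python
import asyncio


async def read_exactly(reader: asyncio.StreamReader, n: int, timeout: float) -> bytes:
    """Hypothetical wrapper: read exactly n bytes or raise EOFError."""
    try:
        return await asyncio.wait_for(reader.readexactly(n), timeout=timeout)
    except asyncio.IncompleteReadError:
        # The stream reached EOF before n bytes arrived, so report it as EOFError.
        raise EOFError() from None


async def demo() -> str:
    reader = asyncio.StreamReader()
    reader.feed_data(b"\x01\x02")  # only 2 of the 4 requested header bytes arrive
    reader.feed_eof()              # then the peer closes the connection
    try:
        await read_exactly(reader, 4, timeout=1.0)
    except EOFError:
        return "EOFError"
    return "ok"


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the reader sees end-of-stream while waiting for a 4-byte packet header, which is the same shape of failure the stack trace reports.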

Here’s the relevant code, which had not been touched recently, along with some associated baffled commentary:

async def _perform_query_locked(...) -> core.sql.Result:
    ...
    if not self.conn.is_connected():
        await self.__connect(...)
    await self.__send_command(...)
    result_set = await self.__read_result_set(...)

A GitHub comment from Brett: "wow, insane, before every query we check for EOFError preemptively and reconnect if needed, but the stacktrace shows us as failing in the call to __read_result_set(), so we must've already reconnected..."

There are a few places where we close our connection to the database, e.g. if we see one of a certain set of errors. One initial hypothesis was that another Python coroutine closed the connection for its own reasons, after we issued the query but before we read the response. We quickly discarded that theory since connection access is protected with an in-memory lock, which...
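
The reasoning behind discarding that hypothesis can be illustrated with a small sketch. Everything here is hypothetical: FakeConnection, its methods, and the assumption that the lock is held across the whole check, send, and read sequence are stand-ins for illustration, not Quip's code. Under that assumption, a second coroutine that wants to close the connection has to wait until the query has finished.

```python
import asyncio


class FakeConnection:
    """Hypothetical stand-in for a database connection (not Quip's class)."""

    def __init__(self) -> None:
        self.connected = True
        self.lock = asyncio.Lock()  # in-memory lock guarding connection access

    async def query(self) -> str:
        async with self.lock:
            if not self.connected:
                self.connected = True  # stands in for reconnecting
            await asyncio.sleep(0)     # "send command": yields to other coroutines
            await asyncio.sleep(0)     # "wait for response": yields again
            # close() cannot have run in between, because it needs the lock.
            return "ok" if self.connected else "EOF"

    async def close(self) -> None:
        async with self.lock:
            self.connected = False


async def demo() -> str:
    conn = FakeConnection()
    query_task = asyncio.create_task(conn.query())
    close_task = asyncio.create_task(conn.close())  # tries to close mid-query
    result = await query_task
    await close_task
    return result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Even though the closing coroutine gets scheduled while the query is between its send and read steps, it blocks on the lock, so the query still sees a connected socket when it reads.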
