Finding and optimizing N+1 queries on a relational database

At Mixpanel, relational databases play a key role in the architecture of our web application. They store essential data, including user accounts, settings, and reports information. Given the database’s critical function, it is in a constant state of change, with many different teams working around. This interaction is vital for our dynamic development, but it also presents challenges.

One major challenge is the N+1 query problem. Essentially, what looks like a straightforward data fetch can trigger an avalanche of unnecessary queries, creating a domino effect that slows everything down and dampens the user experience. But here’s the good part, our team invested time and work on this and saw some impressive gains, boosting performance up to 10x in certain areas. By pinpointing and beefing up those slower endpoints, we’ve significantly upped our app’s performance and responsiveness.

How to find N+1 queries?

Most of our API endpoint management and relational database interactions leverage the Django framework. In this section, we’ll use a small Django API specifically created for this blog post with two simple models. You can find the complete code on this github repository.

A Board has many Report(s). Our API exposes this data through a /boards endpoint, which returns a JSON object like this.

{
"<board_id>": {
"id": <board_id>,
"name": <board_name>,
"reports": [
{
"id": <report_id>,
"name": <report_name>,
},
]
}
}

We populated the database with 2,000 boards, each one having 10 reports, and since our goal is to identify and fix N+1 query issues, we intentionally wrote the view code in a particular way.

def all(request): response = {}

for board in list(Board.objects.all()):

response[board.id] = {
"id": board.id,
"name": board.name,
"reports": [], }

for report in board.reports.all():

response[board.id]["reports"].append({
"id": report.id,
"name": report.name })

return JsonResponse(response)

After querying this endpoint a couple of times, we got an average response time of 3.21 seconds, which is not ideal. Let’s try to improve this.

Adding some logs

Django’s default logging is pretty basic, it shows the HTTP method, endpoint path and response code for each API call. It is a good first step, but it doesn’t give us any insight over what happened during the execution.

Which request is faster or slower? How many queries is each one doing? We can’t know.

⚠️ Going forward, we’ll implement code enhancements to Django logs, which might affect performance. Therefore, we advise using these tools and configurations exclusively in the development environment.

Django also give us the option of logging all the SQL queries that occurs in a request, we just have to add the following to our settings.py file.

LOGGING = {

'version': 1,

'filters': {
'require_debug_true': {
'()': 'django.utils.log.RequireDebugTrue', } },

'handlers': {

'console': {
'level': 'DEBUG',
'filters': ['require_debug_true'],
'class': 'logging.StreamHandler', } },

'loggers': {

'django.db.backends': {
'level': 'DEBUG',
'handlers': ['console'], } }

}

These are the logs for our requests. We can quickly see that there is an N+1 query problem somewhere, but it’s hard to know exactly where.

Sure, we can go this way if we’re testing one specific endpoint, but usually APIs get multiple different calls all the same time.

So, let’s go one step forward…

Django Querycount

Django Querycount is a powerful library that adds a summary table after each request. The table displays information about the number of queries performed during each API request. The library has a lot of config options (and we encourage you to check them out), for our case, we are going to use one called DISPLAY_DUPLICATES, which tells you what was the most duplicated query on each request.

⚠️ The library’s GitHub page clearly states, “Notice that django-querycount is hard coded to work only in DEBUG mode set to true,” indicating its use is restricted to development mode only.

QUERYCOUNT = {

"DISPLAY_DUPLICATES": 1

}

Let’s see how this looks on our project.

Look at these colors, it makes so much easier to spot a problematic endpoint. You can also customize the thresholds for each color on the settings.

We identified where the problem is, and we also learned that we are looking for a SELECT query on the “boards_reports” table, which in Django world is the Report model.

Debugging

Now that we know what endpoint to look at, we can use a tool like postman to send only that request.

(Note: Did you know you can copy requests from your browser to postman? If you didn’t, you should check this guide).

One way to go would be to add a python breakpoint until we find the line that triggers the query. This is fine and would work, but it can be very time consuming, especially for projects with lots of utils and libraries, so let’s improve our logging tools a bit more.

Our starting point, the default Django log output

Better formatting

Reading SQL queries can get difficult very quickly, specially when we start adding JOINS and nested queries, that’s why after some investigation we decided to try this very useful post, which improved our logs drastically by adding a custom log formatter. This formatter adds two big changes

SQL formatting, using sqlparse
Syntax highlight, thanks to Pygments

Printing the line of code

One last thing we added was to also log the line of code which triggers the query. It was relatively easy to get from python’s traceback, and the added value was huge.

With these two additions, the same log looks like this.

The custom formatter and the updated settings.py code are just below.

import logging
import re
import traceback

def get_line():

try:
last_line = [line for line in traceback.format_stack() if "myproject" in line][-3]
pattern = r"\"(.*?)\", line (\d+),.*?\n(.*?)\n"
match = re.search(pattern, last_line)
if match:
file_path, line_number, next_line = match.groups()
return f"{file_path}:{line_number}\n {next_line}"
except:
return ""

class SQLFormatter(logging.Formatter):

def format(self, record):

try:

import pygments
from pygments.lexers import SqlLexer
from pygments.formatters import TerminalTrueColorFormatter
except ImportError:
pygments = None

try:

import sqlparse
except ImportError:
sqlparse = None sql = record.sql.strip()

if sqlparse:

sql = sqlparse.format(sql, reindent=True)

if pygments:

sql = pygments.highlight(

sql, SqlLexer(), TerminalTrueColorFormatter(style="default")

) statement = sql line_executing_code = get_line()

if line_executing_code:

statement = f"{line_executing_code}\n{sql}" record.statement = statement

return super(SQLFormatter, self).format(record)

LOGGING = {

'version': 1,

'filters': {
'require_debug_true': {
'()': 'django.utils.log.RequireDebugTrue', } },

'formatters': {

'sql': {
'()': 'formatters.SQLFormatter',
'format': '%(asctime)s [%(duration).3f] %(statement)s', }, },

'handlers': {

'sql': {
'class': 'logging.StreamHandler',
'formatter': 'sql',
'level': 'DEBUG', }, },

'loggers': {

'django.db.backends': {
'level': 'DEBUG',
'handlers': ['sql'], } }

}

The fix

Once the problem is identified, Django provides a very simple solution. In the vast majority of cases, we can prefetch the data we need using either select_related or prefetch_related. In our example, we did this in a new endpoint /prefetch which looks like this.

def all_prefetch(request): response = {}

boards = Board.objects.prefetch_related(Prefetch('reports', to_attr='reports_list'))

for board in list(boards):

response[board.id] = {
"id": board.id,
"name": board.name,
"reports": [], }

for report in board.reports_list:

response[board.id]["reports"].append({
"id": report.id,
"name": report.name })

return JsonResponse(response)

This simple change improved the response time in our sample project from 3.2s to 0.8s, that is 25% of the original time, or 2.56 times faster!

Real Mixpanel cases

We have been using this exact same methodology, and the results have been great. To this date, we have already optimized a good amount of endpoints, some of them very critical.

Case 1

While working on a new feature, we noted that one of the critical endpoints we use to load reports was not performing well.

Endpoint summary table before optimization

After following the steps we previously described, we managed to decrease the number of queries from 5000+ to just 21.

A few days later, when we reviewed this endpoint again to assess its impact, we discovered that the response time had improved by 1.5 to 2.5 times for all our customers.

Case 2

Similarly, while working on a different project, one of our engineers spotted a second N+1 query on our boards endpoint, one of the most heavily accessed endpoints at Mixpanel.

We knew something had to be wrong, so, again, using the tools we described in this post, we were able to quickly reduce the number of queries from 9000 to 30, just fixing N+1 query cases.

In this case the problem had a greater effect on specific types of projects, resulting in a significant impact for those clients.

For the most affected customers, the response time improved from ~22.5s to ~2.2s, that is approximately 10 times faster. For the rest of customers, the response times went from ~3s to ~0.8s, almost 4 times better.

Final thoughts

N+1 queries can significantly impact latency and may be difficult to identify early in a project. Fortunately, there are many tools we can add to our development environment which make this task much easier.

Now at Mixpanel, our new tools enable us to identify these issues at an early stage, and we’re also making ongoing improvements to our monitoring system to detect these problems more quickly.