I have been experimenting with FastAPI recently, a Python API framework that describes itself as "high performance, easy to learn, fast to code, ready for production".
One of the features I wanted my project to have is to be fully asynchronous, from the app server to the SQL requests. As the API is mostly I/O bound, this would allow it to handle many concurrent requests with a single server process, instead of starting a thread per request, as is commonly seen with Flask/Gunicorn.
However, this poses a challenge when it comes to profiling the code and interpreting the results.
The limitations of cProfile when profiling asynchronous code
For example, the following graph representation was generated from a cProfile profile recording 300 consecutive calls to a single API endpoint, with an associated get_character handler.
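For reference, here is a minimal sketch of one way such a profile could be recorded (not necessarily how the graph above was produced): running the uvicorn server under cProfile and dumping the stats once the server is stopped.

import cProfile

import uvicorn

from app.main import app  # hypothetical import path for the FastAPI application

if __name__ == "__main__":
    # profile the whole server process; stop it with Ctrl+C once the 300 calls are done
    with cProfile.Profile() as profiler:
        uvicorn.run(app, host="127.0.0.1", port=8000)
    profiler.dump_stats("profile.out")  # visualize with e.g. snakeviz or gprof2dot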
Zooming in, we notice 2 things about the get_character span:
- its ncalls value is 9605, when we really called it 300 times
- it is free-floating, completely unlinked from any other span
As an asynchronous function is "entered" and "exited" by the event loop at each await clause, every time the event loop re-enters a function, cProfile sees this as an additional call, thus causing seemingly larger-than-normal ncalls numbers. Indeed, we await every time we perform an SQL request, commit or refresh the SQLAlchemy session, or do anything else inducing asynchronous I/O.
Secondly, the get_character span appears to be free-floating because it is executed by the Python event loop, outside of the main call stack that cProfile tracks.
This means that our good old faithful cProfile might not cut it for this inherently asynchronous server, and that we need a more modern profiler with built-in asynchronous support if we really want to make sense of where time is spent during a request.
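To see the first point in isolation, here is a small self-contained sketch (the names are hypothetical and unrelated to the actual application code) where a coroutine awaited 300 times ends up with an inflated ncalls figure under cProfile:

import asyncio
import cProfile
import pstats


async def get_character():
    # each await hands control back to the event loop; every resumption of the
    # coroutine registers as an extra "call" in cProfile's bookkeeping
    for _ in range(3):
        await asyncio.sleep(0)


async def main():
    for _ in range(300):
        await get_character()


cProfile.run("asyncio.run(main())", "async_profile.out")
stats = pstats.Stats("async_profile.out")
# the ncalls column for get_character comes out well above the 300 logical calls
stats.sort_stats("cumulative").print_stats("get_character")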
Enter pyinstrument!
pyinstrument is a statistical profiler, contrary to cProfile, which is deterministic. To quote the Python documentation:
Deterministic profiling is meant to reflect the fact that all function call, function return, and exception events are monitored, and precise timings are made for the intervals between these events (during which time the user’s code is executing). In contrast, statistical profiling [...] randomly samples the effective instruction pointer, and deduces where time is being spent. The latter technique traditionally involves less overhead (as the code does not need to be instrumented), but provides only relative indications of where time is being spent.
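As a quick illustration of the sampling approach, here is a minimal sketch (the function is hypothetical) profiling a synchronous piece of code with pyinstrument and printing the resulting call tree:

from pyinstrument import Profiler


def slow_function():
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total


profiler = Profiler(interval=0.001)  # sample the call stack every 1ms
profiler.start()
slow_function()
profiler.stop()
profiler.print()  # dump the sampled call tree to the terminal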
Second, it advertises native support for profiling asynchronous Python code:
pyinstrument can profile async programs that use async and await. This async support works by tracking the context of execution, as provided by the built-in contextvars module.
When you start a Profiler with the async_mode enabled or strict (not disabled), that Profiler is attached to the current async context.
When profiling, pyinstrument keeps an eye on the context. When execution exits the context, it captures the await stack that caused the context to exit. Any time spent outside the context is attributed to the await that halted execution.
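In practice, this means starting the Profiler from within the running async context. Here is a minimal sketch (the coroutines are hypothetical, standing in for real asynchronous I/O):

import asyncio

from pyinstrument import Profiler


async def fetch_character():
    await asyncio.sleep(0.05)  # stands in for an asynchronous SQL query


async def main():
    # the profiler is started inside the async context it should be attached to
    with Profiler(interval=0.001, async_mode="enabled") as profiler:
        for _ in range(10):
            await fetch_character()
    profiler.print()


asyncio.run(main())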
This should allow us to get a sensible picture of where time is spent during the lifespan of a FastAPI request, while also skipping the spans that are too fast to be profiled.
Integrating pyinstrument with FastAPI
We rely on the FastAPI.middleware decorator to register a profiling middleware (only enabled if the PROFILING_ENABLED setting is set to True) in charge of profiling a request if the profile=true query argument is passed by the client.
By default, this middleware will generate a JSON report compatible with Speedscope, an online interactive flamegraph visualizer. However, if the profile_format=html query argument is passed, then a simple HTML report will be dumped to disk instead.
from typing import Callable

from fastapi import FastAPI, Request
from pyinstrument import Profiler
from pyinstrument.renderers.html import HTMLRenderer
from pyinstrument.renderers.speedscope import SpeedscopeRenderer


def register_middlewares(app: FastAPI):
    ...
    if app.settings.PROFILING_ENABLED is True:
        @app.middleware("http")
        async def profile_request(request: Request, call_next: Callable):
            """Profile the current request

            Taken from https://pyinstrument.readthedocs.io/en/latest/guide.html#profile-a-web-request-in-fastapi
            with small improvements.
            """
            # we map a profile type to a file extension, as well as a pyinstrument profile renderer
            profile_type_to_ext = {"html": "html", "speedscope": "speedscope.json"}
            profile_type_to_renderer = {
                "html": HTMLRenderer,
                "speedscope": SpeedscopeRenderer,
            }
            # if the `profile=true` HTTP query argument is passed, we profile the request
            if request.query_params.get("profile", False):
                # The default profile format is speedscope
                profile_type = request.query_params.get("profile_format", "speedscope")
                # we profile the request along with all additional middlewares, by interrupting
                # the program every 1ms and recording the entire stack at that point
                with Profiler(interval=0.001, async_mode="enabled") as profiler:
                    response = await call_next(request)
                # we dump the profiling into a file
                extension = profile_type_to_ext[profile_type]
                renderer = profile_type_to_renderer[profile_type]()
                with open(f"profile.{extension}", "w") as out:
                    out.write(profiler.output(renderer=renderer))
                return response
            # Proceed without profiling
            return await call_next(request)
You can browse the project code to see how the middleware is wired into the application itself.
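To trigger a profiled request, any HTTP client will do; here is a sketch using requests against a hypothetical endpoint path (adapt it to the actual route served by get_character):

import requests

response = requests.get(
    "http://localhost:8000/characters/1",  # hypothetical path
    params={"profile": "true", "profile_format": "speedscope"},
)
response.raise_for_status()
# the middleware writes profile.speedscope.json in the server's working directory,
# ready to be loaded into https://www.speedscope.app/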
Let's see the results
HTML profile
Speedscope profile
We see pretty clearly the different SQL requests being performed (the execute spans), the different await clauses in the code causing the event loop to pause the execution, and that most of the request time is spent in SQL requests.
Finally, using this setup, I was able to observe the effects of replacing the json stdlib library with orjson when deserializing JSON data from the database, and to speed up this endpoint by a couple of percent very easily.
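For reference, here is a rough micro-benchmark sketch of that swap (the payload and iteration count are illustrative, not the article's measurements):

import json
import timeit

import orjson

payload = json.dumps({"id": 1, "name": "character-1", "episodes": list(range(50))})

stdlib_time = timeit.timeit(lambda: json.loads(payload), number=100_000)
orjson_time = timeit.timeit(lambda: orjson.loads(payload), number=100_000)
print(f"json: {stdlib_time:.3f}s, orjson: {orjson_time:.3f}s")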
Comments
Hi Balthazar! Thanks for the blog entry, it was actually quite helpful. There are not many resources around improving performance in FastAPI, so the content is really appreciated. I've experienced issues when running FastAPI in small servers, usually around serialization/deserialization. Currently I'm migrating to Pydantic2 which should improve things. Something that really helped me was using custom JSON responses instead of the default one, to avoid the revalidation of data before returning a response.
Thanks a lot for this article
Thanks a lot, you saved me a lot of time. All best
Thanks for sharing this.
I don't know if this is intentional, but in your code, the call_next method is called twice when profiling is enabled. Is this intentional? IMO an else statement is missing in your code, so that in case profiling is enabled, the function call_next is just called once.
Hi Fabio! Indeed, this was a mistake, thanks for spotting it! I was able to reproduce locally and I've since pushed a correction to the article. Thanks again!
Hi Balthazar! This blog post was very useful for CPU profiling in Python. Do you have any idea how we can similarly do memory profiling in FastAPI, e.g. using tools like Memray?
I'm not sure, to be honest, as I'm not too familiar with the memory profiling tooling in Python. I'd probably start researching memory profiling for async Python webservices, and see whether I could integrate whatever I find with FastAPI, the same way I did it for CPU profiling. Maybe something already exists?
Hi Nicolas!
If you want to improve FastAPI performance even more, you can replace JSONResponse with ORJSONResponse.
Thanks for sharing this. It was really helpful!