From Python3.8 to Python3.10: Our Journey Through a Memory Leak

Image generated with ChatGPT (OpenAI), 2025.

When working with Python, memory management often feels like a solved problem. The garbage collector quietly does its job, and unlike C or C++, we rarely think about malloc or free. That doesn't mean Python is free of memory leaks, though. Reference cycles, unreleased resources such as connection pools, global caches, and the like can slowly inflate your process's memory footprint. You might not notice it at first, until your worker starts OOM-ing, latency creeps up, or container restarts become mysteriously frequent.

In this post, we’ll share the story of a real-world memory leak we encountered during a Python upgrade — how we discovered it, the tools and techniques we used to investigate, and the lessons we learned.

Back in the summer of 2024, we had an initiative at Lyft to upgrade all of our Python services from 3.8 to 3.10, since 3.8 was scheduled to reach end-of-life by the end of 2024. You can find more details here on how our awesome Backend Foundations team at Lyft does Python upgrades across hundreds of repos at scale. The upgrade involved two phases: the first was to upgrade all dependencies to be Python 3.10 compatible, and the second was to upgrade the services themselves to Python 3.10. The dependency upgrades went smoothly for all services, and then the phase to move all services to Python 3.10 rolled out. While almost every service ran Python 3.10 smoothly, there was one service for which the upgrade in the test environment caused a flurry of latency spikes, resulting in timeouts for downstream services.

Increasing 5xx caused by timeouts after upgrading to Python 3.10

After profiling the APIs with increased latency using stats, we found that the source of the latency was repository queries to the Dynamo tables. Specifically, we had pynamodb-based repository queries that spin up a bunch of greenlets to fetch data from multiple tables and combine the results, and these were showing increased timeouts. The individual queries themselves were fine; it was the gevent thread join that took the longest, causing the worker to time out (default = 30 seconds).

Individual Dynamo queries taking < 100 ms to finish

Gevent thread join takes 30 secs
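For context, the fan-out pattern in question looks roughly like the sketch below. The function names and the simulated queries are hypothetical; the real code issues pynamodb queries against multiple tables.

# Illustrative sketch of the greenlet fan-out (names and simulated queries are made up)
import gevent

def fetch_profile(user_id):
    gevent.sleep(0.05)   # stands in for a pynamodb query to one table
    return {"user_id": user_id, "profile": "..."}

def fetch_preferences(user_id):
    gevent.sleep(0.05)   # stands in for a pynamodb query to another table
    return {"user_id": user_id, "preferences": "..."}

def get_user_data(user_id):
    jobs = [
        gevent.spawn(fetch_profile, user_id),
        gevent.spawn(fetch_preferences, user_id),
    ]
    # The individual queries were fast; it was this join that ran into the
    # 30-second worker timeout after the upgrade.
    gevent.joinall(jobs, timeout=30)
    return [job.value for job in jobs]

print(get_user_data(42))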

The other interesting thing we found was memory consumption slowly creeping up with time in all of the pods.

Memory usage % of all pods

At this point, we weren't sure whether something in gevent/greenlet was causing the memory leak, or whether the memory leak was causing the latency, since reduced memory availability can increase page fetches from disk. We first checked whether the gevent monitoring thread had detected any event-loop blocks that could explain these timeouts, and then pivoted to finding the root cause of the memory leak.
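As an aside, the gevent monitoring thread is opt-in. A minimal sketch of enabling it for this kind of check looks roughly like the following; the blocking threshold shown is illustrative, and these settings should be applied early, before the hub starts.

# Enable gevent's monitor thread so it can report greenlets that block the event loop
import gevent

gevent.config.monitor_thread = True      # start the background monitoring thread
gevent.config.max_blocking_time = 0.1    # report anything blocking the hub for > 100 ms

The same settings can also be supplied through the GEVENT_MONITOR_THREAD and GEVENT_MAX_BLOCKING_TIME environment variables.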

Fortunately, Lyft has an internal memory profiling tool based on tracemalloc. To capture the memory trace for a given gunicorn process, we registered the worker process to listen for the USR2 signal during the application initialization phase.

# app/__init__.py

MemoryProfiler().register_handlers()

# mem_profiler.py (pseudocode)

import signal
import tracemalloc
from types import FrameType
from typing import Generator, Optional


class MemoryProfiler:
    def __init__(self) -> None:
        self._state_machine = self._profiling_state_machine()

    def register_handlers(self) -> None:
        # Register gunicorn worker to listen to USR2 to dump traces
        signal.signal(signal.SIGUSR2, self.handle_signal)

    def handle_signal(self, signum: int, frame: Optional[FrameType]) -> None:
        next(self._state_machine)

    def _profiling_state_machine(self) -> Generator[None, None, None]:
        while True:
            try:
                self.start_tracing()  # tracemalloc.start()
                self.memory_dump()    # Create snapshot1
                yield
                self.memory_dump()    # Create snapshot2, compare with snapshot1,
                                      # and dump the difference to a file
            finally:
                if tracemalloc.is_tracing():
                    tracemalloc.stop()
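The profiler internals are Lyft-specific, but the snapshot/compare step boils down to something like the sketch below; the function name is illustrative, not the library's actual API.

# Roughly what "create snapshot2, compare with snapshot1, dump the difference" means
from typing import Optional
import tracemalloc

def take_and_compare(previous: Optional[tracemalloc.Snapshot]) -> tracemalloc.Snapshot:
    snapshot = tracemalloc.take_snapshot()
    if previous is not None:
        # Largest allocation growth since the previous snapshot, grouped by source line
        for stat in snapshot.compare_to(previous, "lineno")[:10]:
            print(stat)
    return snapshot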

Let’s start the tracing!

Ok, now that we had the memory profiler set up, we were ready for some tracing to find the source of the leak. To start the tracing, we send a USR2 signal to the gunicorn worker process in the K8s pod, and send the signal again after some time interval to capture the stack traces with the highest memory usage.

ps aux

Initial process list before sending USR2 signal

Now, we will send a USR2 signal to the worker with PID 12:

kill -USR2 12

Upon checking the process list again….

ps aux

Sending USR2 kills the gunicorn worker with PID=12

we observed that the gunicorn process we planned to trace got killed 🙁

It took several hours of debugging, and a journey back to one of my favorite classes, to find the root of the issue: preload. To understand why preload caused the process to be killed, we first need to understand how gunicorn works.

Gunicorn

Gunicorn works on the pre-fork model. There is a leader process which forks a bunch of workers. There are two ways to fork the workers:

No Preload

Gunicorn forked workers with no preload

When the leader process forks a worker, the worker has its own application code. This results in the worker process having a larger memory footprint than the leader.

smem -a --sort=pid -k

Service with no preload: Worker PSS mem = ~203MB

With Preload

Gunicorn forked workers with preload

Preload is a memory optimization based on the concept of copy-on-write. Essentially, the workers share the imports and application code with the leader and only modified pages are written to the worker’s memory.

smem -a --sort=pid -k

Service with preload: Worker PSS mem reduced to ~41MB!!
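Preload itself is a one-line toggle in the gunicorn config. A minimal sketch, where the worker count and worker class are illustrative:

# gunicorn.conf.py
preload_app = True        # import the app in the leader once; workers share pages copy-on-write
workers = 4
worker_class = "gevent"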

So how does preload play a role with USR2 signal killing the process?

If you remember, we registered the signal during the app initialization by calling register_handlers().

# app/__init__.py

MemoryProfiler().register_handlers()

# mem_profiler.py

class MemoryProfiler:
    ...

    def register_handlers(self) -> None:
        # Register gunicorn worker to listen to USR2 to dump traces
        signal.signal(signal.SIGUSR2, self.handle_signal)

Since the app had preload=True, only the leader process registered the USR2 handler for tracing; the worker processes never did because of copy-on-write. And since the default action for an unhandled SIGUSR2 is to terminate the process, any kill -USR2 sent to a worker actually kills it!
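This is easy to demonstrate outside gunicorn. Here is a standalone sketch (not our service code) showing that a POSIX process without a USR2 handler is simply terminated by the signal:

# SIGUSR2's default disposition is to terminate the process that receives it
import os
import signal
import time

pid = os.fork()
if pid == 0:
    # Child: uncomment the next line and it survives the signal instead
    # signal.signal(signal.SIGUSR2, lambda signum, frame: None)
    time.sleep(5)
    os._exit(0)
else:
    time.sleep(0.5)
    os.kill(pid, signal.SIGUSR2)
    _, status = os.waitpid(pid, 0)
    print("child terminated by SIGUSR2:",
          os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGUSR2)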

Now that we had figured out that preload caused the process to be killed, we turned off the preload option and started the tracing again.

ps aux

Initial process list before sending USR2 signal

kill -USR2 12

Successful USR2 signal not killing the gunicorn worker

The worker does not get killed!

We created a script which iterates through all the K8s pods, sends a USR2 signal to all the workers to start the tracing, and re-sends the signal after a certain time interval to stop it. The traces contained a lot of false positives, since the dumps include allocations which are not necessarily the source of the leak but simply have not been garbage collected yet.
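Our script is internal, but its shape is roughly the sketch below; the deployment label and the way worker PIDs are discovered are assumptions, not our exact implementation.

# Sketch: send USR2 to every gunicorn worker in every pod, wait, then send it again
import subprocess
import time

LABEL = "app=my-service"   # hypothetical label selector

def pods(label):
    out = subprocess.run(
        ["kubectl", "get", "pods", "-l", label, "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()

def signal_workers(pod):
    # Signal the children of the gunicorn leader (i.e. the workers) inside the pod
    subprocess.run(
        ["kubectl", "exec", pod, "--", "sh", "-c",
         "pkill -USR2 -P $(pgrep -o -f gunicorn)"],
        check=True,
    )

for pod in pods(LABEL):
    signal_workers(pod)    # first signal: start tracing, take snapshot 1
time.sleep(600)            # let the leak accumulate
for pod in pods(LABEL):
    signal_workers(pod)    # second signal: snapshot 2, dump the diff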

The most interesting (and common) memory dump trace after sifting through hundreds of them was the following:

Stack trace dump from memory profiler

If you recall our initial observation from the graph below, we knew that the increase in timeouts had something to do with pynamodb and gevent/greenlets, since we saw thread joins taking a long time:

Our initial observation of gevent thread join takes 30 secs

The stack trace, combined with the graph above, narrowed the issue down to pynamodb/botocore. After digging online, we found the following issue with urllib3 v1.26.16: essentially, in a highly concurrent environment using gevent, connections were not being returned to the pool, which caused the pool to sit at its max size and block further requests. This particular stack trace confirmed our suspicion:

Stack trace showing botocore/urllib3/connectionpool

The root cause of the issue was an incompatibility between weakref.finalize and gevent's monkey patching, causing a non-deterministic deadlock that made the issue hard to reproduce. The immediate fix was to downgrade urllib3 to 1.26.15, after which the timeouts and the memory leak were gone!! The actual fix, which ensures urllib3 connection pooling is cooperative, was released in April 2025, and we have seen no issues after upgrading gevent to v25.4.1 and urllib3 to 1.26.16+.

It is unclear, though, why the Python version upgrade exposed the issue. In fact, the urllib3 upgrade was not part of the dependency upgrades we had done to prepare for Python 3.10! We had actually been running Python 3.8 with urllib3 v1.26.16 for about a year without any problems. Ironically, we had upgraded to v1.26.16 specifically because it logged the total connection count whenever a connection pool was full.

Some lessons we learned along the way:

  1. If you run into a memory leak that is affecting a live production system, you can use gunicorn's max_requests setting, which recycles worker processes after N requests (see the sketch after this list). This ensures your process or container does not run into OOM. While this helps mitigate the issue, it is critical to continue investigating the source of the memory leak.
  2. The gevent monitoring thread has an option to print a trace for greenlets which exceed a certain memory threshold (also sketched below). While I have personally never tried this, it could help find objects that are holding large amounts of memory, but not necessarily the source of a leak.
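For the first point, the worker-recycling knobs live in the gunicorn config; a minimal sketch, with illustrative numbers:

# gunicorn.conf.py: recycle workers as a stopgap against slow leaks
max_requests = 1000          # restart a worker after it has served 1000 requests
max_requests_jitter = 100    # add jitter so workers don't all restart at the same time

For the second point, gevent exposes a memory threshold on its config object; again, the budget below is illustrative:

# gevent: have the monitor thread emit events when the process exceeds a memory budget
import gevent

gevent.config.monitor_thread = True
gevent.config.max_memory_usage = 512 * 1024 * 1024   # bytes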

There is no silver bullet for debugging memory leaks; they are hard to track down. There are a few things you can look for, e.g. unbounded global caches, unreleased resources tied to database/network pools, and recently upgraded libraries. If you check the actual gevent/urllib3 issues, none of them talk about memory leaks, only timeouts. We just happened to run into a memory leak and set out to find the root cause of it 😀

Lyft is hiring! If you’re passionate about efficient database connection management, visit Lyft Careers to see our openings.
