Finding zombies in our systems: A real-world story of CPU bottlenecks
Vaibhav Shankar; Staff Software Engineer | Raymond Lee; Staff Software Engineer | Chia-Wei Chen; Staff Software Engineer | Shunyao Li; Sr. Software Engineer | Yi Li; Staff Software Engineer | Ambud Sharma; Principal Engineer | Saurabh Vishwas Joshi; Principal Engineer | Charles-A. Francisco; Senior Engineer | Karthik Anantha Padmanabhan; Director, Engineering | David Westbrook; Sr. Manager, Engineering
One day in early 2025, the Kubernetes platform team at Pinterest (PinCompute) got a ping from our partners on the ML platform team. Their Ray-based training jobs, which often take hours of computation on expensive GPU hardware, were crashing. Not every time, but often enough that it was becoming noticeable. Their logs indicated that the distributed training jobs were seeing intermittent loss of network connectivity, which ultimately caused them to crash. Their ask was simple:
- Why is this happening?
- Can you please make it stop?
What started there led to an investigation lasting more than three months and a great lesson in profiling performance bottlenecks. Read on for our fun story about CPU bottlenecks, AWS network drivers, and, yes, how we discovered zombies in our system!
Background: Ray at Pinterest
At Pinterest, Ray has emerged as the backbone of our next-gen ML training and inference. Over the past few years, it has enabled us to scale systems, accelerate experimentation, and significantly boost the performance of models powering our diverse ML workloads.
We have previously shared deep dives on our progress, including: Ray Infrastructure (provisioning ray cluster on in-house K8s clusters at scale [blog]), Batch Inference with Ray (scaling to hundreds of nodes [blog][talk]), Ray for Training (distributed dataloaders and throughput optimization [talk]), and Last-Mile Data Processing (reducing experimentation cycles [blog 1][blog 2]).
Today, we run more than half of the offline ML workload company-wide on Ray, provisioning tens of thousands of Ray clusters per month, a feat made possible only by a robust Kubernetes environment.
Network Model & Challenges

Figure 1: Ray architecture at Pinterest
What makes network stability challenging is Ray’s unique network model.
Ray operates as a highly “network-active” system. A Ray cluster generates constant, intensive inter-pod gRPC traffic that is fundamental to the cluster’s operation, with the following two distinct layers:
- Control Plane: Handles stateful operations, such as node health checks, task submission, actor scheduling, and the maintenance of object references.
- Data Plane: Handles the high-volume transfer of values within the Object Store. Our large-scale ML training relies on this plane to move data rapidly between nodes.
Because this traffic is highly distributed and latency-sensitive, the impact of network instability is often non-deterministic, manifesting across various components of the Ray cluster:
- Job Hanging: Caused by actor state corruption following brief network interruptions. [github issue]
- ObjectFetchTimedOutError / ObjectLossError
- ActorDiedError
- Node failed the health check and crashed
- …
All of these occurrences resulted in one common outcome: our Ray training jobs would crash (some use cases saw a success-rate drop of more than 25%), resulting in the loss of expensive compute hours and a significant slowdown in model building and experimentation. After grinding for over a month on individual issues in the Ray stack, the ML Platform team realized it was necessary to turn our attention to lower-level network issues together with our friends on the PinCompute team.
Symptom 1: Network driver resets
At Pinterest, our Kubernetes clusters are backed by AWS EC2 instances, which leverage the ENA Network driver (ref) as a standard traffic component. This Network driver works with AWS Elastic Network Interfaces (ENIs) and sets up receive and transmit queues for buffering packets. Our first symptom that something was wrong was identifying that whenever the ML training jobs failed with network connectivity issues, it correlated with a Network driver ‘reset’, as seen in our system logs.
[] ena 0000:20:03.0 eth0: TX q 5 is paused for too long (threshold 5000000). Time since last napi 6596000 usec. napi scheduled: 1
[] ena 0000:20:03.0 eth0: napi handler hasn't been called for a long time but is scheduled
# .... Bunch of stats excluded....
[] ena 0000:20:03.0: ENA Large LLQ is disabled
[] ena 0000:20:03.0: Device reset completed successfully, Driver info: Elastic Network Adapter (ENA) v2.11.0g
From the reference docs:
Q: What is [the] ENA device reset?
A: ENA device reset is a self healing mechanism that is triggered when the driver detects unexpected device behavior. Example of such behavior could be an unresponsive device, missing keep-alive events from the device, Tx completions timeouts, netdev timeout etc. The device reset is a rare event, lasts less than a millisecond and might incur loss of traffic during this time, which is expected to be recovered by the transport protocol in the instance kernel.
OK, so the driver saw Tx threads paused for an extended period (hardcoded to 5s in the AWS ENA kernel driver) and reset the device, which could cause some packet drops. A typical reason for resets was documented as CPU starvation, i.e., when the network driver’s threads don’t get CPU time for several seconds. So perhaps something CPU-intensive was starving out the network driver threads?
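Spotting these resets by eye in the logs doesn’t scale, so a simple counter over kernel logs is enough to correlate resets with job failures. A minimal sketch, assuming the driver emits the same “Device reset completed successfully” message shown above (on a live host you would feed it `dmesg` or `journalctl -k` output):

```shell
# Count ENA device resets in a kernel log capture.
# The match string is taken from the driver messages above; adjust it if your
# ENA driver version phrases the reset message differently.
count_ena_resets() {
  grep -c 'Device reset completed successfully' "$1"
}
```

A counter like this is easy to turn into the per-zone reset metric described later in this post.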
Symptom 2: CPU utilization
Our next observation was that some of the machines where we saw network resets exhibited high system CPU usage, which correlated nicely with the CPU-starvation theory in the ENA documentation. We speculated that our training jobs were leveraging inefficient memory allocators, resulting in high page-fault rates.

Figure 2: Page faults per second on impacted machines
We did what many reasonable people would do:
- We tried using huge pages (by enabling Transparent Huge Pages) to reduce page faulting.
- We experimented with more efficient memory allocators like jemalloc.
- We tried to give the training jobs their own CPU cores by providing them CPU affinity via taskset.
- Out of desperation, we played with interrupt pinning for ENA drivers by steering network interrupts to other cores.
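For reference, the knobs above looked roughly like the following on our hosts. This is a sketch only: the jemalloc library path, core ranges, IRQ number, CPU mask, and the `train.py` entry point are all illustrative values, not our exact configuration.

```shell
# 1. Transparent Huge Pages: reduce page-fault overhead
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# 2. jemalloc: swap in a more efficient allocator (library path varies by distro)
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python train.py

# 3. CPU affinity: pin the training job to its own set of cores
taskset -c 8-95 python train.py

# 4. IRQ steering: move ENA interrupts onto a reserved set of cores
#    (find the IRQ numbers for eth0 queues in /proc/interrupts; "ff" is an
#    example CPU bitmask covering cores 0-7)
echo ff | sudo tee /proc/irq/42/smp_affinity
```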
Nothing worked. While we saw some drop in overall CPU utilization and page faulting from the memory-allocator and huge-pages changes, the network resets continued. They sometimes happened very early in a training job run and sometimes several hours into its execution. Across hundreds of training job runs, it was hard to predict when exactly we’d see a network reset, if at all.
One mitigation did work, albeit briefly, and it’s everyone’s favorite IT Crowd advice: yes, we turned it off and on again. When we rebooted machines with high amounts of resets, they were able to run ML jobs just fine… that is, until they weren’t. We clocked it at approximately one week of uptime, after which the network resets returned on the rebooted machines.
Symptom 3: Availability zone differences
To further understand the problem, the ML Platform team started emitting metrics whenever an ENA reset was observed. Once the metrics were available, the team noticed something odd: the network resets were happening on machines in a single AWS Availability Zone, while jobs with identical parameters were running just fine in all other zones.

Figure 3: Network resets per Availability Zone
The PinCompute team runs zonal clusters (one Kubernetes cluster per Availability Zone), but when the team compared our cluster configurations across zones, they seemed identical: the same version of Kubernetes and the same system image. So, did we get a bad hardware batch!? We reached out to our excellent AWS support team and, after several engagements, were convinced that the issue was definitely not on the AWS side. Their analysis was clear: something on our machines in the us-east-1a zone was heavily using the CPU and starving the network threads. So why would only one Availability Zone’s machines exhibit this network reset behavior?
Profiling attempts: perf and mpstat
We decided it was time to move past high-level metrics and start profiling what was actually using the high amounts of CPU. Performance engineers know all about perf and its versatility: perf is a Linux profiler that can provide insights into ‘hot’ code paths, with call stacks indicating the CPU time spent by a particular process on a machine. Initially, our rudimentary perf snapshots revealed the usual suspects: page faulting and some heavy computation from our ML jobs. However, this didn’t indicate CPU starvation on its own.

Figure 4: perf snapshot on an impacted machine
We realized that CPU starvation may take as little as one heavily utilized CPU core blocking an unlucky network thread scheduled onto it. Moreover, our GPU machines had 96 vCPU cores, which meant an overall perf view told us very little about what was happening on each individual core.
To address this, we used mpstat to collect per-core utilization on a per-second basis for an hour, to identify whether specific cores were using large amounts of CPU. In our offline analysis, we found that a single CPU core (CPU 39 in the following screenshot) was sometimes pegged at 100% system CPU for multiple seconds, and this correlated with when a network reset happened. We were finally closing in on the root cause!

Figure 5: 100% System CPU utilization on a single core (Core 39) when profiled per second.
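The capture-and-scan workflow can be sketched as follows. The awk field positions depend on your sysstat version and locale (field 6 matched %sys in our output), so treat the filter as a starting point rather than a drop-in tool.

```shell
# Capture per-core stats every second for an hour (run on the target host):
#   mpstat -P ALL 1 3600 > mpstat.log

# Post-hoc: list the cores that ever hit >=95% in the %sys column.
# The regex on $2 skips header rows and the "all" aggregate line.
find_hot_cores() {
  awk '$2 ~ /^[0-9]+$/ && $6 + 0 >= 95 { print $2 }' "$1" | sort -un
}
```

Running `find_hot_cores mpstat.log` on a capture like Figure 5’s would print the pegged core numbers, which can then be cross-referenced against the reset timestamps.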
Given these network resets were happening at unpredictable times and we lacked perf runs from the times of the reset, we were still missing one key detail: what process was using up the CPU for this extended period of time?
Temporal profiling: Time is an important factor
We realized that if there was a sporadic process (think something in your crontab or some kind of periodic sync loop in a process) that was causing high CPU utilization at specific times on the machine, then a random perf sample wouldn’t tell us about that. We needed a tool like gProfiler to be running for an extended period of time and then ‘time travel’ to a specific point in time to look at what was happening on the CPU cores at that time. Unfortunately, at the time of this incident, we didn’t have gProfiler running everywhere within our fleet, but the principles were sound! Thanks to some creative setup from our ML platform team, we created the following experimental setup:
1. Reserved a small number of machines (via Kubernetes taints) for analysis
2. Kicked off a series of training jobs in parallel on these machines. For simplicity, we repurposed our in-house hyperparameter-tuning system to orchestrate identical model training across the reserved machines, allowing each training run’s resource footprint to remain fairly constant.
3. Kicked off a script that ran perf in 2-minute increments, with profiles and CPU stack data saved to disk. The script looked a bit like this and ran on all of our reserved machines as a system process.
# Bash program to generate CPU stack snapshots on a machine.
# Run perf record for 2 minutes at a time, since each perf data file can become
# very large over longer periods. Record the start time in the filename for
# 'time traveling' later! Running this 360 times covers roughly a 12-hour period.
for i in {1..360}; do
  sudo perf record -F 97 -g -a \
    -o "perf-$(hostname)-$(date +%Y%m%d-%H-%M-%S)-120s.data" -- sleep 120
done

# Generate perf stacks from each data file
for datafile in perf-*.data; do
  perf script --header -i "$datafile" > "$datafile.stacks"
done
4. We ran the data collection overnight (~12 hours) and waited for a reset to be triggered. Since our ML training jobs typically ran for 8–12 hours, we were confident that we would observe a reset over this period across at least a subset of the training jobs.
Sure enough, when we came to analyze the data the next day, we found that network driver resets had been triggered along with job failures. Unlike before, we now had perf data to examine from the time of the reset! We fetched the perf results for the 2-minute window around the reset event and visualized them with the excellent FlameScope tool, courtesy of our friends at Netflix. FlameScope gives us a time-travel view of a 2-minute CPU profile, allowing us to zoom into a subset of the window and observe what was happening on the CPU at that moment. From the ENA reset logs, we found that the reset had happened about 70 seconds into this profile, so we zoomed in to a 5-second region around the reset.

Figure 6: Temporal high-level view of CPU utilization from flamescope. X-axis is time from 0–120 seconds for the 2 minute snapshot
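If you want to reproduce this workflow, FlameScope runs locally against the `.stacks` files generated earlier. The steps below follow the project README as we used it at the time; check the repository for current instructions, and note that the profile directory is configurable.

```shell
# Fetch and launch FlameScope, then browse to the local web UI it prints.
git clone https://github.com/Netflix/flamescope
cd flamescope
pip install -r requirements.txt

# FlameScope reads profiles from its configured profile directory
# (see app/config.py); drop the perf .stacks files there.
cp /path/to/perf-*.stacks examples/

python run.py
```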
Our first observation was that the kubelet, the Kubernetes node agent, was occupying ~6.5% of total CPU usage a few seconds before an ENA reset. This was alarming and interesting, because the rest of the time the kubelet barely broke 1% of CPU usage.

Figure 7: Profile of the CPU just before ENA resets. Notice the high kubelet utilization.
We zoomed in a bit deeper and found that the kubelet was spending a lot of time in a kernel function: mem_cgroup_nr_lru_pages.

Figure 8: Zoomed in profile of the CPU stacks for the kubelet process
We now had a suspect: something was causing the kubelet to iterate over all the memory cgroups on the host, spending significant time on the CPU. While researching this, we came across an excellent post on the Oracle blog describing the problem of zombie memory cgroups. Could we be running into this problem? Fortunately, that blog post guided us perfectly, and we saw the following on a machine with network driver resets:
# Kernel-tracked cgroups (including zombies)
$ cat /proc/cgroups | grep memory | awk '{print $3}'
68680
# Actual cgroups
$ find /sys/fs/cgroup/memory/ -type d | wc -l
240
Yup, we definitely had zombies! Nearly 70,000 memory cgroups were tracked in the kernel, but only 240 were actually in use. Iterating over that long list in the kernel was likely what caused the CPU utilization spikes on a single core, and if a network thread landed on that core at just the right time, it could be starved! But what was causing the build-up of memcgs?
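The manual check above generalizes to a small sketch suitable for fleet-wide monitoring, assuming the cgroup v1 layout we run (a `memory` row in /proc/cgroups, with live groups as directories under /sys/fs/cgroup/memory):

```shell
# Report kernel-tracked vs. live memory cgroups; a large gap means zombies.
kernel_memcg_count() {    # $1: path to /proc/cgroups (or a captured copy)
  awk '$1 == "memory" { print $3 }' "$1"
}
live_memcg_count() {      # $1: path to /sys/fs/cgroup/memory
  find "$1" -type d | wc -l
}
```

Emitting the difference of the two counts as a gauge would have surfaced this leak long before it starved any network threads.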
Beware of system defaults
Our theory at this point was that the build-up of memcgs came from some crashlooping container that kept re-creating cgroups and leaking memcgs. We didn’t see any such container created by Kubernetes, but we spotted a container that was always only a few seconds old when we queried the Docker API:
$ docker ps -a
CONTAINER ID   IMAGE                            COMMAND    CREATED          STATUS                             PORTS   NAMES
c6fdfc760921   amazon/amazon-ecs-agent:latest   "/agent"   11 seconds ago   Up 10 seconds (health: starting)           ecs-agent
Why was the Amazon ECS agent running (and repeatedly crashing!) on our Kubernetes nodes? This was certainly unintentional, given that ECS is an alternative container orchestration platform we weren’t using. It turns out that for our GPU instances we were using the AWS Deep Learning AMI (Ubuntu 20.04) as the base machine image, and it ships with ecs-agent as a default systemd unit. The machine’s bootstrap process started the ECS agent, which crashed repeatedly — we had never given our machines permission to join an ECS cluster, so the container could never start up successfully — and, over several days, accumulated a massive number of memory cgroups. This also explained why rebooting the machines gave us temporary relief: rebooting reset the memcg counts!
We fixed the issue by simply turning off the ECS agent systemd unit in our base images and rebooting all our machines to purge the zombie memcgs. After this, we noticed that memory cgroups remained stable and most importantly, Ray Training jobs were running with their expected high success rate again. The problem of ENA resets and the zombies in our machines was fully resolved and our ML training teams could go back to building awesome new models to serve Pinterest customers!
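On our images, the fix amounted to something like the following. The unit name (`ecs`) may differ on your AMI, so verify it with `systemctl list-unit-files` first.

```shell
# Stop the ECS agent now and prevent it from starting on future boots.
sudo systemctl disable --now ecs

# Reboot to purge the zombie memcgs the kernel has already accumulated.
sudo reboot
```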
Hold on! What about the availability zones disparity?
Oh… right. Well, erm, we messed up a little. When we said the node configurations were identical across the two clusters, that was only mostly true. For our Kubernetes cluster in the unaffected Availability Zone, we had an independent bug: we delivered the same Kubernetes binary via two different URLs to the two clusters. Long story short, the difference in URLs caused a final step that emitted a metric to fail, which marked the node bootstrap script as failed. That prevented the ECS agent from starting up, because its systemd unit depended on the bootstrap script completing successfully, and in turn allowed those nodes to remain ‘healthy’, at least from the perspective of not accumulating memcgs! The Kubernetes team was aware of the URL discrepancy and was independently working on a fix, which would eventually have brought the network reset issue to the unaffected Availability Zone as well.
Key Takeaways
- Introducing fleet-wide metrics to track transient issues on the platform helps identify failure patterns. In this case, it helped us understand that the issue was correlated with AZ/cluster setup, which led us to isolate and consistently reproduce the problem.
- Create reproducible, closed environments for iterative debugging. In our case, the partnership between the PinCompute and ML Platform teams to set up debugging experiments was critical to quickly identifying the root cause of the issue.
- Invest in profiling tools and especially temporal profiling tools! They’re great and will save you hours and hours when working on hard-to-debug performance problems. At Pinterest, we’re developing and rolling out gProfiler in close collaboration with Intel for debugging situations like this.
- Be aware of what processes are running on your base OS images. Sometimes, the defaults aren’t necessarily the right ones for your environment. Invest in profiling the success rate of your systemd units and watch out for the impact of regular failures.
- When looking at differences between two environments that look the same but act differently, look closer. You’re probably missing some piece of configuration that is causing the two paths to diverge. Better yet, invest in good automated tooling to ensure your environments are truly identical.