Spin Infrastructure Adventures: Containers, Systemd, and CGroups

The Spin infrastructure team works hard at improving the stability of the system. In February 2022 we moved to Container-Optimized OS (COS), the Google-maintained operating system for its Kubernetes Engine offering. A month later we turned on multi-cluster support to allow for increased scalability as more users came on board. Recently, we’ve dramatically increased the default resources allotted to instances. With all these changes, however, we’re still experiencing some issues, and for one of those I wanted to dive a bit deeper in this post.

Spin’s Basic Building Blocks

First, it's important to know the basic building blocks of Spin and how these systems interact. The Spin infrastructure is built on top of Kubernetes, using many of the same components that Shopify’s production applications use. Spin instances themselves are implemented via a custom resource controller that we install on the system. Among other things, the controller transforms the Instance custom resource into a pod that’s booted from a special Isospin container image, along with the configuration supplied during instance creation. Inside the container we use systemd as a process manager and workflow engine to initialize the environment, including installing dotfiles, pulling application source code, and running through bootstrap scripts. systemd is vital because it provides a structured way to manage system initialization, and Spin relies on this heavily.

There’s definitely more to what makes Spin than what I’ve described, but from a high level, and for the purposes of understanding the technical challenges ahead, it's important to remember that:

  1. Spin is built on Kubernetes
  2. Instances are run in a container
  3. systemd is run INSIDE the container to manage the environment.
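That last point is easy to confirm from a workstation with cluster access. A quick check (the pod name here is just an example instance name, like the one used later in this post) shows systemd running as PID 1 inside the instance container:

    $ kubectl exec spin-muhc -- ps -o pid,comm -p 1
      PID COMMAND
        1 systemd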

First Encounter

In February 2022, we had a serious problem with Pod relocations that we eventually tracked to node instability. We had several nodes in our Kubernetes clusters that would randomly fail and require either a reboot or replacement entirely. Google had decent automation that would catch nodes in a bad state and replace them automatically, but it was occurring often enough (five nodes per day, or about one percent of all nodes) that users began to notice. Through various discussions with Shopify’s engineering infrastructure support team and Google Cloud support, we eventually homed in on memory consumption as the primary issue. Specifically, nodes were running out of memory and pods were being out of memory (OOM) killed as a result. At first, this didn’t seem so suspicious: we gave users the ability to do whatever they wanted inside their containers and didn’t provide them with an abundance of resources (8 to 12 GB of RAM each), so it was natural to assume that containers were simply using too many resources. However, we found some extra information that made us think otherwise.

First, a container being OOM killed would occasionally be the only Spin instance on its node, and when we looked at its memory usage, it was often below the memory limit allotted to it.

Second, in parallel, another engineer investigating a Kafka performance issue identified a healthy running instance using far more resources than should have been possible.

The first issue would eventually be connected to a memory leak on the host node, and through some trial and error we found that switching the host OS from Ubuntu to Google’s Container-Optimized OS solved it. The second issue remained a mystery. With the rollout of COS, though, we saw a 100x reduction in OOM kills, which was sufficient for our goals, and we began to direct our attention to other priorities.

Second Encounter

Fast forward a few months to May 2022. We were experiencing better stability, which was a source of relief for the Spin team. Our ATC rotations were significantly less frantic, and the infrastructure team had the chance to roll out important improvements, including multi-cluster support and a whole new snapshotting process. Overall, things felt much better.

Slowly but surely over the course of a few weeks, we started to see increased reports of instance instability. We verified that the nodes weren’t leaking memory as before, so it wasn’t a regression. This is when several team members re-discovered the excess memory usage issue we’d seen before, but this time we decided to dive a little further.

We needed a clean environment to do the analysis, so we set up a new Spin instance on its own node. During our test, we monitored the resource usage of the Pod and of the node it was running on, using kubectl top pod and kubectl top node. Before we performed any tests we saw:
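Output along these lines (the pod name matches the example instance used below; the node name and exact numbers are illustrative, with the node carrying its usual system overhead):

    $ kubectl top pod
    NAME        CPU(cores)   MEMORY(bytes)
    spin-muhc   11m          14Mi

    $ kubectl top node gke-node-example
    NAME               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    gke-node-example   319m         8%     23000Mi         24%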

Next, we needed to simulate memory load inside of the container. We opted to use a tool called stress, which let us start a process that consumes a specified amount of memory to exercise the system.

We ran kubectl exec -it spin-muhc -- bash to land in a shell inside the container, and then stress -m 1 --vm-bytes 10G --vm-hang 0 to start the test.

Checking the resource usage again, we saw:
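Something like the following (numbers are illustrative; the point is that the Pod’s memory usage grew by roughly the 10GB we asked stress to hold, and the node’s usage grew with it):

    $ kubectl top pod
    NAME        CPU(cores)   MEMORY(bytes)
    spin-muhc   620m         10254Mi

    $ kubectl top node gke-node-example
    NAME               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    gke-node-example   943m         24%    33450Mi         35%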

This was great: exactly what we expected. The 10GB used by our stress test showed up in our metrics. Also, when we checked the cgroup assigned to the process, we saw it was correctly assigned to the Kubernetes Pod:
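For the memory controller, the entry looked roughly like this (the hierarchy number, QoS class, pod UID, and container ID are placeholders and will vary):

    $ cat /proc/24899/cgroup | grep memory
    4:memory:/kubepods/burstable/pod<pod-uid>/<container-id>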

Where 24899 was the PID of the process started by stress. This looked great as well. Next, we performed the same test, but in the instance environment accessed via spin shell. Checking the resource usage we saw:
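Roughly the following (the 14Mi pod figure and 33504Mi node figure are the ones discussed below; the other values are illustrative):

    $ kubectl top pod
    NAME        CPU(cores)   MEMORY(bytes)
    spin-muhc   640m         14Mi

    $ kubectl top node gke-node-example
    NAME               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
    gke-node-example   961m         24%    33504Mi         35%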

Now this was odd. The memory created by stress wasn’t showing up under the Pod stats (still only 14Mi), but it was showing up for the node (33504Mi). Checking the usage from inside the container, we saw that it was indeed holding onto the memory as expected:
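For example, checking the resident set size of the stress worker from inside the instance shows roughly the 10GB we requested (PIDs and exact values are illustrative):

    $ ps -o pid,rss,comm -C stress
      PID      RSS COMMAND
    25671     1044 stress
    25672 10486784 stress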

However, when we checked the cgroup this time, we saw something new:
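Instead of the kubepods hierarchy, the memory cgroup pointed at a path created by the instance’s own systemd. The exact path shown here is illustrative (a login-session scope is one likely shape); what matters is that it isn’t under /kubepods at all:

    $ cat /proc/$(pgrep -n stress)/cgroup | grep memory
    4:memory:/user.slice/user-1000.slice/session-4.scope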

What the heck!? Why was the cgroup different? We double-checked that this was the correct hierarchy by using systemd-cgls, the systemd cgroup list tool, from within the spin instance:
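The tree it prints is the one managed by the container’s systemd, and the stress process shows up under a scope that systemd created for our shell session. Abbreviated, illustrative output (the slice and scope names will differ):

    $ systemd-cgls --no-pager
    Control group /:
    -.slice
    ├─init.scope
    │ └─1 /sbin/init
    ├─system.slice
    │ └─...
    └─user.slice
      └─user-1000.slice
        └─session-4.scope
          ├─... bash
          └─... stress -m 1 --vm-bytes 10G --vm-hang 0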

So to summarize what we had seen: 

  1. When we run processes inside the container via kubectl exec, they’re correctly placed within the kubepods cgroup hierarchy. This is the hierarchy that contains the Pod’s memory limits.
  2. When we run the same processes inside the container via spin shell, they’re placed within a cgroup hierarchy that doesn’t contain the limits. We verified this by checking the cgroup’s memory limit file directly:
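Reading the limit for the hierarchy the spin shell processes landed in (the path is a placeholder; the value is cgroup v1’s "no limit" sentinel):

    $ cat /sys/fs/cgroup/memory/<hierarchy-created-by-the-container-systemd>/memory.limit_in_bytes
    9223372036854771712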

The value above is close to the maximum value of a 64-bit integer (about 8.5 billion gigabytes of memory). Needless to say, our system has less than that, so this is effectively unlimited.

For practical purposes, this means any resource limitation we put on the Pod that runs a Spin instance isn’t being honored. Spin instances can therefore use more memory than they’re allotted, which is concerning for a few reasons, but most importantly because we depend on those limits to keep instances from interfering with one another.

Isolating It

In a complex environment like Spin it’s hard to account for everything that might be affecting the system. Sometimes it’s best to distill problems down to the essential details to properly isolate the issue. We were able to reproduce the cgroup leak in a few different ways: first on real Spin instances directly, using crictl or ctr with custom arguments, and second in a local Docker environment. Setting up an experiment like this also allowed for much quicker iteration when testing potential fixes.
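The local reproduction boils down to booting a systemd container with a memory limit and the same read-only cgroup bind mount our pods get, then asking the container’s systemd to launch a memory hog. A minimal sketch follows; the image name is a placeholder for any image whose entrypoint is systemd and that has bash and stress installed, and extra flags may be needed depending on the image:

    # boot systemd inside a memory-limited container
    $ docker run -d --name cgroup-test --memory 2g --tmpfs /run --tmpfs /run/lock \
        -v /sys/fs/cgroup:/sys/fs/cgroup:ro my-systemd-image /sbin/init

    # have the *container's* systemd start the workload, much like spin shell would
    $ docker exec cgroup-test systemd-run stress -m 1 --vm-bytes 1G --vm-hang 0

    # compare where the runtime put PID 1 with where the inner systemd put stress
    $ docker exec cgroup-test bash -c 'grep memory /proc/1/cgroup /proc/$(pgrep -n stress)/cgroup'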

From the experiments, we discovered differences in how the runtimes (containerd, Docker, and Podman) execute systemd containers. Podman, for instance, has a --systemd flag that enables or disables an integration with the host systemd. containerd has a similar flag, --runc-systemd-cgroup, that starts runc with the systemd cgroup manager. For Docker, however, no such integration exists (you can modify the cgroup manager via daemon.json, but not via the CLI like with Podman and containerd), and we saw the same cgroup leakage. When comparing the cgroups assigned to the container processes between Docker and Podman, we saw the following:

Docker
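With Docker (using the default cgroupfs driver), PID 1 sits in the cgroup the runtime created, but the workload spawned by the container’s systemd escapes it. The shape of what we observed, with IDs and unit names as placeholders:

    # memory cgroup of the container's systemd (PID 1 in the container)
    4:memory:/docker/<container-id>
    # memory cgroup of the stress process spawned via the container's systemd
    # (not under /docker/<container-id>, so no container limit applies)
    4:memory:/<hierarchy-created-by-the-container-systemd>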

Podman
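With Podman (which on our setup uses --systemd and the systemd cgroup driver), both systemd and anything it spawns stay nested under a scope unique to the container, so the container’s limits keep applying. Again, IDs and unit names are placeholders:

    # memory cgroup of the container's systemd (PID 1 in the container)
    4:memory:/machine.slice/libpod-<container-id>.scope
    # memory cgroup of the stress process spawned via the container's systemd
    4:memory:/machine.slice/libpod-<container-id>.scope/<unit-created-by-its-systemd>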

 

Podman placed the systemd and stress processes in a cgroup unique to the container. This allowed Podman to properly delegate the resource limitations to both systemd and any process that systemd spawns. This was the behavior we were looking for!

The Fix

We now had an example of a systemd container being properly isolated from the host with Podman. The trouble was that our Spin production environments run Kubernetes with containerd, not Podman, as the container runtime. So how could we leverage what we learned from Podman toward a solution?

While investigating the differences between Podman and Docker with respect to systemd, we came across the crux of the fix. By default, Docker and containerd use a cgroup driver called cgroupfs to manage the allocation of resources, while Podman uses the systemd driver (this is specific to our host operating system, Google’s COS). The systemd driver delegates responsibility for cgroup management to the host systemd, which then properly manages the delegated systemd that’s running in the container.
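For reference, this is roughly what opting into the systemd cgroup driver looks like on a containerd-based node. The file locations shown are the common defaults and may differ on a managed OS like COS, where these settings are maintained by Google:

    # containerd: tell the runc shim to use the systemd cgroup manager
    $ grep -A1 'runc.options' /etc/containerd/config.toml
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      SystemdCgroup = true

    # kubelet: its cgroup driver must match the runtime's
    $ grep cgroupDriver /var/lib/kubelet/config.yaml
    cgroupDriver: systemd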

It’s recommended that nodes running systemd on the host use the systemd cgroup driver by default; however, COS from Google is still set to use cgroupfs. Checking the developer release notes, we see that version 101 of COS mentions switching the default cgroup driver to systemd, so the fix is coming!

What’s Next

Debugging this issue was an enlightening experience. If you had asked us before, "Is it possible for a container to use more resources than it’s assigned?", we would have said no. But now that we understand more about how containers deliver the sandbox they provide, it’s become clear the answer should have been, "It depends."

Ultimately, the escape came from us bind-mounting /sys/fs/cgroup read-only into the container. A subtle side effect of this is that while the directory itself isn’t writable, all of its subdirectories are. But since this mount is required for systemd to even boot up, we don’t have the option of removing it. There’s a lot of ongoing work by the container community to get systemd to exist peacefully within containers, but for now we’ll have to make do.
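You can see the subtlety from inside an instance: the top-level mount is read-only, but the per-controller mounts beneath it are not (output abbreviated and illustrative):

    $ mount | grep /sys/fs/cgroup | head -3
    tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,...)
    cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,...)
    cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,...)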

Acknowledgements

Special thanks to Daniel Walsh from Red Hat for writing so much on the topic, and to Josh Heinrichs from the Spin team for investigating the issue and discovering the fix.

Additional Information

Chris is an infrastructure engineer with a focus on developer platforms. He’s also a member of the ServiceMeshCon program committee and a @Linkerd ambassador.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.
