Adventures in Garbage Collection: Improving GC Performance in our Massive Monolith
At the beginning of this year, we ran several experiments aimed at reducing the latency impact of the Ruby garbage collector (GC) in Shopify's monolith. Everything described in this article was a team effort, huge thanks to Jason Hiltz-Laforge for his extended knowledge of our production platform, to Peter Zhu for his deep knowledge of the Ruby garbage collector, as well as to Matt Valentine-House, Étienne Barrié, and Shane Pope for their contributions to this work.
In this article we'll talk about the changes we made to improve GC performance, and more importantly, how we got to these changes.
The work consisted of several rounds of improving logging and metrics, interpreting those to form a hypothesis around a change which would be beneficial, testing and shipping that change, and then evaluating whether it should be kept, tweaked, or reverted.
While the narrative in this article might make it seem like we went straight from problem to result, there were several dead ends, incorrect assumptions, and experiments which did not produce the expected results. All as you would expect from an optimization exercise on a dynamic and complex system.
A Word of Caution
Before we get into what changes we made, I really want to stress that you should never blindly apply GC settings in your apps without first establishing a proper way to measure the impact of such changes. Modifying the GC settings can radically impact the performance of an application, both in a positive and negative way.
Ruby’s default GC settings are fairly decent and are performing well for the vast majority of Ruby applications, so I strongly advise not to tune it unless you have strong evidence it’s substantially impacting the performance of your application.
With that said, let’s lay out the problem.
Shopify's Monolith has a Garbage Problem
In Ruby, everything is an object and nearly every operation creates one or more objects. These objects are not actively freed by the programmer—when an object is no longer referenced, it is freed by Ruby's garbage collector.
How this works is a deep and complex topic and it's not something we're going to try to fully explain here. There are plenty of great articles if you want to dig deeper.
But in general we can say that Ruby has a conservative, generational, incremental, mark-and-sweep tracing garbage collector.
The key thing to understand here is that when the Ruby garbage collector runs, all other execution is paused. So when the garbage collector decides that it needs to do some work, it typically does so during the life of a request. This garbage collection work results in increased latencies for client requests, causing merchants and buyers to wait longer than they otherwise would for an answer.
Here's a 24-hour slice of the distribution of time spent in GC per request when we started looking into the problem:
These graphs show the distribution of time spent in garbage collection for requests that have GC information logged (1% of all requests). The median request in this group spent 3ms doing some form of garbage collection work, which is totally fine. The p99 request was spending ~323ms in garbage collection, which is already not great. But then as we go toward the tail end, we see some requests spending multiple seconds doing GC work. That was the main issue we were after.
Generational Garbage Collectors
The reason the distribution looks like this is in big part because Ruby's GC is generational. Each Ruby object has an age, which starts at 0
when it's first allocated. Every time the GC is triggered, every object that survives has its age incremented. When an object reaches the maximum age of 3
, it’s promoted to the old generation.
Then when the GC triggers, it can either do a major or minor marking based on multiple heuristics. A major marking will mark all objects, whereas a minor marking will only go over young objects, which are only a small subset of all objects.
Hence minor marks are quite fast, in our case ~3ms as shown by the latency graph, and majors very slow, over 4 seconds.
However because Ruby's GC is also “incremental”, majors aren't always done in one chunk. The GC can mark one part of the heap, then allow Ruby code to execute for a bit, and resume marking later.
Our main goal here was to make major marks much less frequent, so we had to track down why major marks were triggered in the first place, and solve that problem.
Tuning the Garbage Collector
When we started this work, the application already had some GC tuning in place from a few years ago:
But clearly, these were either outdated or not quite right. Each of these environment variables is documented in the Ruby man page, but we'll explain some of these in more detail later in the article.
Why was Major Marking Triggered?
To answer this question, we started logging [GC.latest_gc_info(:major_by)](https://docs.ruby-lang.org/en/3.2/GC.html#method-c-latest_gc_info)
which contains the reason why the last major GC was triggered.
What we observed was that about 75% of the major marks were triggered because of :oldmalloc
, and almost 25% because of :shady
.
The oldmalloc
limit
oldmalloc
is a counter of the number of bytes allocated. Every time memory is allocated through malloc, the number of allocated bytes is added to that counter and once a threshold is reached, a major GC is triggered. Every time a major GC is triggered, that counter is reset to 0.
The assumption made by the Ruby GC is that if old objects keep allocating frequently, it's probable that there's some churn in their references and that some old objects may be reclaimed.
The state of the counter can be queried via GC.stat(:oldmalloc_increase_bytes)
, and the limit that will trigger a major GC can be queried via GC.stat(:oldmalloc_increase_bytes_limit)
.
It’s also possible to configure that limit through environment variables, namely: RUBY_GC_OLDMALLOC_LIMIT
, RUBY_GC_OLDMALLOC_LIMIT_MAX
, and RUBY_GC_OLDMALLOC_LIMIT_GROWTH_FACTOR
.
We believed that this heuristic isn't really relevant for our type of workload, so we decided to set it to a ridiculous 128GB, which pretty much disables it:
We have to set both the LIMIT
and LIMIT_MAX
, because normally, every time the limit
is reached, it grows by RUBY_GC_OLDMALLOC_LIMIT_GROWTH_FACTOR
such as:
If you have a keen eye, you may have noticed that the old config didn't set RUBY_GC_OLDMALLOC_LIMIT_MAX
at all, which means it actually had no effect...
We applied that change to 50% of our production process, and immediately noticed a substantial reduction in major marking by about 20%:
At this point you may wonder why the reduction wasn't by 75%—it's because a train can hide another. Whenever a full GC triggers, it resets most of the heuristics, so solving one cause can sometimes just allow another heuristic to trigger much more frequently.
When we looked at the logs for the reason for GC, now it was overwhelmingly caused by nofree
.
Object Slots
To understand the nofree
heuristic, you first need to know that the Ruby heap consists of pages of slots. When the VM needs to allocate an object, it looks for an empty slot to store the object in. If no slots are available, a minor marking will be triggered to make some space. After that minor marking is done, if there aren't enough free slots, a major marking may be triggered to try to make more space. Finally, if that doesn’t work, more slots will be created.
The nofree
reason indicates this situation, and all these limits can also be controlled via some more environment variables:
Generally speaking, nofree
happens early in the lifecycle of a Ruby process. Once the process has warmed up a bit, it generally has enough slots that it constantly recycles. However, Shopify's monolith is deployed approximately every 30 minutes, so aside from weekends, we're pretty much constantly in that “warm-up phase”.
The solution to this is the RUBY_GC_HEAP_INIT_SLOTS
variable, which defines how many slots Ruby should allocate immediately upon starting. The old GC config was preallocating 1.6M slots. After instrumenting GC.start(:heap_live_slots)
in production, we determined we needed closer to 15M slots.
If we were still on Ruby 3.1, we could have just set RUBY_GC_HEAP_INIT_SLOTS=15000000
and we'd be done. However, we upgraded to Ruby 3.2 in early January, which introduced variable width allocation.
In Ruby 3.2 there are now 5 different types of slot, which you can inspect through GC.stat_heap
:
For each pool, heap_eden_slots
tells us how many slots are either free or in use, and force_major_gc_count
tells us how many times Ruby had to trigger major marking because it was out of free slots in that particular pool. By instrumenting these counters in production, we were able to determine how many slots we should preallocate in each pool.
But in Ruby 3.2, if you simply set RUBY_GC_HEAP_INIT_SLOTS
, the initial slots will be split between all size pools relative to their size:
That makes sense, but that makes it impossible to properly pre-allocate exactly what we need. On a warmed up worker in production, the sizes looked more like:
With just RUBY_GC_HEAP_INIT_SLOTS
, we'd have to either terribly over-allocate some pools, or to terribly under-allocate some others, which was really not acceptable.
Implementing Finely-Grained Preallocation
To test the effectiveness of properly preallocating each pool size without wasting memory, we devised a terrible hack that we loaded in config/boot.rb
:
And then we were able to finely control it via environment variables:
As soon as we deploy the dirty hack, we noticed a major positive effect on the GC tail latency (gc_tuned_2
):
We tweaked the numbers several times to find a sweet spot, but overall we were very satisfied with the results, so we immediately replaced the hack with a proper Ruby feature and backported it to our own builds of Ruby 3.2.
We were happy with the progress, but still not happy with the overall time spent in GC, so we kept digging.
Shady Objects
The next source of major marking we looked at was shady
. This one is a bit hard to explain, as it requires an understanding of the generational GC and Write Barriers.
In short, to be promoted to the old generation (hence only be marked on majors), Ruby objects must pledge to always notify the GC whenever they update one of their references to other objects. That's what we call “write barriers”. They can be tricky to implement, and any forgotten write barrier may lead to weird bugs, as the referenced object might be garbage collected and its object slot reused by another object. Because of this, there are still a good number of Ruby C extensions that don't implement write barriers, and even some core Ruby types don't implement them.
Normally, when an old object starts referencing a young object, the young object is immediately promoted to the old generation. But if the referenced object didn't pledge to trigger write barriers, that's impossible, so the GC needs to keep that young object in a list to make sure to mark it on every minor. That's what a shady
object is.
Internally in the Ruby VM, the term shady
has been replaced by uncollectible_wb_unprotected_object
, but shady
is still the keyword you'll see in GC.latest_gc_info
.
Since these objects are slowing down minor markings, a major mark is automatically triggered when too many of them are accumulated in the hope to free some of them.
Like with oldmalloc
, these counters can be introspected through GC.stat(:remembered_wb_unprotected_objects)
and GC.stat(:remembered_wb_unprotected_objects_limit)
. However, this limit isn't really configurable like the others.
At this stage we could have introduced an environment variable to raise the shady
object limit, but that would have meant slowing down minor marks to make major marks less frequent—hardly a pleasant tradeoff. So instead we decided to address the root cause of the issue and started implementing write barriers on the most common offenders we found in heap dumps taken in production with ObjectSpace.dump_all
.
Overall we made close to forty pull requests to Ruby and various popular gems, way too many to list here. But to give an idea of what it entailed, some notable ones were [Time](https://github.com/ruby/ruby/pull/7244)
, [BigDecimal](https://github.com/ruby/bigdecimal/pull/248)
, [Method](https://github.com/ruby/ruby/pull/7237)
, [oj](https://github.com/ohler55/oj/pull/842)
, [google-protobuf](https://github.com/protocolbuffers/protobuf/pull/11793)
, [ffi](https://github.com/ffi/ffi/pull/991)
,
All these upstream contributions took time, so we kept doing them in the background while continuing to look for other GC tuning opportunities. When we started this project, our production heaps had around 160,000 shady
objects in them, as we're writing this, it’s down to around 3,000, and most of that is from google-protobuf
and ffi
which haven’t been released since our fixes were merged.
Resurrecting Out-of-Band Garbage Collection
At this stage of the effort, we didn't have much GC tuning knobs left to turn. Since the beginning, our approach was to try to eliminate all of the causes of major marking, but we eventually realized that’s simply not possible.
A fraction of the ephemeral objects that should be released at the end of the request cycle will inevitably be promoted to the old generation, and only a major mark can possibly collect them. Consequently, major marks are a necessary part of the life of a Ruby process.
So if we can't eliminate major GC, the next best thing would be to trigger it when the process is not currently rendering a request.
This isn't a new idea by any means. It was even quite popular in the days before Ruby 2.1 when generational and incremental GC was added, but over time it fell out of favor because of its throughput impact and is generally considered a deprecated feature.
Something to note is that historically Out-of-Band (OOB) Garbage Collection was triggered after every request, and indeed, with our full marks taking over 4 seconds, that would be a tremendous waste of resources.
But given all the improvements we previously shipped, we didn't need to trigger a major mark after every request cycle, just once in a while would be perfectly fine. So we decided to try it with a frequency that seemed fair—once every 100 requests—and the results were beyond our expectations.
First, major marking during request cycles became pretty much a fringe event:
Note: default
was the original GC config before this tuning effort, gc_tuned_2
was our current GC tuning effort, and gc_tuned_3
was our new tuning plus Out-of-Band Garbage Collection.
Then the very tail end (P99.99 and P99.9) time spent in GC was reduced almost 10x:
On P99, it was a bit more modest, but still extremely good:
And most surprisingly, it basically eliminated GC pauses for the median requests:
This wasn't what we were after, but it was a nice surprise.
One concern however was the extra CPU utilization, given that historically this was what doomed the OOB GC technique. The 100 request interval was chosen purely arbitrarily, so we experimented with different intervals, and settled with a variable frequency that starts at once every 128 requests, going up to once every 512 as the process ages.
The reasoning being that recently booted processes tend to have some improperly eagerloaded objects, so as they process their first requests, they tend to allocate objects that won't ever be released. But the number of these potentially lazy loaded objects is bounded, so as the process gets older, the number of old objects tend to stabilize, hence we need less frequent GC.
Reducing the frequency didn't degrade the tail latency in any way, but did make the median GC time per request jump back to ~2ms, which we found acceptable. Tail latency was really what we wanted to improve.
We could also attempt to be smarter than that and try to guess how likely Ruby is to trigger a major mark on the next request by introspecting GC.stat
, but we haven't explored that avenue yet.
Wrapping Up and Looking Ahead
After about three weeks of this effort, we decided we were satisfied with the results and applied the new configuration to the entirety of the fleet.
Amusingly, since we were generally testing our modified settings on 50% or 33% of the production fleet, the progress of our optimization effort is visible on the global time spent in GC over the course of the three weeks:
We haven't stopped working on the Ruby GC though. The things we learned during this tuning effort have inspired multiple longer term projects for the Ruby Infrastructure team at Shopify to improve the GC itself.
One thing we'd like to explore is the possibility of introducing a new “permanent” generation, which would hold objects that we know can't possibly be collected, and that would hopefully allow major marks to be much smaller.
We're also looking at ways to slow down the young objects’ aging process. We believe that some larger API requests tend to trigger more than three minor marks, causing many objects that should be ephemeral to be promoted to the old generation. We thought of introducing an “aging factor”, so that not all minor marks increment the marked objects age, or alternatively, to allow to entirely disable promotion to the old generation after the process has warmed up enough.
We'd also like to expose the GC trigger reasons directly in GC.stat
so that it's easier to instrument.
Maybe some of this will make it to Ruby 3.3, who knows.
Jean Boussier is a Rails Core team member, Ruby committer, and Senior Staff Engineer on Shopify's Ruby and Rails Infrastructure team. You can find him on GitHub as @byroot or on Twitter at @_byroot.
Open source software plays a vital and integral part at Shopify. If being a part of an Engineering organization that’s committed to the support and stewardship of open source software sounds exciting to you, visit our Engineering career page to find out about our open positions and learn about Digital by Design.