At the beginning of this year, we ran several experiments aimed at reducing the latency impact of the Ruby garbage collector (GC) in Shopify's monolith. Everything described in this article was a team effort, huge thanks to Jason Hiltz-Laforge for his extended knowledge of our production platform, to Peter Zhu for his deep knowledge of the Ruby garbage collector, as well as to Matt Valentine-House, Étienne Barrié, and Shane Pope for their contributions to this work.
In this article we'll talk about the changes we made to improve GC performance, and more importantly, how we got to these changes.
The work consisted of several rounds of improving logging and metrics, interpreting those to form a hypothesis around a change which would be beneficial, testing and shipping that change, and then evaluating whether it should be kept, tweaked, or reverted.
While the narrative in this article might make it seem like we went straight from problem to result, there were several dead ends, incorrect assumptions, and experiments which did not produce the expected results. All as you would expect from an optimization exercise on a dynamic and complex system.