Adventures in Garbage Collection: Improving GC Performance in our Massive Monolith
At the beginning of this year, we ran several experiments aimed at reducing the latency impact of the Ruby garbage collector (GC) in Shopify's monolith. Everything described in this article was a team effort. Huge thanks to Jason Hiltz-Laforge for his extensive knowledge of our production platform, to Peter Zhu for his deep knowledge of the Ruby garbage collector, and to Matt Valentine-House, Étienne Barrié, and Shane Pope for their contributions to this work.
In this article we'll talk about the changes we made to improve GC performance, and more importantly, how we got to these changes.
The work consisted of several rounds of improving logging and metrics, interpreting those to form a hypothesis around a change which would be beneficial, testing and shipping that change, and then evaluating whether it should be kept, tweaked, or reverted.
While the narrative in this article might make it seem like we went straight from problem to result, there were several dead ends, incorrect assumptions, and experiments which did not produce the expected results, all as you would expect from an optimization exercise on a dynamic and complex system.
Lessons From Linguistics: i18n Best Practices for Front-End Developers
Here are a few internationalization (i18n) best practices to help front-end developers avoid errors and serve more robust text content on multilingual websites and apps.
Unlocking Real-time Predictions with Shopify's Machine Learning Platform
Learn how Shopify Data built new online inference capabilities into its Machine Learning Platform to deploy and serve models for real-time prediction at scale.
Improving the Developer Experience with the Ruby LSP
Ruby has an explicit goal to make developers happy. Historically, working towards that goal mostly meant having rich syntax and being an expressive programming language—allowing developers to focus on business logic rather than appeasing the language’s rules.
Today, tooling has become a key part of this goal. Many modern languages, such as TypeScript and Rust, have rich and robust tooling to enhance the programming experience. That’s why we built the Ruby LSP, a new language server that makes coding in Ruby even better by providing extra Ruby features for any editor that has a client layer for the LSP. In this article, we’ll cover how we built the Ruby LSP, the features included within it, and how you can install it.
The Case Against Monkey Patching, From a Rails Core Team Member
Monkey patching is considered one of the more powerful features of the Ruby programming language. However, by the end of this post I’m hoping to convince you that monkey patches should be used sparingly, if at all, because they are brittle, dangerous, and often unnecessary. I’ll also share tips on how to use them as safely as possible in the rare cases where you do need to monkey patch.
The 25 Percent Rule for Tackling Technical Debt
Let’s talk about technical debt. Let’s talk about practical, usable approaches for actually paying it down on a daily, weekly, monthly, and yearly basis. Let’s talk about what debt needs to be fixed now versus what can wait for better planning.
The Complex Data Models Behind Shopify's Tax Insights Feature
The intensive data work behind Shopify's Tax Insights feature required building functionality to handle dynamically changing information.
Performance Testing At Scale—for BFCM and Beyond
Let’s unpack our approach to BFCM Scale Testing to explore some of what it takes to ensure that our ecommerce platform can handle the busiest weekend of the year.
Making Your React Native Gestures Feel Natural
When working with draggable elements in React Native mobile apps, there are some simple ways to help gestures and animations feel better and more natural.
Monte Carlo Simulations: Separating Signal from Noise in Sampled Success Metrics
Check out this guide for using a Monte Carlo simulation to identify the size and confidence percentage of your sampled success metric.
Reliving Your Happiest HTTP Interactions with Ruby’s VCR Gem
VCR is a Ruby library that records HTTP interactions and plays them back to your test suite, verifying input and returning predictable output.
In Ruby apps it's most frequently used as a testing tool, but having it in your toolbox provides you with a rich set of organizational and debugging tools, even if you choose not to use its popular “automocking” feature.
Migrating our Largest Mobile App to React Native
In 2020, we announced that React Native is the future of mobile at Shopify and since then we’ve been migrating all our native mobile apps to React Native. Since each app is different, there is no single approach that works for all of them. So, we evaluated all the possible options for each app and chose the ones that best suit their needs.
Shopify Point of Sale, for instance, had come a long way since it was first built during an internal hackathon. It was originally designed and built to support small mom-and-pop stores or weekend warriors. However, it has surged in popularity and is being used by some of our biggest merchants and is processing transactions worth billions of dollars each year. The codebase had accumulated a lot of tech debt and the app’s UX was also not serving the needs of large merchants who have hundreds of locations and tens of thousands of products. After a thorough evaluation, it became clear that we couldn’t fix these issues with incremental changes. Hence, we decided to do a full rewrite, which has been a big hit with our merchants.
Shopify Mobile, our flagship mobile app, on the other hand, is quite stable and meets our merchants’ needs. It is also our largest app at 300 screens per platform and took over six years to build. Rebuilding it from scratch would be a massive undertaking. Even if we assume that we’d be twice as productive with RN (which is not necessarily always the case), it would take us at least three years to rebuild. That’s a very long time. We would have to halt all new feature development during this time and in the end have the exact same app as we started with. A rewrite, then, was clearly out of the question.
Optimizing Ruby’s Memory Layout: Variable Width Allocation
Shopify is improving CRuby’s performance in Ruby 3.2 by optimizing the memory layout in the garbage collector through the Variable Width Allocation project.
Automatically Rotating GitHub Tokens (So You Don’t Have To)
GitHub personal access tokens (PATs) are like a key: a very, very large key that opens a very, very wide door. A long-lived token that has all the access of a developer’s account won’t just cause a leak; it’ll cause a flood. GitHub’s built-in token is useful, but has limitations of its own: it can’t access repo-external resources and it won’t trigger downstream actions (by design). Given the limitations of these two blessed authentication paths, what do you do when neither works for your use case? We encountered this problem in some of our workflows, and solved it by building a system to rotate tokens automatically. Here’s how we did it, and how you can use it too.
3 (More) Tips for Optimizing Apache Flink Applications
Earlier this year, we shared our tips for optimizing large stateful Apache Flink applications. Below we’ll walk you through 3 more best practices.
Planning in Bets: Risk Mitigation at Scale
What do you do with a finite amount of time to deal with an infinite number of things that can go wrong?
We prepare for Black Friday Cyber Monday (BFCM) and other high-traffic events during the year to make sure the Shopify platform stays up so our merchants can sell to their buyers. To do this, we built a large-scale infrastructure platform that is highly complex, interconnected, and globally distributed, and that requires thoughtful technology investments from a network of teams. We’re changing how the internet works, and at our scale no single person can oversee the full design and detail.
This year over BFCM, we served 75.98M requests per minute to our commerce platform at peak. That’s 1.27M requests per second. Working at this massive scale in a complex and interdependent system, it would be impossible to identify and mitigate every possible risk. This post breaks down a high-level risk mitigation process into four questions that can be applied to nearly any scenario to help you make the best use of the time and resources available to you.