What do you do with a finite amount of time to deal with an infinite number of things that can go wrong?
We prepare for Black Friday Cyber Monday (BFCM) and other high-traffic events during the year to make sure the Shopify platform stays up so our merchants can sell to their buyers. To do this, we built an infrastructure platform at a large scale that is highly complex, interconnected, globally distributed, requiring thoughtful technology investments from a network of teams. We’re changing how the internet works, where no single person can oversee the full design and detail at our scale.
This year over BFCM, we served 75.98M requests per minute to our commerce platform at peak. That’s 1.27M requests per second. Working at this massive scale in a complex and interdependent system, it would be impossible to identify and mitigate every possible risk. This post breaks down a high-level risk mitigation process into four questions that can be applied to nearly any scenario in order to help you make the best use of your time and resources available.