SimulatedRides: How Lyft uses load testing to ensure reliable service during peak events

Authors: Remco van Bree, Ben Radler

Contributors: Alex Ilyenko, Ben Radler, Francisco Souza, Garrett Heel, Nathan Hsieh, Remco van Bree, Shu Zheng, Alex Hartwell, Brian Witt

“Load testing in production is great.”

We know what you’re thinking — testing in production is one of the cardinal sins of software development. However, at Lyft we have come to realize that load testing in production is a powerful tool to prepare systems for unexpected bursty traffic and peak events. We’ll explore why Lyft needed a custom performance testing framework that worked in production, how we built a cross-functional solution, and how we’ve continued to improve this testing platform since its launch in 2016.

What exactly do we mean by “Load Testing”? In the context of this article we mean any tool that creates traffic to stress test systems and see how they perform at the limits of their capacity.

Lyft must operate seamlessly even when demand spikes

It’s imperative for Lyft to be highly available during peak events like Halloween and New Year’s Eve. Riders depend on us for their travel needs, and drivers depend on us to be able to make a living, especially on the busiest days of the year. Historically, peak events posed a huge challenge for Lyft: we would experience unprecedented traffic, often shaped in unusual patterns compared to a normal work week. Furthermore, due to the tremendous speed at which Lyft was growing, traffic was often hitting new services that didn’t exist during the previous peak event.

We initially considered a variety of load-testing approaches before adopting simulation

Historically, one of our biggest bottlenecks during these peak events was write-operations to data stores. We initially looked to open source and industry standard tools to help us solve these scaling issues.

Record and playback tools are not suitable for probabilistic testing

One of the industry standard ways of load testing is by using record and playback tools like Gatling, K6, and Bees With Machine Guns. With these tools you can record a session, do some light scripting to deal with repetitive scenarios like logging in, and then replay this network traffic from the pre-recorded sessions.

This is a great solution for services that behave deterministically, such as an e-commerce site where a customer can generally always order more of a product.

A two-sided marketplace like Lyft is a lot trickier: requesting a ride only works if a driver is actually present, the driver must be matched to the rider and be located close enough to start the ride, and so on. On top of that, many of Lyft’s systems are not deterministic; for instance, there is no guarantee that a particular driver and rider will be matched even if they are located close to one another.

Capturing and replaying production traffic is clunky, slow, and incomplete

In 2016, Lyft was pushing its primary database cluster past its QPS and connection limits. Our initial load testing approach went down the route you might expect: we began recording database traffic and replaying it against a replica database that didn’t service real users. This allowed us to see what happened if we increased the queries per second or number of connections even further.

This approach involved custom code, and we quickly ran into several issues: creating and restoring huge database backups was difficult and time-consuming (it actually required a support ticket with our vendor each time). Furthermore, if something went wrong, we had limited visibility into why; for example, had we even recorded the right traffic? The feedback loop was also very long, since any error during a test meant starting the load test over from the beginning.

In addition, concerns were raised about further development of the record/replay pattern. Given the stateful nature of Lyft’s services, replayed traffic could cause issues such as customers being double-billed, or could be caught by replay protections and thereby skip the very code paths that cause degradation in the first place.

We considered playing back production traffic against staging, but given the cost of scaling staging to match production during these tests and due to the highly stateful and probabilistic nature of these systems, this approach didn’t seem to offer much benefit. In addition, staging environments often contain lots of trash or test data, meaning testing in this environment can result in false positives or negatives.

Given the lack of success with this record/replay approach, plus the mounting concern that we were missing out on testing all of the interesting parts above the database (such as our Envoy proxy and service mesh), we moved down the simulation path. This let us loosen our focus on data stores and instead build a product that would help identify the cascading failures that can occur when one or more services are degraded or experience unusually high load.

We needed a performance testing framework that went beyond historical traffic playback

Architecture diagram: the SimulatedRides API reads and writes a Simulation table; its two consumers are the Loadtestbot service and the SimulatedRides Web UI. Multiple workers (each containing multiple Clients) read the Simulation table and communicate with a resource management service to obtain test resources.

SimulatedRides Architecture

We shifted our thinking towards a system where we would dictate some setup or configuration for a scenario we wanted to test (what we refer to as a “simulation”), and service owners would measure the resulting emergent behavior of this simulation. That is to say, the exhaust of these simulations could provide signals in our various observability systems such as metrics, alarms, logs, and so on. Think of this like an “automated QA engineer” that attempts all combinations of flows and inputs in a piece of software, searching for edge cases where combinations of flows do not perform as expected.

The SimulatedRides service is an orchestration service. It manages a fleet of Clients and simulates the interactions a real user might perform in the Lyft app. We offer an interface to SimulatedRides such that it can be treated as a platform by service owners (other teams) at Lyft. These teams can choose to contribute “coverage” with just a handful of lines of Python code.

There are four key concepts to understand how SimulatedRides works: Simulations, Clients, Actions, and Behaviors

Each Simulation is defined by a single entry in a DynamoDB table. This configuration dictates how the simulation will work, including details such as the number of riders and drivers, the rate at which we want to add these clients to and remove them from the Lyft systems, the geographical area we want them spawned in, and more. A simulation configuration roughly looks like this:

{
    "name": "chicago",
    "client_configurations": {
        "region": "chicago",
        "rider_close_app_after_price_check_percent": 1,
        "rider_cancel_after_accepting_ride_percent": 10,
        "driver_cancel_after_accepting_ride_percent": 5
    },
    "client_composition": [
        {
            "client_type": "rider",
            "number": 50,
            "behaviors": {
                "shared_ride": 25,
                "standard_ride": 65,
                "luxury_ride": 5,
                "luxury_ride_suv": 5
            }
        },
        {
            "client_type": "driver",
            "number": 50,
            "behaviors": {
                "standard_ride": 100
            }
        }
    ]
}

Our simulations are composed from three key abstractions that we named Clients, Actions, and Behaviors.

State diagram: Behaviors read Client state and determine which Actions Clients should perform; the Client then performs those Actions, and some Actions write new state back to the Client.

How Clients, Actions and Behaviors interact

Clients: physical devices in the Lyft service mesh, such as a mobile app or electric bicycle

Clients are a programmatic representation of a physical device. We have clients for the rider and driver iOS/Android apps, as well as for bikes and scooters, which have on-board IoT hardware that communicates with our servers. Like our applications, clients are stateful, and they are the only place where we write state in our simulations.

We require that all requests made by clients use public endpoints so as to mimic the native mobile applications as closely as possible. By requiring our simulations to not use any backdoors, we ensure that they are realistic and leave no key endpoints on our golden path untested.
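To make this concrete, here is a minimal sketch of what a rider Client might look like. The class, endpoint, and field names are hypothetical rather than Lyft’s actual code; the point is that a Client holds all simulation state and only ever talks to public endpoints.

import requests

PUBLIC_API = "https://api.example.com"  # hypothetical public API base URL

class RiderClient:
    """Minimal sketch of a Client: a programmatic stand-in for a rider's phone."""

    def __init__(self, user_id: str, auth_token: str, region: str):
        self.user_id = user_id
        self.region = region
        self.session = requests.Session()
        self.session.headers["Authorization"] = f"Bearer {auth_token}"
        # Clients are stateful and are the only place simulation state lives.
        self.state = {"current_ride": None, "last_price_quote": None}

    def call_public_endpoint(self, method: str, path: str, **kwargs):
        # All traffic goes through public endpoints to mimic the native apps;
        # no internal backdoors are allowed.
        response = self.session.request(method, f"{PUBLIC_API}{path}", **kwargs)
        response.raise_for_status()
        return response.json()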

Behaviors: users interacting with devices — a probabilistic decision tree

Behaviors are probabilistic decision trees which can be thought of as a programmatic representation of the user physically using the Client (mobile app or device). This abstraction is where the power of permutation comes in: it allows us to uncover the class of unexpected use cases in Lyft services that have previously caused egregious and hard-to-uncover incidents. As an example, we can configure a 5% chance that a driver cancels a ride after being matched with a rider, or that a rider logs on only to check ride prices and subsequently closes the app 50% of the time. With each new possibility introduced, the number of possible scenarios covered grows exponentially.

Behaviors inspect the state written to a Client via an Action and probabilistically choose what to do next. For example, a rider who just opened the Lyft app could choose to change settings, check ride prices, request a ride, log out, and so on. Any decision made here would open up a new set of possible choices to be made, thus revealing the tree-like structure of a Behavior.

It may be easier to think of Behaviors as the “scenarios” that a Client exercises in a simulation. These Behaviors do not follow a small set of fixed end-to-end scenarios, but instead branch probabilistically between flows. As a result, we can influence the odds of a certain outcome via our simulation configurations.
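A minimal sketch of how such a probabilistic decision might look, assuming a Client object like the one above and weights derived from the simulation configuration (all names and probabilities here are illustrative):

import random

# Illustrative weights, mirroring the percentages in the simulation config.
RIDER_WEIGHTS = {
    "close_app_after_price_check": 0.01,
    "cancel_after_accepting_ride": 0.10,
}

def choose_next_action(client) -> str:
    """Probabilistically pick the rider's next Action based on Client state."""
    if client.state["current_ride"] is None:
        # Not in a ride yet: sometimes just check prices and close the app.
        if random.random() < RIDER_WEIGHTS["close_app_after_price_check"]:
            return "CheckPricesAndCloseApp"
        return "RequestRide"
    # Already matched with a driver: occasionally cancel, otherwise keep polling.
    if random.random() < RIDER_WEIGHTS["cancel_after_accepting_ride"]:
        return "CancelRide"
    return "PollRideStatus"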

Due to the complexity of this system, maintaining the realism of these flows and network calls relies on service owners keeping integrations up to date. This maintenance is both a positive and a negative — it requires manual work, but it also ensures service owners are being proactive in testing their services and features. It has allowed Lyft to shift engineering culture away from reactive treatment of incidents and instead be overwhelmingly proactive. This shift did not occur overnight and required a coordinated effort of documentation, communication, and education.

Actions: the outcome of a single act a user may take while interacting with a device — almost always a network request

Actions are the common place where engineers at Lyft contribute coverage for their services. Clients perform Actions. Most Actions are a single network call, such as “RequestRide” or “AcceptRide”. This means that by writing just a few lines of code in an Action, a service owner can immediately start sending synthetic traffic to any endpoint in their services.

Actions are also deeply configurable. Through a simple configuration block, an engineer dictates when an Action is allowed to be invoked. For example, we can define that a Client must not be in a ride for the “RequestRide” Action to be invoked, or state that there is a 5% probability that any time the “AddCouponForRide” Action is invoked we actually execute the business logic contained therein. This configurability helps us to roll out synthetic traffic to a new endpoint slowly and deliberately and fine-tune traffic patterns.

Actions are the only abstraction in the system that can write state to a Client — for example, an Action may poll an endpoint to keep the current ride state in sync with the server, and each time it polls it writes that new ride information to the Client.
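As a rough sketch of the shape of an Action, again with hypothetical names, paths, and configuration keys, a “RequestRide” Action might boil down to a single public network call plus a state write:

class RequestRideAction:
    """Hypothetical Action: one public network call that updates Client state."""

    # Illustrative gating config: only invokable when the Client is not already
    # in a ride, and executed every time it is chosen.
    config = {"requires_no_active_ride": True, "invoke_probability": 1.0}

    def perform(self, client) -> None:
        # A single request to a public endpoint (the path is made up).
        ride = client.call_public_endpoint(
            "POST", "/v1/rides", json={"region": client.region}
        )
        # Actions are the only abstraction allowed to write state to a Client.
        client.state["current_ride"] = ride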

An additional Resource Manager service exposes an interface for using test assets in simulations

As SimulatedRides adoption increased and its value became apparent, we encountered issues with test users. Creating and destroying objects like users, bikes, and scooters is costly in time and causes cascading effects in downstream services, so SimulatedRides intentionally does not create and destroy the objects that participate in a simulation. To ensure these test users were reset to a consistent baseline, we authored a separate service known as the ResourceManager. This system solves a hard test-data management problem by managing a fixed pool of “resources”, a generalized name for the test assets we need for simulations, such as users, bicycles, scooters, etc.

ResourceManager exposes an interface to acquire, lease, and restore the health of resources (users, bikes, scooters). Any service that needs test assets, such as SimulatedRides simulations or Acceptance Tests, can take out a lease on a test asset for a given duration. Once that test asset is no longer needed or its lease expires, the ResourceManager runs the resource through a “Health Restoration Platform”. This is a simple system which runs a set of checks and fixes to ensure that those resources are reset to a known-good state. For example, it may remove users from orphaned rides, ensure valid credit cards are on file, and approve users to drive in a particular geographical area.

The “Health Restoration Platform” is also treated as a platform interface where service owners can contribute new checks and fixes in the event their simulation mutates the state of underlying objects.
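As a sketch of how a consumer might use such an interface (the acquire and release method names are assumptions, not the real ResourceManager API), a lease can be wrapped in a context manager so the resource is always returned for health restoration:

from contextlib import contextmanager

@contextmanager
def leased_resource(resource_manager, resource_type: str, duration_s: int):
    """Acquire a test asset for a fixed duration and always return it afterwards."""
    lease = resource_manager.acquire(resource_type=resource_type, duration_s=duration_s)
    try:
        yield lease.resource
    finally:
        # Returning the lease sends the resource through health restoration,
        # resetting it to a known-good state for the next simulation or test.
        resource_manager.release(lease)

# Usage (hypothetical): with leased_resource(rm, "rider_user", 3600) as user: ...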

The service intelligently scales to meet the demand of large simulations

The architecture of this system is interesting and quite different from the Flask/Gunicorn approach that much of Lyft employs and that you might be familiar with from more conventional microservices. Instead, SimulatedRides makes extensive use of Python’s asyncio library via a worker and event loop model to leverage concurrency. This allows the service to scale to the large traffic volumes required to run big simulations while making efficient use of compute power. Further, this design means our simulation workers dynamically divvy up work across all pods in the cluster in order to balance load; no single server running simulations is responsible for appreciably more or less workload than another.
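The worker model can be pictured roughly like this simplified sketch (not the actual SimulatedRides code): each Client runs as a coroutine on a shared event loop, so a single worker process can drive hundreds of Clients concurrently.

import asyncio
import random

async def run_client(client_id: int, iterations: int = 100) -> None:
    # Each simulated Client is a lightweight coroutine: decide on an Action,
    # perform it, then pause briefly before the next decision.
    for _ in range(iterations):
        await asyncio.sleep(random.uniform(0.5, 2.0))
        # ... choose and perform the next Action for this Client ...

async def worker(num_clients: int) -> None:
    # A single worker drives many Clients concurrently on one event loop,
    # so each pod can generate a large volume of realistic traffic.
    await asyncio.gather(*(run_client(i) for i in range(num_clients)))

if __name__ == "__main__":
    asyncio.run(worker(num_clients=500))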

We do offer a sibling product which makes use of some of the same abstractions and code in SimulatedRides — our deterministic test platform, known as the Acceptance Test Framework, can be read about in detail in our blog post here.

Lyft uses SimulatedRides to test the performance of production and staging environments

We conduct regular production load tests. In the run-up to peak events we shift the targets of these tests to match, and in most cases far exceed, the expected traffic patterns of the upcoming events. We have historically measured our load test targets in ride drop-offs per minute, though many other metrics factor into our forecasting as well. Our production load tests run on a schedule and are highly automated. Before kicking off a load test, our automation Slack bot posts a message with information about the expected duration, links to observe the test, and an “@” mention of the on-call load test conductor.

The load test conductor is a directly responsible individual who has the authority to stop a test in the event of an external incident. While the load tests are highly automated, we keep a human in the loop during load tests since a runaway load test could easily have a major business impact, and human practitioners are the adaptable element of complex systems.

There is an interface which allows fine-tuning of both the shape and size of the traffic patterns during these tests. This means we can run a load test that, for example, mimics Halloween 2019, or models the forecasted load for an upcoming event like Super Bowl Sunday. Or, we can spawn thousands of riders in a small area who all log on to check ride prices, simulating an event like thousands of people attempting to leave Times Square in New York City just after the stroke of midnight on New Year’s Eve.

Our load tests inject simulated users into the production Lyft systems, ramping up to the target load which closely matches the forecast shape of traffic expected for the upcoming peak event. Once we hit the target load we will maintain that load for some time, known internally as the “soak”, before ramping down again and ultimately removing all simulated users from Lyft.
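A simple way to picture the shape of such a test is a ramp/soak/ramp-down profile; the durations and peak client count below are made-up values for illustration, not our real targets:

def target_clients(t_minutes: float, ramp_up: float = 30.0, soak: float = 60.0,
                   ramp_down: float = 15.0, peak: int = 10_000) -> int:
    """Number of simulated users that should be active t_minutes into the test."""
    if t_minutes < ramp_up:
        # Ramp up linearly towards the forecast peak.
        return int(peak * t_minutes / ramp_up)
    if t_minutes < ramp_up + soak:
        # Hold the target load for the "soak" period.
        return peak
    if t_minutes < ramp_up + soak + ramp_down:
        # Ramp back down before removing all simulated users.
        remaining = ramp_up + soak + ramp_down - t_minutes
        return int(peak * remaining / ramp_down)
    return 0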

While load tests are running, on-call engineers throughout the organization will receive immediate feedback from our robust observability platforms. If the load test affects the health of their systems, this will be seen quickly via a business metric or SLO, and appropriate teams will respond to any degradation or unexpected behavior in a service just as they would during a real peak event. There’s an important distinction: unlike a real peak event, a load test can be immediately halted at the request of any individual in the company.

Without going into exact detail, we have strict systems in place to ensure that test data does not pollute our business metrics.

Continuous load in staging uncovers bugs from complex scenarios that would be hard to find manually

SimulatedRides provides incredible value to Lyft outside of production too. In staging environments, it provides 24/7/365 load to any services which opt into the platform. Since we use success rates to measure service health at Lyft, traffic is required to generate a success rate baseline. This means service owners who run simulations in staging can catch issues in a pre-production environment in a quick feedback loop before releasing their code to production. In practice this means code that is deployed to staging will quickly cause alarms to fire if there are bugs or bad configurations without any manual testing required.

The value of having a customizable and probabilistic solution outweighs the drawbacks

SimulatedRides is a powerful tool that has had a positive impact at Lyft over the years:

  • Alongside other Infrastructure teams, this product has helped lead a multi-year engineering culture shift away from reactive patterns and instead towards proactive “expect everything to fail” systems design.
  • Through TPM-led programs for peak events preparation, brown bags, updates to technical design document templates, documentation, and other educational initiatives, this system is widely available and has been adopted by hundreds of service owners across Lyft.
  • Service owners can quickly and easily add load to specific endpoints in their systems.
  • Synthetic traffic doesn’t need to have a perfect resemblance to real traffic. With good enough realism we can stress our production services and identify anomalies and bottlenecks.
  • Service owners regularly see how their service performs under stress and will be more prepared for times when there is real traffic that can’t be turned off with a button click.
  • Having constant traffic in staging allows us to know that code changes will likely not degrade or break real users’ experiences, without ever deploying them into production.
  • Without a need to run load tests on staging, our staging environment doesn’t need to be a perfect infrastructure copy of production (with multiple availability zones etc.), which saves us a lot on the cost of infrastructure.
  • The probabilistic nature of our simulations allows us to test emergent behaviors. This means that we do not just test a limited number of pre-defined scenarios, but also uncover uncommon scenarios without explicitly defining them in code.
  • When we test or deploy new infrastructure, we use simulated rides to ensure the new infrastructure can stand up to production traffic levels.

That said, there are also some significant drawbacks:

  • The learning curve of this tool is quite steep for engineers who work on the framework.
  • Because this tool is bespoke, we can’t hire people who already have experience working with it.
  • The integrations with services require upkeep and maintenance.

SimulatedRides will grow to become fully automated and offer more robust testing opportunities

Some future plans for SimulatedRides are:

  • More monitoring of business outcomes (number of rides, revenue, etc.) instead of just Action success rates and server errors. This will give developers more confidence that their changes don’t have a negative impact on business outcomes.
  • Allow developers to send a percentage of staging traffic through an unmerged feature branch or a locally running service, so that they can send real traffic to their service as they develop and debug a new feature.
  • Fully eliminating the need for a human conductor as part of the load test automation.

Lyft has future-proofed peak event performance

Load testing complex systems is a difficult problem. We have found the best fidelity in production. While there is some risk to load testing in this way, this risk can be mitigated with good tooling and guard rails. The SimulatedRides platform has been instrumental in giving us confidence that our systems can deal with hyper-growth and peak events, as there is no better place to test how your systems hold up under real load than production.

As always, Lyft is hiring! If you’re passionate about developing state-of-the-art systems, join our team.
