Scaling productivity on microservices at Lyft (Part 4): Gating Deploys with Automated Acceptance Tests

Ben Radler · Published in Lyft Engineering · 10 min read · Feb 2, 2022


Authors: Ben Radler, Garrett Heel, Shu Zheng

This is the fourth and final post in the series on how we scaled our development practices at Lyft in the face of an ever-increasing number of developers and services.

In our previous post, we described how we leverage context propagation to allow multiple engineers to test end-to-end (E2E) in a shared staging environment. Now we’ll look at the other side of the coin — automated E2E testing — and how we built a scalable solution that gives engineers confidence before deploying to production.

Rethinking End-To-End Tests

Part 1 of this series walked through many of the challenges we’ve encountered with running integration tests in CI. An explosion in the number of services and engineers caused the remote development environments (Onebox) running tests to become difficult to scale and time-consuming to spin up. The integration tests belonging to each service had also grown unwieldy, taking upwards of an hour to complete with an extremely poor signal-to-noise ratio. Test failures couldn’t be trusted, so they were often ignored, or–even worse–resulted in further hours wasted debugging.

Among the thousands of integration tests spanning 900 services lay a small set of truly valuable E2E scenarios that we felt were crucial to maintain: scenarios like a user being able to log in, request a ride, and pay for their ride. Issues in these scenarios are called SEV0s internally (the highest severity of incident). They prevent our passengers from getting where they need to go and drivers from earning an income; they must be avoided at all costs.

Enter Acceptance Tests

When looking at the highest value E2E tests that we wanted to keep, we didn’t have to squint too hard to see that they looked a lot like acceptance tests. That is, tests which describe how a user interacts with the Lyft platform without knowledge of internal implementation details.

With this idea in mind, we decided to move from a distributed model–where each service defines its own set of integration tests–to a small centralized collection of acceptance tests. The advantages of this were twofold. At a technical level, bringing scenarios under one roof allowed us to eliminate duplication and share test code between relevant services. Organizationally, we now have a single-threaded owner responsible for coordinating the overall health of these tests (which are still written and modified by a diverse set of people) and are better positioned to put guardrails in place that keep them from growing out of control.

Another key decision was when these tests would run. One of the habits we wanted to change was running E2E tests as part of the “inner” development loop (described in Part 2), where they would be run many times during development in place of strategies like unit tests or invoking a specific service endpoint. Instead, we wanted to make CI lightning fast and encourage folks to be more comfortable deferring E2E tests (which should break much less frequently) to later in the process. For these reasons we chose to run acceptance tests in the staging environment after each deploy and to gate deploys to production on their success.

Building the framework

Engine

First and foremost, we needed an engine to provide a simple interface for exercising Lyft’s API in the same way a real user would. Fortunately for us, we’d already built something similar to generate traffic in our staging and production environments (see Staging environment in Part 1). This library is composed of a few key concepts (sketched in code after the list below):

  • Actions: Interactions with the Lyft API, e.g., the RequestRide Action would call Lyft’s API with the desired origin and destination to begin searching for a driver.
  • Behaviors: The brain that makes probabilistic decisions about what to do next, e.g., given that I just requested a ride, should I cancel it or continue waiting for the driver?
  • Clients: Represent a device used to interact with our platform, usually a mobile phone running the Lyft app. Used for storing state and coordinating actions/behaviors.
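
To make these ideas concrete, here is a minimal sketch of how actions, behaviors, and clients might fit together. All names and details are purely illustrative; the actual library is internal and certainly differs.

```python
import random
from abc import ABC, abstractmethod
from typing import Optional


class Client:
    """Represents a device interacting with the platform (e.g. a phone running
    the Lyft app); stores state and coordinates actions/behaviors."""

    def __init__(self) -> None:
        self.state: dict = {}


class Action(ABC):
    """A single interaction with the Lyft API."""

    @abstractmethod
    def execute(self, client: Client) -> None: ...


class RequestRide(Action):
    def __init__(self, origin: str, destination: str) -> None:
        self.origin, self.destination = origin, destination

    def execute(self, client: Client) -> None:
        # The real library would call the public API here; this sketch just
        # records the resulting state on the client.
        client.state["ride"] = {"origin": self.origin, "destination": self.destination}


class CancelRide(Action):
    def execute(self, client: Client) -> None:
        client.state.pop("ride", None)


class Behavior:
    """The 'brain': makes probabilistic decisions about what to do next."""

    def next_action(self, client: Client) -> Optional[Action]:
        if "ride" not in client.state:
            return RequestRide(origin="home", destination="work")
        # Having just requested a ride: occasionally cancel, otherwise keep waiting.
        return CancelRide() if random.random() < 0.1 else None
```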

The combination of these three simple ideas has been the basis of our staging and production automated testing strategy for the past 5 years and has served us well. However, one significant gap remained for reusing it in acceptance tests–the probabilistic nature of behaviors. When building behaviors with load testing in mind, we wrote them as something closer to a fuzzer, since randomness is a useful property for shaking out emergent bugs. That is not suited to acceptance testing, where we want to test very specific flows deterministically.

To bridge this gap, we updated the library to allow clients to act on a sequence of steps as an alternative to having a behavior (see the sketch after the list below). A step could be any of the following:

  • Actions: As above, these simply execute an API call.
  • Conditions: Block execution of the next step until some expression is true (with an optional timeout), e.g., a driver might wait until they arrive at the origin to notify Lyft via the PickedUp action that they’ve picked up the passenger.
  • Assertions: Ensure the client state looks as expected, e.g., we want to check that a price quote exists before requesting a ride.
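
Continuing the sketch above (and reusing its hypothetical Client and Action types), conditions and assertions might look something like the following, with a client simply executing each step in order:

```python
import time
from typing import Callable, List, Union


class Condition:
    """Blocks until a predicate over client state is true, with an optional timeout."""

    def __init__(self, predicate: Callable[[Client], bool], timeout_s: float = 120.0):
        self.predicate, self.timeout_s = predicate, timeout_s

    def execute(self, client: Client) -> None:
        deadline = time.monotonic() + self.timeout_s
        while not self.predicate(client):
            if time.monotonic() > deadline:
                raise TimeoutError("condition was not met before the timeout")
            time.sleep(1.0)


class Assertion:
    """Ensures the client state looks as expected before continuing."""

    def __init__(self, predicate: Callable[[Client], bool], message: str):
        self.predicate, self.message = predicate, message

    def execute(self, client: Client) -> None:
        assert self.predicate(client), self.message


Step = Union[Action, Condition, Assertion]


def run_steps(client: Client, steps: List[Step]) -> None:
    """Run a deterministic sequence of steps instead of a probabilistic behavior."""
    for step in steps:
        step.execute(client)
```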

Defining tests

With the building blocks of actions, conditions, and assertions in place, we next needed to decide on a format for defining tests in their new centralized home. Previously, our integration tests had been written in code; for acceptance tests, however, we decided to store them in a custom configuration syntax. While not without trade-offs, we’ve found that defining tests this way serves as a forcing function to keep them simple and consistent–broadening the set of people who can read and write them and leading to better-maintained tests. The limited interface exposed in configuration pushes most logic into the aforementioned library, where it can be shared with other acceptance tests or the load test runner.

Putting it all together, see the following example of a test scenario:
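
The actual configuration syntax is Lyft-internal, so the YAML-flavored sketch below is purely hypothetical, with invented action and field names, but it gives a feel for the shape of a scenario:

```yaml
# Hypothetical sketch only -- step and action names are invented for illustration.
name: passenger_completes_standard_ride
clients: [passenger, driver]
steps:
  - client: passenger
    action: RequestRideQuote
  - client: passenger
    assert: quote_exists            # assertion: a price quote must exist
  - client: passenger
    action: RequestRide
  - client: driver
    action: AcceptRide
  - client: driver
    wait_until: arrived_at_origin   # condition, with an optional timeout
    timeout_seconds: 120
  - client: driver
    action: PickedUp
  - client: driver
    action: DropOff
  - client: passenger
    assert: ride_charged
```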

It’s worth noting that there are existing testing frameworks, such as Cucumber/Gherkin, that might have served us better if we were starting from scratch. In our case it was much easier to extend our existing traffic generation tooling than try to bridge those technologies.

Gating Deploys

A large part of the developer productivity gains we sought from changing our E2E test approach came from changing when these tests run: from pre-merge in PRs to post-merge during deploys. A typical PR sees around 4 commits on average (each commit running the test suite), while we typically deploy one or two PRs at a time, so flaky tests block developers almost 10x less frequently. Furthermore, we anticipated that the lack of E2E feedback in PRs would remove an often false sense of security and drive more investment in unit tests and safer rollout strategies such as feature flags (which can now be overridden on a per-request basis via Configuration Overrides, discussed in Part 3).

To implement this, we extended our in-house deploy system’s concept of a deploy gate. Deploy stages can be blocked by one or more gates, each representing a condition that must be true before a deploy is allowed to progress to the next stage. An example gate included on every staging deploy is one we call the bake time: it ensures that a specific amount of time has elapsed, giving the persistent simulated traffic a chance to trip alarms.

A deploy gate is added for each acceptance test a service lists as a dependency. This gate kicks off the test runs as soon as the staging deploy finishes and reports back on their success or failure. So as not to slow down developers, acceptance tests are expected to complete within our default bake time of 10 minutes.
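
As a rough sketch of the gate model (all names below are invented for illustration and don’t reflect the deploy system’s actual API), a gate boils down to a predicate that the pipeline polls before promoting a build from staging to production:

```python
import time
from abc import ABC, abstractmethod
from typing import Dict, List


class DeployGate(ABC):
    """A condition that must be true before a deploy may progress to the next stage."""

    @abstractmethod
    def is_open(self) -> bool: ...


class BakeTimeGate(DeployGate):
    """Opens once enough time has elapsed since the staging deploy, giving the
    persistent simulated traffic a chance to trip alarms."""

    def __init__(self, deployed_at: float, bake_seconds: float = 600.0):
        self.deployed_at, self.bake_seconds = deployed_at, bake_seconds

    def is_open(self) -> bool:
        return time.time() - self.deployed_at >= self.bake_seconds


class AcceptanceTestGate(DeployGate):
    """Opens once every acceptance test the service depends on has passed in staging."""

    def __init__(self, results_by_scenario: Dict[str, bool]):
        # In practice these results would be fetched from the test runner.
        self.results_by_scenario = results_by_scenario

    def is_open(self) -> bool:
        return bool(self.results_by_scenario) and all(self.results_by_scenario.values())


def can_promote_to_production(gates: List[DeployGate]) -> bool:
    # The deploy is blocked until every gate reports success.
    return all(gate.is_open() for gate in gates)
```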

Putting it into practice

What should be a test?

Arguably one of the hardest parts of transitioning to acceptance tests was determining the rules for what constitutes an acceptance test and applying those across the enormous collection of integration tests. After sifting through hundreds of integration tests and discussing with service owners, we settled on the following criteria:

  • Acceptance tests should represent end-to-end user flows. They should describe an interaction with the Lyft platform from the perspective of our users, not internal implementation details.
  • Acceptance tests must be critical to the business. We explicitly don’t want to test every scenario. As the most expensive tests in our suite, we can’t afford to cover edge cases or flows that wouldn’t cause substantial damage (i.e., a SEV0) to the business if they broke for a short amount of time.

While some of these criteria might seem obvious from a quick glance at the Test Pyramid, applying them was still more difficult than anticipated. Developers were uncomfortable with the prospect of deleting their integration tests, regardless of whether a given test was well understood or had ever caught a bug. To ease the transition, we partnered with teams to go through the tedious exercise of analyzing every single test against the above criteria. About 95% of the tests were either redundant or could be rewritten as unit tests with mocks. The few remaining tests were de-duplicated and combined to form a set of ~40 acceptance test scenarios that would replace all integration tests.

Results

Around 6 months has elapsed since we replaced integration tests in CI with acceptance testing in staging. The number of scenarios has remained relatively stable, we’ve expanded coverage to our transit and bike & scooter products, and we currently run a few thousand tests each week. The major benefits we’ve seen are:

  • Most PRs pass unit tests and are ready to merge in less than 10 minutes (down from 30 minutes with E2E tests included).
  • A few thousand integration tests have been removed from services. A huge amount of time maintaining and debugging these tests has been eliminated.
  • Acceptance tests are faster and more reliable to iterate on, taking less than a minute to prepare a local environment that targets staging (using the local development workflow previously mentioned; down from a 1-hour initial setup time on Onebox).
  • Since removing E2E tests from PRs there has been no measurable increase in the number of bugs reaching production.
  • Acceptance tests are catching issues weekly before they make it into production.
  • We haven’t yet seen as much additional investment in unit testing as we had hoped. This requires further investigation to understand why and whether there’s more we can/should do to change that.

Future work

We’re thrilled with the productivity improvements we’ve seen from these changes so far, but there are still plenty of areas we’re looking to improve in the future.

Staging isolation

Currently we run tests immediately after deploying a change to the staging environment, which has the potential to disrupt other staging users when a problem occurs. We’d like to look at building on top of the staging overrides work discussed in the third post of this series to run acceptance tests against the new version of the service before exposing it to other users. This would add some additional latency to deployments, so we’d need to evaluate whether the benefits outweigh that cost.

Coverage

Given that the primary goal behind these tests is to improve our reliability, we’d like to do more to directly improve that rather than just maintain it. We know that gaps exist in the tests we have today, which were built from previous integration tests and talking to service owners about which flows are important to cover. To close these gaps and improve reliability, we need to ensure that all of the most-used API calls that our real iOS and Android clients make are represented in these tests. One idea we have here is to do more analysis on the delta between the real traffic flowing through our system and simulated traffic, possibly via further investment in our distributed tracing tooling.

Test scenario health

Initially the platform team manually curated each acceptance test and kept an eye on flakiness. As we continue to onboard more lines of business, we’d like to have more granular observability into each action (API call) so we can route failures to the appropriate team automatically. This doesn’t mean diffuse ownership — we think it’s important to retain a central owner for the broader set of tests and platform — but it will alert product teams to failures in their services more quickly and remove manual effort.

Zooming Out

Over this series we’ve taken a closer look at how Lyft’s development and testing approach has evolved over the years in our quest to continually improve developer productivity, starting with some history (post 1). We’ve covered our shift towards local-first development (post 2), isolated testing of services in staging with Envoy overrides (post 3), and replacing heavier integration tests in PRs with a small set of acceptance tests during deploys (this post).

Although this approach might not work for every set of circumstances, it has been largely successful for us in shortening the feedback loop for developers as they’re writing code and drastically simplifying the infrastructure powering our test environments. We’d love to hear from others facing similar challenges to those we’ve discussed; feel free to drop us a line in the comments!

If you’re interested in working on developer productivity problems like these, take a look at our careers page.

Special thanks to the following people who helped create this post: Brian Balser, Daniel Metz, Jake Kaufman, Susan Chan, Thompson Marzagão, Tony Li
