Scaling productivity on microservices at Lyft (Part 3): Extending our Envoy mesh with staging overrides

Matthew Grossman
Published in Lyft Engineering
11 min read · Dec 15, 2021


This is the third post in the series on how we scaled our development practices at Lyft in the face of an ever-increasing number of developers and services.

In our previous post, we described our laptop development workflow designed for fast iteration on local services. In this post, we’ll detail our solution for safe and isolated end-to-end (E2E) testing in staging, our shared pre-production environment. We’ll briefly recap the issues that led us to this form of testing before diving deeper into implementation details.

Previous Integration Environments

In Part 1 of this series we described Onebox, our previous tool for multi-service E2E testing. Onebox users would lease a large AWS EC2 VM to spin up 100+ services and validate that their modifications worked across service boundaries. This gave each developer a sandbox running their own version of Lyft, with control over each service’s version, database contents, and runtime configuration.

Each developer spins up and manages their own isolated Onebox

Unfortunately, Onebox ran into scaling problems as Lyft grew both its engineer and service counts (read more in our first post). Because of these issues, we needed to find a sustainable replacement for our E2E testing.

We looked to our shared staging environment as a viable replacement. Staging gave us confidence due to its similarity to production, but it was missing one piece needed to make it feel like a safe environment for development: isolation.

Staging Environment

Our staging environment runs the same tech stack as production but with scaled-down resources, fake user data, and a synthetic web traffic generator. Staging at Lyft is a first-class environment: breached SLOs will page on-call engineers, and developers will raise a SEV if the environment becomes unstable. Though staging’s availability and genuine traffic add E2E confidence, several problems would arise if we encouraged extensive development there:

  1. Staging is an entirely shared environment like production; if someone deployed a malfunctioning instance to the staging mesh, it would affect others who (perhaps transitively) depend on that service.
  2. The intended way of shipping new code to staging is by merging a PR to main, which will kick off a new deploy pipeline. This carries a lot of process burden just to test how an experimental change works E2E: writing tests, getting code review, merging, and progressing through CI/CD.
  3. This burdensome process could push users toward an escape hatch: deploying their PR branch directly to staging. With unvetted commits running in staging, we’d further amplify the stability issues that bugs cause in the environment.

Our goal was to overcome these challenges and make the staging environment a better place for manual validation of E2E workflows. We wanted to let users test their code in staging without being bogged down by process, all while minimizing the blast radius should something go wrong with their revision. To accomplish this, we created staging overrides.

Staging Overrides

Staging overrides is a set of tools to safely and quickly validate user changes in the staging environment. We fundamentally shifted our approach for the isolation model: instead of providing fully isolated environments, we isolated requests within a shared environment. At its core, we enable users to override how their request flows through the staging environment to conditionally exercise their experimental code. The rough workflow looks like this:

  1. Create a new deployment on staging that doesn’t register with service discovery. This is what we call an offloaded deployment and it guarantees that other users making requests to this service won’t get routed through this (potentially broken) instance.
  2. Embed override information in request headers that our infrastructure knows how to interpret. Ensure this override metadata gets propagated throughout the entire request call graph.
  3. Modify our per-service routing rules to respect override information provided in the request headers. Route to the offloaded deployment when the override metadata specifies so.

Example Scenario

Let’s say a user wanted to test a new version of the onboarding service in an E2E scenario. Previously with Onebox, the user would spin up an entire copy of the Lyft stack and modify just their service to validate that it worked as expected.

With this new approach in staging, users share the environment, but can spin up offloaded instances that don’t interrupt the flow of normal staging traffic.

Typical users making requests to staging won’t be routed through any live offloaded instances

By attaching specific headers to the request (“request baggage”), users can then opt their request into flowing through their new instance:

Header metadata allows users to modify call flow on a per-request basis

In the rest of this article, we’ll dig into how we built these components to provide this integrated debugging experience.

Offloaded Deployments

Lyft uses Envoy as its service-mesh proxy, handling all communication between our many services

At Lyft, every instance of a service is deployed alongside a sidecar Envoy which acts as the sole ingress/egress for that service. By ensuring that all network traffic flows through Envoy, we provide developers with a simplified view of traffic that gives service abstraction, observability, and extensibility in a language-agnostic way.

A service calls an upstream service by sending a request to its sidecar Envoy, which forwards the request to a healthy instance of the upstream. Each sidecar Envoy’s configuration is kept up to date by our control plane, which watches Kubernetes events and pushes updates to the sidecars via the xDS APIs.

Avoiding Service Discovery

If we want to create an instance that won’t receive traffic from normal services in the mesh, we need to indicate to the control plane that this instance should be excluded from service discovery. To accomplish this, we embed extra information in our Kubernetes pod labels to denote the pod as offloaded:

...
app=foo
environment=staging
offloaded-deploy=true
...

We can then modify our control plane to filter out these instances, ensuring they don’t receive standard traffic in staging.
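
As a rough illustration (not our actual control-plane code), the filtering step looks something like this Python sketch, with the pod model pared down to just what we need:

from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    ip: str
    labels: dict = field(default_factory=dict)

def discoverable_endpoints(pods: list[Pod]) -> list[str]:
    # Advertise only non-offloaded pods over the xDS APIs, so ordinary
    # staging traffic is never routed to an offloaded instance.
    return [pod.ip for pod in pods
            if pod.labels.get("offloaded-deploy") != "true"]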

When a user is ready to create an offloaded deployment in staging (after local iteration), they first create a pull request in GitHub. Our continuous integration will automatically kick off the container image build needed for the deployment. Users can then leverage a GitHub bot to explicitly offloaded-deploy their service into staging:

Our GitHub bot makes it simple to create an offloaded deployment from your PR

With this implemented, users can create an isolated deployment of a service that shares the exact same environment as normal staging deployments: it talks to the standard databases, egresses to other services, and is observed by our standard metrics/logs/analytics stack. This has proven useful for developers who simply want to ssh into an instance and test out a script or run a debugger without worrying about affecting the rest of staging. The real power of offloaded deployments, however, comes when devs open the Lyft app on their phone and see their requests served by their PR’s code.

Override Headers and Context Propagation

To route requests to these offloaded deployments, we need to embed metadata in the request that informs our infrastructure when to modify the call flow. This metadata will contain which services we want to override routing rules for and to which offloaded deployment we should direct traffic. We decided to put this metadata inside a request header so that it would be transparent to services and service owners.

However, we needed to ensure the header would be propagated throughout our mesh by services written in several languages. We already used an OpenTracing header (x-ot-span-context) to propagate trace information from one request to the next. OpenTracing has a concept called baggage: a key/value structure embedded within the header that persists across service boundaries. Embedding an encoded form of our metadata inside the baggage allowed us to make quick progress, because our request and tracing libraries already propagated it.

Creating and Attaching Baggage

The actual HTTP header is a base64-encoded trace protobuf. We created our own protobuf message named Overrides that gets injected into that trace’s baggage, as demonstrated by the code below:

Sample data structure we can embed in the trace baggage
How we extract the current trace and enrich it with our overrides
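
In spirit, that code looks something like the following Python sketch. The exact message layout and helper names are assumptions (and JSON stands in for the real protobuf serialization); the baggage call is the standard OpenTracing API:

import base64
import json
import opentracing

def encode_overrides(overrides: dict) -> str:
    # Stand-in serialization; the real implementation base64-encodes an
    # Overrides protobuf, e.g. mapping a service to an offloaded ip/port:
    # {"routing_overrides": [{"service": "onboarding", "ip_port": "10.0.0.42:80"}]}
    return base64.b64encode(json.dumps(overrides).encode()).decode()

def attach_overrides(span: opentracing.Span, overrides: dict) -> None:
    # Baggage items persist across service boundaries, so every hop in
    # the request call graph can read the overrides.
    span.set_baggage_item("overrides", encode_overrides(overrides))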

To abstract away this data serialization from our developers, we added tooling to our existing proxy application for header creation (read more about our proxy). Developers point their client at this proxy, which lets them intercept request/response data with user-defined TypeScript snippets. We created a helper function setEnvoyOverride(service: string, sha: string), which looks up an IP address via the sha, creates the Overrides protobuf, encodes the header, and ensures it’s attached to each request flowing through the proxy.

Context Propagation

An important aspect of any distributed tracing system is context propagation. We need the header metadata to be available throughout the lifetime of a request to ensure that services many calls deep have access to the user-specified overrides. We want to guarantee that each service along the way forwards the metadata to services later in the request flow, even if that service doesn’t itself care about the content.

Every service in the call graph must propagate the metadata for full trace coverage

Lyft infrastructure maintains standard request libraries in our most used languages (Python, Go, TypeScript) that handle context propagation for our developers. As long as service owners use these libraries to egress to another service, context propagation is transparent to users.
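
As an illustration (not our actual request library), a Python egress helper that propagates the header might look like:

import requests

TRACE_HEADER = "x-ot-span-context"

def call_upstream(inbound_headers: dict, url: str, **kwargs) -> requests.Response:
    # Forward the tracing header verbatim so overrides set at the edge
    # survive all the way down the call graph.
    outbound = kwargs.pop("headers", {})
    if TRACE_HEADER in inbound_headers:
        outbound[TRACE_HEADER] = inbound_headers[TRACE_HEADER]
    return requests.get(url, headers=outbound, **kwargs)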

Unfortunately, during the rollout of this project we found context propagation was not as ubiquitous as we’d hoped. Our initial users frequently came to us saying their overrides weren’t getting applied to their requests, and dropped traces were often the culprit. We invested significantly to ensure that context propagation worked across various language idiosyncrasies (e.g. Python gevent/greenlets), multiple request protocols (HTTP/gRPC), and various async jobs/queues (SQS). We also added observability and tooling to diagnose issues involving dropped traces, such as dashboards that identify services egressing without the header.

Extending Envoy

Now that override metadata propagates with our requests, we need to modify our networking layer to read that metadata and redirect traffic to the desired offloaded instance.

Because all of our services make requests through a sidecar Envoy, we can embed middleware in those proxies to read the overrides and modify the routing rules appropriately. We leveraged Envoy’s HTTP filter system to hook into request processing. The HTTP filter we created boils down to two steps: reading override information off the request headers, and modifying the routing rules to point the request at the offloaded deployment.

Tracing in Envoy HTTP Filters

We decided to create a decoder filter, allowing us to parse and react to overrides before the request is sent to the upstream cluster. The HTTP filter system supplies a simple API exposing the current destination route and all the headers of the in-flight request. Though the filter is implemented in C++, some pseudocode captures the general gist:
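
# Python-flavored pseudocode; the real filter is C++ and the helper
# names are illustrative.
def decode_headers(headers, route):
    # Step 1: pull override metadata out of the trace baggage.
    routing_overrides = headers.trace().baggage().get('overrides')
    if not routing_overrides:
        return CONTINUE

    # Step 2: if an override targets this request's destination cluster,
    # point the request at the offloaded instance instead.
    for override in parse_overrides(routing_overrides):
        if override.service == route.cluster():
            headers['x-envoy-original-dst-host'] = override.ip_port
            route.set_cluster('original_dst_cluster')
    return CONTINUE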

Our filter uses Envoy’s tracing utilities to extract the overrides contained within the baggage. While filters have always had access to trace information like traceId and isSampled, we had to make this project’s first contribution to Envoy in order to enable baggage retrieval: tracing: add baggage methods to Tracing::Span. With this change merged, our filter could use the new API to grab baggage in the underlying trace: routing_overrides = headers.trace().baggage()['overrides']

Original Destination Cluster

Assuming the overrides apply to the current destination cluster, we must redirect the request to our offloaded deploy. We used Envoy’s original destination cluster (ORIGINAL_DST) to send the request to a baggage-supplied override.

With the ORIGINAL_DST cluster we configured, the eventual destination is decided by a special x-envoy-original-dst-host header, which will contain an ip/port like 10.0.0.42:80. Our HTTP filter can mutate this header to redirect the request.
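
A minimal sketch of such a cluster definition (illustrative values, not our exact config):

clusters:
- name: original_dst_cluster
  type: ORIGINAL_DST
  lb_policy: CLUSTER_PROVIDED
  connect_timeout: 1s
  original_dst_lb_config:
    # Let the x-envoy-original-dst-host header pick the destination.
    use_http_header: true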

For example, if the request was originally intended for the users cluster, but a users override ip/port was passed in the baggage, we’d mutate the x-envoy-original-dst-host to whichever ip/port was provided.

After the x-envoy-original-dst-host has been modified, the filter needs to send the request to the ORIGINAL_DST cluster to ensure the new destination is respected. This requirement led us to our second Envoy change: http: support route mutability. With this change merged, our filter could mutate the route’s destination cluster: route.set_cluster('original_dst_cluster').

Results

With offloaded deployments, propagated baggage, and an Envoy filter, we’ve now shown all the major components of our staging overrides product 🎉

This workflow has drastically improved the process required to gain E2E confidence. More than 100 unique services now ship offloaded deployments each month. Compared to our previous Onebox solution, staging overrides has the following advantages:

  • Environment provisioning: Onebox required users to spin up hundreds of containers and run bespoke seed scripts, stalling devs for >1hr before their environment was ready to test. With staging overrides, a user can have their PR in an E2E environment in <10 minutes.
  • Infrastructure parity: Onebox ran an entirely separate tech stack from staging/production, so underlying infrastructure components (e.g. networking, observability) were often implemented separately. By moving E2E testing to staging, we’ve lowered infrastructure support costs, since improvements to the environment are centralized.
  • Functional parity: Because of the differences between Onebox and production, users were often (rightfully) hesitant about the correctness of their code even after Onebox E2E testing. Staging resembles production much more closely with respect to data and traffic patterns, giving users more confidence that their staging-ready code is also production-ready code.

Additional Work

Getting staging overrides launched was a cross-org effort involving networking, deploys, observability, security, and dev tooling. Here are some additional work streams we haven’t covered:

  • Configuration Overrides: Alongside routing overrides specified in the baggage, we also enabled users to modify configuration variables on a per-request basis. By modifying our config libraries to give precedence to the baggage, we let users flip feature flags for their request before enabling them globally for the environment (see the sketch after this list).
  • Security Implications: Because overrides can dictate routing rules, we had to lock down our filter’s functionality to ensure bad actors couldn’t reroute traffic arbitrarily.
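
For the configuration overrides above, a minimal Python sketch of that baggage-first precedence (names hypothetical, not our actual config library):

def get_config(key: str, baggage: dict, global_store: dict):
    # Per-request overrides carried in the baggage win over the
    # environment-wide value.
    overrides = baggage.get("config_overrides", {})
    return overrides.get(key, global_store.get(key))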

Future Work

Going forward, there’s a lot more we can do with staging overrides to let users recreate the E2E scenario they are looking to validate:

  • Shareable baggage: Provide users a centrally managed baggage store which lets them persist a unique set of overrides (service foo is X, service bar is Y, flag baz is Z). This will improve collaboration by making it trivial to share an exact scenario with a teammate.
  • Override use cases: We can teach our infrastructure about other overrides to give users control over how their request acts. For example, we could inject synthetic latency into requests with Envoy fault injection, temporarily enable debug logging, or redirect to different databases.
  • Integration with local development: Rather than requiring users to spin up an instance of their PR in staging, we can allow overrides that reroute the request directly to a user’s laptop.

Stay tuned for the next post in our series, which will show how we gate production deployments with our automated acceptance tests in staging!

If you’re interested in working on developer productivity problems like these, take a look at our careers page.

Special thanks to the following people who helped to create this post: Garrett Heel, Brian Balser, Jake Kaufman, Rithu John, Scott Wilson, Daniel Metz, Michael Meng, Priyanka Samarth
