Improving the Experience of Making Envoy Route Changes

In a microservice world, significant route configuration changes to the front proxy are often required to keep up with an evolving business. Keeping these changes manageable and testable quickly becomes challenging as the number of developers making changes, and the frequency of those changes, grows.

The Old Way

At Lyft, developers used to create Envoy route configurations, specifically Route Discovery Service (RDS) responses, in JSON by leveraging Jinja. These were subsequently uploaded to S3 and consumed by the control plane.

We also used the Envoy OSS route check tool to verify those generated route JSON files.

A New Approach

To avoid directly updating the route configuration in JSON and decouple developers’ inputs from the Envoy API, we introduced our own protobuf for route configurations. This allowed us to have finer-grained control on the Envoy features we need to support at Lyft.

After the route configurations are changed in protobuf, they are first converted into JSON and then uploaded to S3. Eventually, the control plane reads the JSON files from S3, materializes them into RDS responses, and finally delivers the configurations to the Envoy sidecar.

Additionally, we chose to replace the OSS route check tool with an in-house version that consumes our internal protobuf format. The implementation of the route check tool is straightforward, simple enough to maintain in-house, and easy to update to catch up with improvements to the open source (OSS) version. This saves developers’ time by allowing them to run the route-check unit tests locally without compiling Envoy.

In the following sections, we will discuss these improvements in more detail.

User Experience

In the old way, route configurations involved heavy use of Jinja for templating. As there is minimal support for Jinja syntax and schema validation, it became challenging to make non-trivial changes like adding additional response headers to all routes.

To enable syntax support for route configurations, we chose to use Go, as it is widely used and well-supported at Lyft. By treating route configurations as code, we are able to easily manipulate and write tests against them, which improves developer efficiency.

Example: Rolling out security fixes for all endpoints

Building route configurations programmatically allows us to set up entry points that add and update routes across all configurations easily. For example, additional routes can be added to all virtual hosts at once to handle security incidents.
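As a rough illustration of such an entry point, here is a minimal sketch in Go. The `Route` and `VirtualHost` types and the `PrependToAll` helper are hypothetical names invented for this example; they are not Lyft's internal protobuf types or the Envoy API. The sketch assumes that a security fix takes the form of a route placed ahead of each virtual host's existing routes.

```go
package main

import "fmt"

// Route and VirtualHost are hypothetical stand-ins for an internal
// route-configuration schema. They are not Lyft's or Envoy's types.
type Route struct {
	Prefix         string
	Cluster        string // upstream cluster; empty when responding directly
	DirectResponse int    // HTTP status to return directly; 0 means forward
}

type VirtualHost struct {
	Name    string
	Domains []string
	Routes  []Route
}

// PrependToAll returns a copy of hosts with the extra routes inserted ahead
// of each host's existing routes. Envoy uses the first route that matches,
// so prepended routes take precedence over the existing ones.
func PrependToAll(hosts []VirtualHost, extra ...Route) []VirtualHost {
	out := make([]VirtualHost, 0, len(hosts))
	for _, h := range hosts {
		routes := make([]Route, 0, len(extra)+len(h.Routes))
		routes = append(routes, extra...)
		routes = append(routes, h.Routes...)
		h.Routes = routes // h is a copy, so the input slice is left untouched
		out = append(out, h)
	}
	return out
}

func main() {
	hosts := []VirtualHost{
		{Name: "api", Domains: []string{"api.example.com"}, Routes: []Route{{Prefix: "/", Cluster: "api"}}},
		{Name: "web", Domains: []string{"www.example.com"}, Routes: []Route{{Prefix: "/", Cluster: "web"}}},
	}

	// Hypothetical incident response: reject a sensitive path everywhere.
	block := Route{Prefix: "/debug/", DirectResponse: 403}

	for _, h := range PrependToAll(hosts, block) {
		fmt.Printf("%s: %d routes, first prefix %q\n", h.Name, len(h.Routes), h.Routes[0].Prefix)
	}
}
```

Because the helper returns new values instead of mutating its input, the same base configuration can be reused in tests with and without the extra routes.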

API Compatibility

In the past, the Envoy API was referenced in both the Envoy configuration and the control plane. When upgrading the Envoy API, changes had to be coordinated between the two to avoid API incompatibility issues. As we occasionally need to upgrade the Envoy API because Envoy deprecates fields, we needed to simplify the process.

In the new approach, as the Envoy API is only referenced in the control plane, we can more easily update it without needing coordination.

Example: Improved workflow for deprecated fields

Previously, the following steps had to be taken in order to remove deprecated fields:

  1. Update “Envoy configs” to remove the deprecated fields and replace them if required
  2. Update “Control plane” to the API version that actually removes the deprecated fields

With the new approach, we only need one change to remove deprecated fields:

  1. Update “Control plane” to the API version that removes the deprecated fields and transform the internal protobuf inputs to corresponding API fields.
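
One way to picture why a single change suffices: if a translation function in the control plane is the only code that knows the shape of the proxy API, an API upgrade touches only that function. The sketch below uses plain Go structs; `InternalRoute`, `APIRoute`, `ToAPIRoute`, and the timeout fields are all hypothetical and chosen purely for illustration. They do not correspond to real Envoy or Lyft fields.

```go
package main

import (
	"fmt"
	"time"
)

// InternalRoute is a hypothetical developer-facing route definition,
// standing in for a message in an internal protobuf schema.
type InternalRoute struct {
	Prefix    string
	Cluster   string
	TimeoutMs int64
}

// APIRoute is a hypothetical stand-in for the proxy API message that the
// control plane emits. It is not a real Envoy type.
type APIRoute struct {
	Prefix  string
	Cluster string
	Timeout time.Duration // imagine this replaced an older millisecond field
}

// ToAPIRoute is the single place that maps internal inputs to API fields.
// When an API field is deprecated, only this function has to change;
// developer-authored InternalRoute values stay as they are.
func ToAPIRoute(in InternalRoute) APIRoute {
	return APIRoute{
		Prefix:  in.Prefix,
		Cluster: in.Cluster,
		Timeout: time.Duration(in.TimeoutMs) * time.Millisecond,
	}
}

func main() {
	in := InternalRoute{Prefix: "/v1/", Cluster: "api", TimeoutMs: 1500}
	out := ToAPIRoute(in)
	fmt.Printf("prefix=%s cluster=%s timeout=%s\n", out.Prefix, out.Cluster, out.Timeout)
}
```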

Local Development and Testing

Before, we relied heavily on the Envoy OSS route check tool to ensure new route changes would not cause issues; however, this quickly became a bottleneck to developer productivity. The OSS route check tool requires the Envoy OSS binary, and we found it challenging to make the Envoy OSS binary available locally to all developers at Lyft. The slow feedback loop for debugging route test failures would prevent this process from ever becoming truly self-service.

Since we built our internal route check tool, developers are now able to run it locally and debug test failures by themselves.

Testing in CI

The output from the Envoy OSS route check tool is a stream of results which includes results for individual route tests and coverage of various routes defined for various domains. This output was often overwhelming and confusing to developers.

Replacing the OSS tool with the in-house library for route checks allows us to use separate Go unit tests for individual route checks and route coverage checks, making test failures clearer in context.

Additionally, our CI infrastructure understands the output of Go unit tests and can therefore flag specific failing tests along with their log output, making failures easier to find.
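
To make the idea of per-route checks concrete, here is a minimal sketch of a route check written as a plain Go library. All names (`Resolve`, `RunChecks`, `Case`, and the simplified `Route` and `VirtualHost` types) are assumptions made for this example and do not reflect Lyft's in-house tool. Matching is deliberately simplified to exact domain match and path-prefix match; the real Envoy matching rules are richer.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified, hypothetical route model used only for this sketch.
type Route struct {
	Prefix  string
	Cluster string
}

type VirtualHost struct {
	Domains []string
	Routes  []Route
}

// Resolve finds the virtual host that lists authority as a domain and
// returns the cluster of the first route whose prefix matches path.
func Resolve(hosts []VirtualHost, authority, path string) (string, bool) {
	for _, h := range hosts {
		for _, d := range h.Domains {
			if d != authority {
				continue
			}
			for _, r := range h.Routes {
				if strings.HasPrefix(path, r.Prefix) {
					return r.Cluster, true
				}
			}
			return "", false // domain matched, but no route did
		}
	}
	return "", false
}

// Case is one expectation: a request and the cluster it should reach.
type Case struct {
	Name, Authority, Path, WantCluster string
}

// RunChecks returns one message per failing case, so each failure is
// reported with its own name and context.
func RunChecks(hosts []VirtualHost, cases []Case) []string {
	var failures []string
	for _, c := range cases {
		got, ok := Resolve(hosts, c.Authority, c.Path)
		if !ok || got != c.WantCluster {
			failures = append(failures, fmt.Sprintf("%s: %s%s routed to %q, want %q",
				c.Name, c.Authority, c.Path, got, c.WantCluster))
		}
	}
	return failures
}

func main() {
	hosts := []VirtualHost{{
		Domains: []string{"api.example.com"},
		Routes:  []Route{{Prefix: "/v1/users", Cluster: "users"}, {Prefix: "/", Cluster: "default"}},
	}}
	cases := []Case{
		{Name: "users", Authority: "api.example.com", Path: "/v1/users/42", WantCluster: "users"},
		{Name: "fallback", Authority: "api.example.com", Path: "/health", WantCluster: "default"},
	}
	failures := RunChecks(hosts, cases)
	fmt.Printf("%d failing route checks\n", len(failures))
	for _, f := range failures {
		fmt.Println(f)
	}
}
```

In a `_test.go` file, each `Case` could instead run as its own subtest via `t.Run(c.Name, ...)`, which is the kind of structure that lets CI tooling report a failing route check by name.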

Migration Journey

At Lyft, we needed to migrate lots of route configurations to the new format. In the following sections we will talk about how these configurations were migrated and how we built up confidence during the migration process.

Migration Steps

The first step was to rewrite the existing route configuration into our internal protobuf format and then serialize it into RDS JSON. The following diagram shows what changed between the initial stage and the interim stage: in “Envoy configs”, route configurations were now written in our self-defined protobuf.

After the interim stage was completed, we started to directly serialize the self-defined protobuf into JSON and upload it to S3 along with the existing RDS JSON. As the final step, the logic in the Go tool was migrated from “Envoy configs” to the control plane.

Building Confidence

We took two main approaches to build confidence during the migration.

The first one was to increase the awareness of fundamental changes. We checked in all generated RDS JSON files before the migration so that we could easily check the differences as part of the code review process when starting to rewrite the route configuration in self-defined protobuf.

For the second approach, we split the route configurations into multiple batches and started with the least-important ones. This allowed us to build confidence early on with less risky parts of the migration and smoothly transition to the new format.

Future development

Having the protobuf format in place opens up future opportunities to expose these configurations via APIs to management-plane services, which would further improve the user experience, for example through web UIs for common workflows.

Acknowledgments

Special thanks to the colleagues on the networking team for their contributions: Abhinav Dahiya, Chao-Han Tsai, and Jacky Tian.

Lyft is hiring! If you’re interested, check out our careers page.
