FacetController: How we made infrastructure changes at Lyft simple

Written by Miguel Molina and Arvind Subramanian


If you are curious about Lyft’s automatic deployment process on a higher level, please read our blog post on Continuous Deployment.

In this post, we will go a little deeper into the deployment stack and how we leverage Kubernetes Custom Resource Definitions (CRDs) to create an abstraction on top of Kubernetes native resources, known at Lyft as facets. Additionally, we will discuss the new controller we developed to manage these facets and to streamline infrastructure rollouts across the company.

What are facets?

When deploying code, each Lyft microservice is composed of smaller deployable Kubernetes components called facets. There are several facet types representing different deployable Kubernetes objects, and they are defined in a generic manifest.yaml file within each project’s repository.

The following are some of the facet types we have at Lyft:

Service facets

These facets receive and send traffic, typically web servers containing APIs for a microservice. In this example, the service facet has different autoscaling min and max sizes per environment, and the HPA scales up when CPU utilization reaches 70%.

- name: webservice
  container_command: go run main.go
  type: service
  autoscaling:
    criteria:
      cpu_target: 70
    environment:
      staging:
        min_size: 5
        max_size: 20
      production:
        min_size: 5
        max_size: 200

This metadata ensures that the Kubernetes resources for a ReplicaSet, Service, Deployment, ConfigMap, and an HPA are created.

Worker facets

These facets only send traffic and typically do some offline processing of work, like taking items from a queue and performing some action.

- name: offlineworker
  container_command: somestartupcommand.sh
  type: worker
  autoscaling:
    min_size: 1
    max_size: 1

This metadata ensures that the Kubernetes resources for a ReplicaSet, Service, Deployment, ConfigMap, and an HPA are created.

Cron facets

These facets run workloads on a schedule. For example, once a week on Sunday.

- name: mycron
  container_command: somestartupcommand.sh
  type: cron
  schedule: 0 0 * * SUN

This metadata ensures that the Kubernetes resources for a CronJob are created.
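As a concrete reading of the schedule above, `0 0 * * SUN` means "at midnight every Sunday." A minimal sketch of computing the next fire time with only Python's standard library (this is illustrative, not Lyft's scheduler — Kubernetes CronJobs handle this natively):

```python
from datetime import datetime, timedelta

def next_sunday_midnight(now: datetime) -> datetime:
    """Next fire time for the cron schedule '0 0 * * SUN' after `now`."""
    # weekday(): Monday=0 ... Sunday=6
    days_ahead = (6 - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=7)  # already past this week's slot
    return candidate
```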

Job facets

These facets run workflows once at deployment time and are then terminated and deleted.

- name: s3uploadjob
  container_command: upload_data_to_s3.py
  type: job

This metadata ensures that the Kubernetes resources for a Job are created.

Batch facets

These facets contain code for a workflow that can be invoked by the user whenever an action is needed. For example, running a DB migration.

- name: dbmigrationbatch
  container_command: migration.py
  type: batch

This metadata ensures that the Kubernetes resources for a Job are created.
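The facet-type-to-child-objects mapping described above can be summarized as a simple lookup. This sketch is a hypothetical illustration (the names are not Lyft's actual code), collecting the kinds listed for each facet type in this post:

```python
# Illustrative mapping of facet types to the Kubernetes object kinds
# generated for them, per the descriptions above.
FACET_CHILD_KINDS: dict[str, list[str]] = {
    "service": ["ReplicaSet", "Service", "Deployment", "ConfigMap",
                "HorizontalPodAutoscaler"],
    "worker":  ["ReplicaSet", "Service", "Deployment", "ConfigMap",
                "HorizontalPodAutoscaler"],
    "cron":    ["CronJob"],
    "job":     ["Job"],
    "batch":   ["Job"],
}

def child_kinds(facet_type: str) -> list[str]:
    """Return the Kubernetes object kinds created for a facet type."""
    return FACET_CHILD_KINDS[facet_type]
```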

Deploying Facets

Lyft developers can reference and target facets in their microservice’s deploy pipeline for a controlled rollout. A deploy step can target an environment (or a specific percentage of that environment) as well as individual facets of the service via the target_facets field. For more details on pipeline structures, refer to the Continuous Deployment blog.

For example, a deploy pipeline might look like this:

deploy:
  - name: staging
    automatic: true
    environment: staging
    target_facets: [webservice, offlineworker, dbmigrationbatch]
  - name: canary
    environment: production
    bake_time_minutes: 10
    automatic: true
    target_facets: [webservicecanary]
  - name: one-third-of-production
    environment: production
    bake_time_minutes: 10
    automatic: true
    target_facets: [webservice]
  - name: production
    environment: production
    automatic: true
    target_facets: [webservice, offlineworker, dbmigrationbatch, s3uploadjob, mycron]
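One useful way to read this pipeline is per facet: each facet is deployed only by the steps that list it. A rough sketch of that lookup (the data mirrors the example pipeline above; the function name is hypothetical):

```python
# Illustrative: the example deploy pipeline as data, and a helper to find
# which steps will deploy a given facet, in pipeline order.
PIPELINE = [
    {"name": "staging", "environment": "staging",
     "target_facets": ["webservice", "offlineworker", "dbmigrationbatch"]},
    {"name": "canary", "environment": "production",
     "target_facets": ["webservicecanary"]},
    {"name": "one-third-of-production", "environment": "production",
     "target_facets": ["webservice"]},
    {"name": "production", "environment": "production",
     "target_facets": ["webservice", "offlineworker", "dbmigrationbatch",
                       "s3uploadjob", "mycron"]},
]

def steps_deploying(facet: str) -> list[str]:
    """Return the names of pipeline steps that target the given facet."""
    return [step["name"] for step in PIPELINE
            if facet in step["target_facets"]]
```

So `webservice` rolls through staging, a third of production, then all of production, while `mycron` is only touched by the final production step.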

Problems

Early on during Lyft’s migration to Kubernetes (2019–2020), our infrastructure teams were rapidly evolving how Kubernetes deployments and manifests were configured (the templates defining deployments changed a lot!). At the time, updates to these templates could only be propagated when a new deployment was triggered.

At deployment time, the system reads the project manifest, translates it into the relevant Kubernetes objects (Deployments, ConfigMaps, HPAs, RoleBindings, ServiceAccounts, etc.), and applies those objects to the relevant Kubernetes clusters. We used to template Kubernetes files at deploy time, combining user-defined configuration with some static logic, similar to Helm-style deployments.

With over one thousand microservices, each containing a number of facets, any template change required thousands of deployments or redeployments to update facet objects. As a result, major changes to environment variables, scaling, or other configuration were difficult to roll out and fully converge across all services, and equally difficult to roll back in an emergency.

Each time we needed to add a field to a facet type or run any infrastructure-wide migration, it required heavy manual tracking (such as a spreadsheet of which services still needed deployment) and extensive coordination with every service team. Even a simple template change, like adding or renaming a field, would typically take weeks or months.

These problems all highlighted that we lacked a high-level Custom Resource Definition (CRD) for deployable objects and a way to manage them. So we introduced FacetController.

Solution: FacetController

FacetController manages the lifecycle of facets. Instead of applying all the Kubernetes objects mentioned above, the deploy process now creates or updates a single facet resource (ex. ServiceFacet, WorkerFacet), configured as a Custom Resource Definition, on a Kubernetes cluster. The facet resource closely resembles the metadata that is exposed to our developers in deployment manifests. When facet specs are updated, or during a regular deployment on the cluster, FacetController picks up the change, creates or updates the associated child resources (ex. Deployments, ConfigMaps), and deletes resources that are no longer required. This allows changes to how these child resources are defined to be propagated quickly and easily to all services at Lyft.
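The core of that reconcile step is a diff between the children a facet should have and what currently exists on the cluster. A minimal sketch of the idea (real controllers use client-go/controller-runtime machinery against the API server; plain dicts stand in for API objects here, and the names are hypothetical):

```python
# Illustrative reconcile: compare desired vs. existing child resources
# for a facet, then decide what to create, update, or garbage-collect.
def reconcile(desired: dict[str, dict],
              existing: dict[str, dict]) -> dict[str, list[str]]:
    """Return names of children to create, update, and delete.

    `desired` and `existing` map child resource names to their specs.
    """
    to_create = [name for name in desired if name not in existing]
    to_update = [name for name in desired
                 if name in existing and existing[name] != desired[name]]
    # Anything on the cluster that the facet no longer declares gets GC'd.
    to_delete = [name for name in existing if name not in desired]
    return {"create": to_create, "update": to_update, "delete": to_delete}
```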

Now, changing a template only requires a deployment of FacetController instead of individually redeploying each service at Lyft. FacetController has effectively saved every infrastructure team from spending multiple quarters on migrations that now take only a few weeks to roll out and test safely.

[Figure: FacetController’s sync-loop architecture]

Infrastructure Management is way easier

The biggest benefit of FacetController is that it has given us a way to drive sweeping changes to user services safely and ensures the changes happen in an automated fashion. Some examples:

Changes to Underlying Infrastructure

Autoscaling Changes (Kubernetes/autoscaler to Karpenter)

FacetController enabled our migration to Karpenter from Cluster Autoscaler for managing how our nodes get packed with pods and balanced over time. It allowed us to slowly and safely select projects for deployment to Karpenter-managed nodes by using labels added through FacetController.

Kubernetes Upgrades

As Lyft’s infrastructure has evolved, some Kubernetes clusters are pending deprecation and are running older versions of Kubernetes while other clusters are running newer versions. Even though older clusters may rely on deprecated APIs, FacetController allows for managing different cluster versions by generating the appropriate resources based on each cluster’s specific API version.
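The pattern here is that the controller, not each service, decides which API version to emit per cluster. A hedged sketch of the idea for HPAs (the cutoff reflects `autoscaling/v2` becoming stable in Kubernetes 1.23; the function and its logic are illustrative, not Lyft's exact code):

```python
# Illustrative: pick the HorizontalPodAutoscaler apiVersion based on the
# target cluster's Kubernetes minor version, so one controller can serve
# clusters on different versions.
def hpa_api_version(cluster_minor: int) -> str:
    """Return the HPA apiVersion appropriate for a 1.<minor> cluster."""
    if cluster_minor >= 23:
        return "autoscaling/v2"       # stable since Kubernetes 1.23
    return "autoscaling/v2beta2"      # older clusters on the beta API
```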

Changes to Developer Experience

CPU limits removal

Removal of CPU limits allowed Lyft services to eliminate CPU throttling and let our most critical services burst when needed. The benefits of this have been extensively discussed by others, so here are some articles that explore the topic in more detail: Making Sense of Kubernetes CPU Requests And Limits | by JettyCloud | Medium, Remove your CPU Limits | by Shon Lev-Ran | Directeam, and The container throttling problem | Dan Luu.

Stay tuned for a future blog post on how removing CPU limits unblocked many cost savings initiatives.

Scaling on service container CPU

At Lyft we run many sidecar containers (envoyproxy, stats, logging, etc.) on each pod. Pod-level CPU can sometimes be deceiving, as the sidecars can skew the average CPU utilization of the pod while the application container runs hotter. This made us realize the importance of also scaling on application container CPU, and we now use the max of the application container CPU and the pod’s overall CPU for more accurate scaling.
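The scaling signal described above can be sketched in a few lines (a simplified illustration of the idea; the function name and inputs are hypothetical, and a real HPA would source these values from the metrics pipeline):

```python
# Illustrative: scale on the max of the app container's CPU utilization
# and the pod-wide average, so idle sidecars can't mask a hot application.
def scaling_cpu_utilization(app_cpu: float, sidecar_cpus: list[float]) -> float:
    """Return the CPU utilization (0.0-1.0) to feed the autoscaler."""
    pod_avg = (app_cpu + sum(sidecar_cpus)) / (1 + len(sidecar_cpus))
    return max(app_cpu, pod_avg)
```

With an app container at 90% and three sidecars near idle, the pod average is only ~30%, but the max-based signal correctly scales on the 90%.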

FacetController’s Net Benefits

Proper abstraction for facets and their templates

With FacetController, we now have one unified codebase to manage the lifecycle of facets instead of disparate systems that require individual updates. This consolidation means tools that modify facets (ex. our internal developer platform, command line tools) now interact with one resource instead of multiple resources that could diverge between the old tools.

Automatic Garbage Collection (GC) of resources

Before, when deprecating a facet for a service, we would have to manually delete all the objects from that facet, such as the ConfigMap, K8s Service, K8s Deployment, etc. Now with FacetController, because each facet has its standard interface/template and management, all of these are automatically GC’d when a facet is removed from a project’s manifest.

No need for en-masse redeploys of services for an infrastructure-level change

This process used to require coordination with service owners and re-deploying thousands of services, meaning a change to the facet spec would take multiple months. Now most infrastructure-level changes take effect within minutes, yet can still be applied in a controlled manner with rollout flags when a percentage-based rollout is required. This has saved infrastructure engineers many months of work.

Safe rollout of infrastructure-level changes

Despite changes being applied outside of the service’s deployment pipeline, we kept safety as a top priority in the design of how FacetController deploys. Changes are rolled out on a per-cluster basis (we run multiple Kubernetes clusters at Lyft for availability) and can even be limited to select services within a cluster.

Another safeguard we implemented limits concurrent updates to facets, reducing the impact of problematic changes and allowing us to throttle updates.
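A minimal sketch of such a concurrency cap, assuming a semaphore-based limiter (the limit of 3 and all names here are illustrative, not FacetController's actual mechanism):

```python
# Illustrative: cap how many facet updates are in flight at once, so a
# bad change can't hit the whole fleet before it is caught.
import threading

MAX_CONCURRENT_UPDATES = 3
_update_slots = threading.Semaphore(MAX_CONCURRENT_UPDATES)

def apply_facet_update(facet_name: str, apply_fn) -> None:
    """Apply one facet update while holding a concurrency slot."""
    with _update_slots:           # blocks when all slots are taken
        apply_fn(facet_name)
```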

Future work

We have fully adopted the controller pattern across our Kubernetes platform, using FacetController as a model for designing other controllers that manage and automate additional parts of our infrastructure.

Some services at Lyft require additional resources and configurations outside of the provided templates, often for reasons such as using open source configuration. We refer to these as Direct Facets because they directly apply template files to Kubernetes. These exempt services do not use FacetController and therefore do not get the benefits mentioned above. However, we are actively working on adding generic support for these services so that they can leverage the platform.

….

Special thanks to all the people that contributed to the blog post and FacetController over the last few years: Mike Cutalo, Tuong La, Daniel Metz, Frank Porco, Tom Wanielista, and Yann Ramin.

Lyft is hiring! If you’re passionate about Kubernetes and building new controllers or using FacetController, visit Lyft Careers to see our openings.
