Continuous Deployment at Lyft
Continuous Deployment (CD) is the practice of automatically deploying software changes, such as new features and fixes, to customers as quickly and safely as possible. At Lyft, we pride ourselves on “making it happen”, so in 2019 we set out to move from our manual deployment process to CD. The microservice architecture at Lyft made our path to CD challenging, as adoption required changing how each team maintained and deployed their services. Not only did we have to make configuration changes to each service, but most importantly we had to drive cultural changes in how each team approached and performed their deployments. We recently hit a new milestone: 90+% of our approximately 1,000 services now use CD to deploy to production. We would like to share the journey, the technical details, and the challenges we faced.
Background of Deployments at Lyft
Each project at Lyft gets its own repository with a file that defines infrastructure metadata. This file includes a description of the deployment pipeline for that project. For example (some details omitted for simplicity):
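A simplified sketch of the shape such a pipeline definition could take; the field names and stages below are illustrative rather than the exact manifest schema:

```yaml
# Illustrative deploy pipeline definition (field names are hypothetical).
deploy:
  pipeline:
    - name: staging
      trigger: automatic      # deploy on merge to the main branch
      bake_time: 30m          # minimum soak time before promotion
    - name: canary
      trigger: manual
      bake_time: 30m
    - name: production
      trigger: manual
      bake_time: 1h
```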
The above yields a pipeline in Jenkins that would look like this with the Jenkins pipeline plugin:
Jenkins Deploy Pipeline
Previously, deploying a project required the owning team to manually start a new pipeline and run each step. This wasted engineering time and added the cognitive overhead of monitoring deployments while trying to get other work done. Because the process was manual, commits were commonly batched up, which required a lot of coordination on Slack during the rollout and made it harder to determine which commit caused an issue. Manual deployments also meant that changes either took a long time to deploy (a full deployment took around an hour on average) or were not deployed at all until some forcing function arrived, e.g. a security patch for log4j. By the time a change reached production, engineers had much less context on it.
One of the time-consuming parts mentioned above is waiting for a deploy job to bake. Baking a deploy refers to letting it sit in the environment, processing requests, long enough to understand whether anything is broken as a result of the deployment. Baking also allows enough time for code to be exercised, metrics to propagate, and alarms to fire. It is designed to catch bugs that only manifest at runtime, like an elevated rate of 5xx responses or a growing memory leak. With that in mind, the allocated bake time is the minimum amount of time required to see the effects of a deployment, serving as one prong for making deploys safe; there are plenty of subtle bugs that take longer to manifest.
For each of the jobs in the above pipeline, the user had to click every step manually after the minimum bake time had elapsed, while also monitoring dashboards, logs, and alerts. If a step was triggered before the job had baked long enough, the deploy would fail and the user would have to try again later.
In 2018 we updated most pipelines to automatically deploy to staging upon merge. At the time, we did not have the tooling to safely deploy automatically beyond the staging step. Passing tests on the main branch meant the change was likely safe for staging, but we lacked additional checks to confidently let changes advance further automatically. Even this change was a cultural shift in how engineers at Lyft deployed: at the time, it was common to merge without being confident that the code was ready to deploy, even though there was rarely any additional verification between merge and promotion to staging. As stated in Scaling Productivity on Microservices, deploying to staging is consequential and many engineers rely on the staging environment.
AutoDeployer and DeployView
In 2019 we set out to eliminate the manual steps needed in a deployment. We designed a worker, called AutoDeployer, which operates over a state machine. For each deploy job, AutoDeployer evaluates the current state and determines if it should trigger a transition to another state. For each possible transition, the state machine works through a series of checks called Deploy Gates that determine whether a job is allowed to move into the next state. If it is safe to transition, AutoDeployer triggers the deployment using Jenkins. AutoDeployer focuses on jobs in a `waiting` state and evaluates the associated gates. For each job, if one or more gates fail, AutoDeployer leaves the job in the `waiting` state. However, when all the gates pass, the job is allowed to transition through the next states: `queued`, `running`, and then eventually `success` or `failure`.
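To make the flow concrete, here is a simplified sketch of what such an evaluation loop could look like; the types, field names, and executor hand-off are illustrative assumptions, not AutoDeployer's actual code:

```python
from dataclasses import dataclass
from enum import Enum


class DeployState(Enum):
    WAITING = "waiting"
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class GateResult:
    allowed: bool
    reason: str  # surfaced to developers and recorded in gate-evaluation logs


def evaluate_job(job, gates, executor) -> None:
    """Evaluate one deploy job and advance it only if every gate passes."""
    if job.state != DeployState.WAITING:
        return  # the worker only acts on jobs that are waiting

    results = [gate.evaluate(job) for gate in gates]
    blocked = [r for r in results if not r.allowed]
    if blocked:
        # At least one gate failed: leave the job in `waiting` and surface why.
        job.blocking_reasons = [r.reason for r in blocked]
        return

    # All gates passed: queue the job and hand it to the execution system
    # (Jenkins here), which drives it through `running` to `success` or `failure`.
    job.state = DeployState.QUEUED
    executor.start(job)
```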
Before diving into more detail on gates, we need to introduce DeployView.
DeployView is a UI for visualizing the automatic deploy pipelines. DeployView shows richer context, including deploy state, and allows users to interact directly with the auto deployment system for approvals and overrides. In 2021 we moved teams over to the new DeployView UI. As the image below shows, users no longer interacted directly with Jenkins, but Jenkins still handled the execution of each deploy job behind the scenes.
DeployView <> AutoDeployer Architecture Design
We mentioned gates as our checks for deployments to advance — so let’s go over how they work.
Deploy Gates
Deploy Gates help us safeguard against bad code changes. They are a set of standard checks that AutoDeployer performs on each deploy job to make sure that it is safe to go out.
Deploy Gates allow us to not only evaluate the safety of deploying a particular job, but also to surface this information to developers. When a deploy is blocked or even after it runs, developers can see why each gate triggered or blocked the deployment. We also added detailed logs about each gate evaluation so that we could track any issues with our gates. During rollout, the logs were particularly useful to identify issues with gate evaluation and latency and allowed us to improve our system.
By default every service gets a set of standard gates to run:
- Alerts Gate: blocks a deploy if there are open pages on the service.
- Bake Gate: blocks a deploy until the bake time for the job has elapsed.
- Deploy Hours Gate: blocks a deploy outside of the service’s business hours.
- Tainted Revision Gate: blocks a deploy if the targeted commit has been “tainted” (manually marked as “bad”) by a user.
While we were developing this, Amazon published a post on their own auto deployment system which featured very similar checks.
We have a consistent way to surface, on the DeployView pipeline page, why certain deploys are or are not allowed to progress. The example screenshot of the DeployView pipeline page below shows a job that will not automatically advance because of pending gates:
DeployView Deploy Pipeline Page
In this case, the bake time is something that will eventually complete. Assuming no alerts or other blocking reasons surface, this job will eventually deploy on its own.
In addition to the default gates, teams can define custom gates through the extensible system we designed around a basic interface:
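As a rough sketch (with illustrative class and method names rather than the exact API), such an interface might look like the following, shown here alongside a simplified bake-time gate:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class GateResult:
    allowed: bool
    reason: str  # human-readable explanation shown to developers


class DeployGate(ABC):
    """A single check that decides whether a deploy job may advance."""

    @abstractmethod
    def evaluate(self, job) -> GateResult:
        """Return whether the job may proceed, plus the reason."""


class BakeGate(DeployGate):
    """Illustrative default gate: block until the minimum bake time has elapsed."""

    def __init__(self, minimum_bake_minutes: int):
        self.minimum_bake_minutes = minimum_bake_minutes

    def evaluate(self, job) -> GateResult:
        elapsed_minutes = (
            datetime.now(timezone.utc) - job.bake_started_at
        ).total_seconds() / 60
        if elapsed_minutes < self.minimum_bake_minutes:
            return GateResult(
                allowed=False,
                reason=f"Baked {elapsed_minutes:.0f}m of {self.minimum_bake_minutes}m",
            )
        return GateResult(allowed=True, reason="Minimum bake time elapsed")
```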
You can see a success story of other teams at Lyft creating their own gates in our earlier post on Gating Deploys with Automated Acceptance Tests.
Rollout and Customer Interactions
Continuous deployment to production was first tested by the Deploys team and a few volunteer teams. This early feedback allowed us to make major changes to how teams thought about and conducted deployments, and helped us write onboarding and best-practices documentation.
After rolling out to a few more teams, we held conversations with our customers to determine whether we had met their needs. These conversations established that we were ready to start rolling out to more organizations within engineering. In retrospect, we could have started rolling out to teams beyond our own much earlier, as teams were happy with AutoDeployer and just needed support to move to it. This prompted us to do an en masse rollout to reach the 90+% of services using AutoDeployer today, and we were met with a warm reception.
We’ve also realized an unexpected risk: on occasion, engineers assume the automation is smarter than it is today. This is a hard problem to mitigate even with documentation on the checks we currently have. However, the nice thing about the Gate interface is that any team can easily add a new gate.
Next we will work with the remaining teams to determine what additional checks or changes are needed to get them on AutoDeployer.
Results
We have observed a great number of benefits from automatic deployments. One of the biggest is that the number of commits per production deploy has decreased significantly, from around 3 to around 1.4. Fewer commits per deploy means changes in production are more predictable and easier to monitor.
Other benefits are faster feedback loops on new changes, and that engineers can continue working uninterrupted after their PR is merged, knowing their change will safely make it into production. Teams used to deploy once or twice a day in large trains that could take a couple of hours to complete, whereas now teams deploy multiple times a day, with every merge. Faster feedback loops mean less time between merge and code in production, now around 45 minutes, so engineers still have good context on the changes in recent production deploys in case they need to respond to issues.
Future Work
We intend to develop more Deploy Gates to improve the safety of deployments. The primary ones are a set of gates that monitor anomaly-detection metrics, which could catch issues that are difficult to alarm on, for example regressions in various business metrics.
Now that we have reduced our dependency on Jenkins by replacing the Jenkins pipeline view with Lyft’s DeployView, we are evaluating our deploy job execution system and plan to replace it in the near future. This will give us more flexibility and control over the deploy executions themselves, as well as integration opportunities, such as with Lyft’s Clutch platform. Clutch is already used by developers at Lyft, so integrating with it will better unify and streamline their developer experience.
….
If you are interested in working on auto deployment or similar infrastructure problems, refer to our careers page.
Special thanks to all the people that contributed to the CD system over the last few years and this blog post: Jerod Bury, Elliott Carlson, Yiwen Gao, Garrett Heel, Mun Yong Jang, Tuong La, Daniel Metz, Sendhil Panchadsaram, Katherine Pioro, Frank Porco, Arvind Subramanian, Patrick Sunday, and Tom Wanielista