The Journey to Cloud Development: How Shopify Went All-in on Spin

Shopify has grown exponentially in recent years. Throughout that growth, dev, a homegrown Ruby program, was there to support complex local development environments running on developers’ laptops. Its core responsibility is to ensure that a developer is ready to start coding with minimal manual intervention on their first day. Our most complex projects take around an hour to configure on first instantiation and mere minutes to synchronize with the latest changes.

The dev tool is supported by a lightweight virtual machine, called railgun, that runs on xhyve. This virtual machine runs an application’s supporting processes (MySQL, Redis, and so on), which lets us partition the things a developer might work on from the things they’re unlikely to ever touch. It also gives us some degree of control over the supporting processes that an application relies on.

We, the Environments team, are the developers of this tool. Our core responsibility at Shopify is to make sure that all developers have performant and useful development environments. We meet this responsibility by providing tools that move repeatable tasks out of developers’ day-to-day work and into common, easy-to-reproduce development environments. Our team is part of a larger group that seeks to automate the entire developer workflow, from coding to testing to deployment.

We were incredibly proud of what we had accomplished with dev, but Shopify had outgrown it. Running a local instance of our majestic monolith was spinning fans and spooling swapfiles. Projects at Shopify were becoming more complex with more moving parts. Laptops were melting.

Nothing fit in the box anymore, so it was time to find a new one. This post is about where we looked and how we got there.

Re-Assessing Our Needs

Everything around us at Shopify had slowly gotten more complex. Projects had grown to include many related repositories organized into multiple running services. Developers were working in interlocking teams on horizontal concerns that cut across repositories. All of this was organized around a cohort of majestic monoliths. We felt this change in the air, so we started talking with the teams running some of the most majestic of those monoliths.

Our existing environment management tool, dev, is capable of configuring and running additional repositories alongside the one a developer is working in. We call these integrations. Since a laptop is a limited, inelastic environment, dev has certain checks that cause minor friction when a developer introduces a new integration into a repository’s environment. The friction is there to encourage the developer to have a discussion with their team about whether the integration is necessary and, if it is, whether it needs to run by default. We noticed more and more conversations bumping up against this point of friction. We came to this realization by first observing what developers were doing when they asked for help in our support forums. Further engagement with specific developers led us to meet with a few representative groups that had been struggling, or that had developed their own tooling going beyond what dev offered them. Developers needed more services to start up alongside a repository to get their work done, but those same services were impacting other developers who never needed to interact with them.
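To make the idea concrete, here’s a minimal sketch of how an integration might be declared. The dev.yml schema is internal to Shopify and simplified here, and the repository names are hypothetical; the point is that each extra entry is another set of processes competing for laptop resources.

```yaml
# Hypothetical, simplified dev.yml for a repository named "storefront".
# The exact keys are internal to dev; this sketch only illustrates the shape.
name: storefront
up:
  - ruby: 2.7.5          # runtime declared for this repository
integrations:
  - shipping-service     # another repository cloned and run alongside this one
  - payments-service     # each added integration costs more laptop CPU and memory
```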

This had an impact. Developers working against these monoliths were running out of resources and their code-build-test loop was sluggish. Times to ship were stretching from hours into days into weeks.

At Shopify, we maintain a strong culture of “tophatting” (the etymology here is partially lost to the ages but has something to do with the tophat 🎩 emoji): validating our collaborators’ changes in running applications. There’s always been a certain tension around this process in our larger monoliths. In order to validate a collaborator’s changes, a developer had to replicate the environment in which the work was done. For unmerged changes, this meant that a developer agreeing to tophat needed to set aside not only the changes they had been working on, by stashing or committing them, but also the state of the application itself.

There are many dimensions to the state of an application, but for large Rails monoliths, it all boils down to the state of the database. Developers who needed to tophat a change had to unwind the migrations they had made for their own work and apply those of their collaborators. As the number of integrations for a repository grew, so did the complexity of unwinding state. Migrations across multiple databases and caches needed to be unwound, and new data needed to be created to validate a change. Once a developer completed tophatting, they needed to get back to where they had been. If a change needed updating, the process began again. Developers were often just abandoning state and spending extra time regenerating data for their environments.

Inter-repository complexity was also growing. Developers were implementing features that spanned multiple repositories, often requiring changes from multiple feature branches to be in play. Developers were devising homegrown solutions to this explosion of possible combinations. Some were running multiple copies of their monoliths and accepting the resource drain. Others were finding ways to run pseudo-staging versions of the monoliths in the cloud on behalf of their team.

Understanding the Complexity

Given the problems of inter-service, inter-repository, and inter-developer complexity, alongside the increases in resource usage and time-to-ship that we observed, it was clear that something needed to change. It was immediately obvious that the eventual solution would need to introduce significantly more elasticity into developers’ daily workflow. We didn’t know exactly how to do this, but we suspected the answer lay somewhere in the cloud. We did realize that development environments needed to scale just as well as our production applications. We also realized that production applications aren’t development-friendly environments, and that this is by design: changing code and restarting processes, typical activities during development, aren’t things that should be easy in a production environment.

At Shopify, we find that tightly-scoped experiments that we can implement quickly are one of the most efficient means to crystallize a map of concepts into something more actionable. Knowing that our destination lay somewhere in the cloud, we were able to start drawing a rough map in our minds that would permit controlled experimentation around cloud concepts.

The two most notable of our early experiments were providing an automatically configured local Kubernetes cluster and giving developers an easy way to create Google Compute Engine (GCE) VMs.

At the beginning of the experiments, we imagined that a local Kubernetes cluster would be the most likely answer to the problem. Our production services are ultimately deployed to Kubernetes, so many developers at Shopify have some familiarity with it. We felt that having Kubernetes in play throughout the development lifecycle would align well with how developers were already working.

We imagined a moment in the future where developers would run the code they were changing on the local Kubernetes cluster. They might run supporting integrations on nearby development Kubernetes clusters running on GCP. We imagined that developers would move running code between the local and remote clusters at will as they needed to make changes. Being able to move containers between different execution clusters seemed like an excellent lever that permitted a developer to offload resource consumption, at will, as needed.

Reality always cuts to the truth of the matter. Not many developers were interested in the idea. We had some eager early adopters (Kubernetes aficionados and teams already using Kubernetes locally), but the idea failed to gain organic traction amongst the larger developer community at Shopify.

Our itinerary on this journey would be set by the other experiment: tooling to create GCE VMs.

Experiment, Learn, Improve, Repeat: The Journey to Cloud Native

Our initial debates on this situation led to a single, solid conclusion. We would not solve this problem on the first try. We would need to explore the problem, in situ, alongside users. Anything that we would offer them would need to be evaluated in their context by observing how they reacted to the tools we would provide. Rather than building and shipping a product, we would need to build a framework for ongoing exploration that would be open to iteration and feedback. We would need to build a sort of development propulsion laboratory that would react and evolve rather than building the next, incrementally improved, rocket ship. 

Early Indicators of Success: GCE VMs

There was no grand design behind our provisioning of GCE VMs. As an experiment, we added a command to our local environment tooling (dev) that allowed developers to create a GCE VM. The only extra automation we added was copying a developer’s GitHub credentials so that they could clone repositories. Once the VM was created, developers were on their own to customize the machine as they saw fit.

To our surprise, several teams started using these VMs as their daily working environment. One advantage that these developers had is that their repositories were generally self-contained and well-appointed with their own repository-specific automation. Another was that the developers on these teams were very familiar with Linux (some even preferred it) and were able to perform basic sysadmin tasks to get their VMs polished the way that they needed.

With this small, early success, we started planning what to do next. The most obvious next step was to port dev to Linux, which would allow developers less familiar with Linux to run the same automation they’d been accustomed to on their Macs. We started this work in early 2020, but other concerns took over the team’s focus, so we left things as they were and continued to observe how developers used their VMs.

Taking Dev Environments to Kubernetes Pods

As the summer closed, we realized that there was more to this story. The small collection of developers using the VMs that we provided had socialized the ideas we were working on. Our community wanted more, but we had nothing to offer. Laptops were still melting, development environments were still sluggish. Technical leadership suggested that we explore whether there was a broader potential to the idea. Acting on these forces, we assembled a small team of Staff developers with deep experience in development tooling and environments.

The first thing this team established was that we didn’t want to be managing VMs. We wanted a more dynamic relationship between compute power and storage: the ability to preserve the state of a developer’s environment while scaling the compute up and down as necessary, depending on the size of the developer’s project. We also wanted to avoid managing that scaling ourselves.

We assumed, based on earlier observations, that there would be no appetite in the community for every developer to become a sysadmin of their own development box. We would need to provide the automation to bootstrap projects that would previously have been provided by dev.

Before designing a solution to meet these constraints, we revisited our earlier work porting dev to Linux and decided not to follow through with it. The mood on the team was that we needed to think differently about development environments: if we just ported dev to Linux, we would lose a rare opportunity to completely reconsider the fundamental architecture of a developer’s working environment.

To solve for these constraints, we opted to implement developers’ environments as pods running on Kubernetes. This allowed us to build the common dependencies of most repositories into a large base Docker image. For each project repository, we defined an inheriting container image that added repository-specific dependencies.

When a developer instantiated a development environment, a Kubernetes pod was started. This host container held git clones of all of the repositories necessary for running the project. Each repository in this collection (which came to be referred to as the workspace) also defined a docker-compose.yml that specified how the related custom repository containers and supporting services should run together. The composition was started on the host container, with the cloned code volume-mapped into the appropriate repository containers holding the correct dependency set.
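As a rough sketch of that arrangement, a repository’s docker-compose.yml might have looked something like the following. The image and service names are hypothetical and the real compositions were repository-specific; the point is that the host container ran the composition and mapped its git clone into the repository container.

```yaml
# Hypothetical docker-compose.yml for one repository in the workspace.
# Started on the host container, which volume-maps its git clone into
# the repository container built from the inheriting image.
services:
  storefront:
    image: gcr.io/example/storefront-dev:latest   # inherits from the shared base image
    volumes:
      - /workspace/storefront:/app                # code cloned on the host container
    ports:
      - "3000:3000"
    depends_on:
      - mysql
      - redis
  mysql:
    image: mysql:8.0
    environment:
      MYSQL_ALLOW_EMPTY_PASSWORD: "1"             # development-only convenience
  redis:
    image: redis:6
```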

Developers were able to SSH into the host container, make changes, and restart a subset of the Docker composition to apply the changes. We had a customized nginx configuration on the host container that mapped in requests from a GCP ingress to the appropriate container in the composition. To protect access, developers were required to run a VPN that allowed them to route via the ingress we had defined.

The team was very comfortable with this approach. To be honest, it was an improved version of what we already did in continuous integration (CI), so understanding it was very easy for us. In many ways, this iteration of the project could be called CI with a shell. Unlike in CI, though, code would be changing, and developers would need to restart the containers that ran the code they had changed. To meet this need, we added some basic scripting in the host container that emulated the most-used dev commands, letting developers do things like restart their application process. We shipped this iteration to a group of early adopters in the fall of 2020 to see what they thought of it.

Going Cloud Native to Reduce Cognitive Load

The first iteration of Spin wasn’t well received by early adopters. Everyone understood the goal we wanted to achieve, but they felt the result was awkward. The scripting we’d implemented to emulate dev commands was only superficially similar to dev itself, and the subtle differences caused confusion.

Early adopters found the dual context (edit on the host container, run in Docker Compose) confusing. To us, it seemed developers perceived the boundary between the host container and the application containers much as they might perceive the boundary between development and production. This separation seemed to invite a form of ceremony that we hadn’t intended. For example, developers kept asking why they couldn’t run unit tests in the host container before restarting the application container. Having to edit code in one context and run it in another forced developers to be conscious of an activity that had been habitual.

Given this input, we decided that this question of contexts should drive our next iteration. Our first step was to eliminate the idea of a host container by leaning into cloud native concepts. We redefined the workspace as a collection of Kubernetes pods, each running the processes of a single repository container. The supporting services for the repository’s processes would still run in each repository container as a Docker composition. This workspace was deployed into a specific Kubernetes namespace reserved for the individual developer. This allowed us to define a simple networking scheme between the pods where each repository process could easily address the other components of the workspace.
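Expressed in Kubernetes terms, one repository in a developer’s workspace might look roughly like the manifests below. This is a sketch with hypothetical names and ports; in practice our tooling generated and deployed the real resources. A Service per repository is what gives each process a simple, stable address inside the developer’s namespace.

```yaml
# Hypothetical manifests for one repository in a developer's namespace.
apiVersion: v1
kind: Pod
metadata:
  name: storefront
  namespace: spin-jane            # a namespace reserved for one developer
  labels:
    app: storefront
spec:
  containers:
    - name: storefront
      image: gcr.io/example/storefront-dev:latest   # repository-specific image
      ports:
        - containerPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: storefront
  namespace: spin-jane
spec:
  selector:
    app: storefront
  ports:
    - port: 3000                  # other pods reach this repository at storefront:3000
```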

In order to change code, developers would connect to each pod individually over SSH or using Visual Studio Code remote access. Behind the scenes, specialized scripts ensured that the processes running in each pod would reload when changes were made. Within the pod and via our CLI tooling, developers had the ability to manage the running repository processes. Unlike the previous iteration, every repository process was running in the same context as that where developers made changes.

By presenting a less complex environment in each running container, one where developers could use normal Linux tooling, we gave developers a much better understanding of the context in which their repository processes were running. This gave us enough confidence in the solution to make it generally available and start accepting users en masse.

We were really pleased with this iteration of the solution. It represented the culmination of several ideas that the team had had over the previous few years. We had regularly wondered whether an unabashedly cloud native development environment would be effective for developers at Shopify. Due mostly to history, the way repositories configured their development environments tended to deviate significantly from how the same processes were deployed to production. We had long felt that strong encouragement for every repository to follow solid cloud native practices (for example, 12-factor) might alleviate some of this ongoing tension. By breaking the workspace into individual pods for each repository, developers were strongly encouraged to make their development configuration follow a cloud native style.

We imagined development environments with seamless transitions between local and cloud. A developer might run an application locally, in a container, while doing significant surgery on it. This same container, when the changes were finished, could be shipped off to a nearby development cluster to be available to the rest of the team for collaboration. Teams could collaborate by shipping in-development container images to each other to run in their own array of development containers.

Lowering the Barrier to Entry

Our cloud-native iteration of Spin saw broad uptake among developers. Our observations suggested that Spin had become the development environment of choice for many of them, which gave us the opportunity to observe it in use across a variety of personal styles. We came to understand which aspects developers grasped easily and which presented challenges. Most of the challenges developers faced fell into two categories: understanding the infrastructural relationship between the different pods that made up their environment, and learning how to configure a repository to work in the environment.

During this period, the team came to the realization that the new system we had implemented resembled the way we configured local environments. Each repository had come to contain a Dockerfile that configured the pod where the repository’s processes would run; this generally mapped to dev’s own configuration file, dev.yml. The docker-compose.yml that specified a repository’s supporting services had come to resemble the railgun configuration file. The awkward fact of this emergent symmetry was that a Dockerfile and a docker-compose.yml are strictly lower-level abstractions than the developer-facing abstractions that had emerged over our years of working on dev. For example, to add ruby@2.7.5 to their development environment using dev, a developer simply lists it in their dev.yml. Using a Dockerfile, developers were required to write the fine-grained steps needed to download, unpack, and install a Ruby version.
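To illustrate the gap in abstraction level, here’s a simplified sketch of the dev side of that comparison (the real dev.yml schema is internal to Shopify, so take the keys as approximate). The Dockerfile equivalent of this single entry is a series of imperative steps to fetch, build, and install the interpreter, plus whatever caching and cleanup the image needs.

```yaml
# Hypothetical, simplified dev.yml: one declarative entry asks for a Ruby version,
# and dev resolves how to install and activate it.
up:
  - ruby: 2.7.5
```

That kind of declarative, developer-facing entry had taken years of iteration on dev to refine, and it’s exactly what a raw Dockerfile doesn’t provide.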

Around this time, Shopify CEO Tobi Lütke met with the team to offer his opinion on the current iteration of the project and to understand our vision for the long term. He perceived that developers were required to understand too much about how the project was implemented and how the running parts were orchestrated. He advised us that the team’s role was to create abstractions that permitted developers to defer their understanding of development environment construction until they were curious about it. For example, no developer is required to deeply understand the ruby interpreter in order to write Rails applications. The same should be true about development environments.

Simplifying and Streamlining with Isospin

Given this new understanding, our new goal was simplification and abstraction. We decided that developers should be presented with something that feels very similar to the laptop in front of them. Therefore, we started working on a laptop in the cloud that could, where possible, abstract away the configuration steps needed to run the project.

While moving in this new direction, rather than immediately discard the previous Kubernetes-based solution, we decided to continue accepting new projects onto it and to support existing projects using it. While this did slow down development of the revised solution, it allowed us to observe the needs of users and integrate those needs into the new work. This observation shaped the new iteration, which began to orient itself around the classes of tasks developers needed to perform to get their projects running in development.

What emerged from this process was a Linux distro with customized tooling that we called Isospin. We took what we had learned during the previous iterations and translated it into a collection of tooling orchestrated by systemd, implemented as a tree of systemd units triggered when the developer’s instance boots. The lowest units in the tree clone the repositories. Their immediate dependents trigger scripts that classify and then configure the cloned repositories. During configuration, the scripts first make changes to the environment based on a repository’s discovered needs. Once the repository is configured, the scripts generate additional systemd units that start the needed services (MySQL, Redis, etc.) or the repository’s own processes.
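As a sketch of one link in that unit tree (with hypothetical unit and script names, since the real units and scripts are internal to Isospin), a templated configuration unit might depend on a clone unit like this:

```ini
# Hypothetical /etc/systemd/system/spin-configure@.service
# One instance per cloned repository (%i). It runs only after the
# corresponding clone unit succeeds, classifies and configures the
# repository, and generates further units for its services and processes.
[Unit]
Description=Classify and configure the %i repository
Requires=spin-clone@%i.service
After=spin-clone@%i.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/local/bin/spin-configure %i

[Install]
WantedBy=multi-user.target
```

Whether or not template units are the exact mechanism used, the shape is the same: small units connected by ordering and dependency relationships, with later units generated at configuration time to start MySQL, Redis, or the repository’s own processes.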

Effectively, the systemd tooling we wrote replicated the partitioning concepts of Docker Compose or Kubernetes (using the same cgroup technology) without forcing a developer’s project to be spread across multiple pseudo-machines. This allowed developers to work on everything in place and avoid managing multiple contexts while working.

This new iteration also liberated our team. We were once again able to focus on our core skill set: writing CLI tools to accelerate developers’ code-test loop. We could spend far less time writing Kubernetes orchestration logic. Most importantly, we had accomplished our core goal: to change the abstraction. Now, developers had a single place to work and they didn’t need to understand the infrastructure that we maintained for keeping that place running.

At this point, the previous iteration of the project would be labeled Spin Legacy and the new solution would be called Isospin. Once we had confidence that Isospin could support our existing users, we began a gradual process of migrating projects from Legacy to Isospin. By January 2022, this process was complete.

Lessons Learned Along the Way

After migrating all of our users to the new Isospin environment, we realized that we were at a moment in the journey where we could stop to reflect on the work we had done over three major releases of our cloud-based environments. We came to two key realizations:

1. It’s better to be first than last. Historically, dev has arrived last on a developer’s Mac. Before it’s installed, Apple and the user have had the opportunity to constrain and configure the environment. Because this has always been the case, a lot of our work on dev has gone into dealing with the circumstances we arrive in. By taking control of the OS and configuring it to meet our needs (and, by extension, the developer’s needs), we gain a significant amount of freedom, and that freedom has forced us to reconsider many assumptions we’d previously treated as immovable objects. Most significantly, we no longer have to work around awkward constraints that are only tangentially related to development work; now, all the constraints are ours to define. A simple example: we don’t have to handle the older stable Ruby (or the lack of one) that Apple may have installed, and we don’t need to remove non-standard Ruby versions the developer may have introduced in anticipation of starting to work at Shopify. Instead, we install a common set of Ruby versions that are in use across projects.

2. A development environment is an application. During the development of Isospin (and to some extent before), the team searched for a good noun to refer to the collection of repositories in a workspace and, later, on an Isospin instance. We needed an abstraction that communicated a greater-than-the-sum-of-its-parts message. What evolved was the constellation: a collection of repositories configured under the assumption that they’ll work together to form a development environment. Upon adopting this noun, we realized that a constellation is, in fact, an application unto itself. Development environments, then, are a strange form of application programming, and rather than providing scripts and automation, we should be providing an application framework or platform for building this type of application.

The Journey Isn’t Over Yet

In the firm tradition of our team, the current Isospin implementation is very, very scrappy. It needs tons of polish, but of all the iterations in our history, it has proven to be the best at supporting developers working on our majestic monoliths. A Spin instance now has the feel of a little Linux box that developers can tinker with. We’re now evolving the concepts we arrived at during our period of reflection to take this scrappy solution and build it into the framework for developing development environment applications that it needs to be.

We'll be bringing an old friend along for the ride. During the development of Isospin, we realized that we were just writing a variant of dev. When we looked back, we realized that's what we were doing all along. The basis of our framework will be built on this venerable tool.

Dev won't be alone on this trip. We've made some new friends along the way. We've realized that dev requires developers to be explicit about their needs (which libraries, which runtime versions, etc). Developers can build environments faster when we infer the majority of their needs from the contents of their repository. We'll extract the dependency inference from the current Isospin to make dev far better at guessing the needs of a repository.

We continue to have some concerns about baggage we've accumulated in our supporting infrastructure. We still run instances as Kubernetes pods. This builds an uneasy tension between ephemerality and persistence that tends to degrade the trust of developers if pods are relocated. A large part of the next phase of the journey will be to find the balance in this tension.

Overall, this has been a crazy journey for the team. Trying to build generalized development environments and contribute to a common developer infrastructure, at scale, has been some of the most meaningful work of our careers. We continue to be grateful that Shopify has realized that everyone wins when a small team takes a moment or two to figure out how to build the booster rockets that accelerate tons of impactful developers.

Don Kelly is a Production Engineering Manager on Shopify's Developer Infrastructure team.

If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by design.
