Spin Cycle: Shopify’s SFN Team Overcomes a Cloud-Development Spiral

You may have read about Spin, Shopify’s new cloud-based development tool. Instead of editing and running a local version of a service on a developer’s MacBook, Shopify is moving towards a world where development servers are available on demand as containers running in Kubernetes. When using Spin, you don’t need anything on your local machine other than an SSH client and VS Code, if that’s your editor of choice.

By moving development off our MacBooks and onto Spin, we unlock the ability to easily share work in progress with coworkers and can work on changes that span different codebases without any friction. And because Spin instances are lightweight and ephemeral, we don’t run the risk of messing up long-lived development databases when experimenting with data migrations.

Across Shopify, teams have been preparing and adjusting their codebases so that their services can run smoothly in this kind of environment. In the Shopify Fulfillment Network (SFN) engineering org, we put together a team of three engineers to get us up and running on Spin.

At first, it seemed like the job would be relatively straightforward. But as we started doing the work, we began to notice some less obvious forces at play that were pushing against our efforts.

Since it was easier for most developers to use our old tooling instead of Spin while we were working out the kinks, developers would often unknowingly commit changes that broke functionality we’d just enabled for Spin. In hindsight, the process of getting SFN working on Spin is a great example of the kind of hidden difficulty in technical work that has more to do with human systems than with getting bits of electricity to do what you want.

Before we get to the interesting stuff, it’s important to understand the basics of the technical challenge. We'll start by getting a broad sense of the SFN codebase and then go into the predictable work that was needed to get it running smoothly in Spin. With that foundation, we’ll be able to describe how and why we started treading water, and ultimately how we’re pushing past that.

The Shape of SFN

SFN exists to take care of order fulfillment on behalf of Shopify merchants. After a customer has completed the checkout process, their order information is sent to SFN. We then determine which warehouse has enough inventory and is best positioned to handle the order. Once SFN has identified the right warehouse, it sends the order information to the service responsible for managing that warehouse’s operations. The state of the system is visible to the merchant through the SFN app running in the merchant’s Shopify admin. The SFN app communicates with Shopify Core via the same GraphQL queries and mutations that Shopify makes available to all app developers.

At a highly simplified level, this is the general shape of the SFN codebase:

Figure: SFN’s monolithic Rails application with many external dependencies. SFN is a large rectangle in the center containing six boxes labelled Subcomponent; six arrows connect it to outside boxes labelled Dependency.

Similar to the Shopify codebase, SFN is a monolithic Rails application divided into individual components owned by particular teams. Unlike Shopify Core, however, SFN has many strong dependencies on services outside of its own monolith.

SFN’s biggest dependency is on Shopify itself, but there are plenty more. For example, SFN does not design shipping labels, but does need to send shipping labels to the warehouse. So, SFN is a client to a service that provides valid shipping labels. Similarly, SFN does not tell the mobile Chuck robots in a warehouse where to go; we are a client of a service that handles warehouse operations.

The value that SFN provides is in gluing together a bunch of separate systems with some key logic living in that glue. There isn't much you can do with SFN without those dependencies around in some form.

How SFN Handles Dependencies

As software developers, we need quick feedback about whether an in-progress change is working as expected. And to know if something is working in SFN, that code generally needs to be validated alongside one or several of SFN’s dependencies. For example, if a developer is implementing a feature to display some text in the SFN app after a customer has placed an order, there’s no useful way to validate that change without also having Shopify available.

So the work of getting a useful development environment for SFN with Spin appears to be about looking at each dependency, figuring out how to handle that dependency, and then implementing that decision. We have a few options for how to handle any particular dependency when running SFN in Spin:

  1. Run an instance of the dependency directly in the same Spin container.
  2. Mock the dependency.
  3. Use a shared running instance of the dependency, such as a staging or live test environment.
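
The three options above amount to a per-dependency decision that could be written down as data. The following is only a minimal sketch of that idea, not how SFN or Spin actually configures anything: the dependency names, the `Strategy` enum, the `endpoint_for` helper, and every URL are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strategy(Enum):
    LOCAL_INSTANCE = "local_instance"    # option 1: run it in the same Spin container
    MOCK = "mock"                        # option 2: replace it with a mock
    SHARED_INSTANCE = "shared_instance"  # option 3: point at a shared staging/test instance


@dataclass(frozen=True)
class Dependency:
    name: str
    strategy: Strategy
    shared_url: Optional[str] = None  # only meaningful for SHARED_INSTANCE


def endpoint_for(dep: Dependency) -> str:
    """Return the address a development environment would use for this dependency."""
    if dep.strategy is Strategy.LOCAL_INSTANCE:
        return f"http://localhost/{dep.name}"
    if dep.strategy is Strategy.MOCK:
        return f"mock://{dep.name}"
    # Remaining case: a shared instance, which cannot work without a URL.
    if dep.shared_url is None:
        raise ValueError(f"{dep.name}: a shared instance needs a URL")
    return dep.shared_url


# Hypothetical plan: the article does not say which option was chosen for which dependency.
PLAN = [
    Dependency("shopify-core", Strategy.LOCAL_INSTANCE),
    Dependency("shipping-labels", Strategy.MOCK),
    Dependency("warehouse-ops", Strategy.SHARED_INSTANCE,
               "https://warehouse-ops.staging.example"),
]

for dep in PLAN:
    print(f"{dep.name}: {dep.strategy.value} -> {endpoint_for(dep)}")
```

Writing the plan down this way makes the scale of the work visible: every dependency needs an explicit entry, and each entry is something that can later go stale.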

Given all the dependencies that SFN has, this seems like a decent amount of work for a three-person team.

But this is not the full extent of the problem—it’s just the foundation.

Once we added configuration to make some dependency or some functional flow of SFN work in Spin, another commit would often be added to SFN that nullified that effort. For example, after we got a particular flow functioning in a Spin environment, the implementation of that flow might be rewritten with new dependencies that were not yet configured to work in Spin.

One apparent solution to this problem would be simply to pay more attention to what work is in flight in the SFN codebase and better prepare for upcoming changes.

But here’s the problem: It’s not just one or two flows changing. Across SFN, the internal implementation of functionality is constantly being improved and refactored. With over 150 SFN engineers deploying to production over 30 times a day, the SFN codebase doesn’t sit still for long. On top of that, Spin itself is constantly changing. And all of SFN’s dependencies are changing. For any dependencies that were mocked, those mocks will become stale and need to be updated.

The more we accomplished, the more functionality existed with the potential to stop working when something changed. And when one of those regressions occurred, we had to interrupt our work on the current dependency in order to keep a previously solved flow functioning. The tension between making improvements and maintaining what you’ve already built is central to much of software engineering. Getting SFN working on Spin was just a particularly good example.

The Human Forces on the System

After recognizing the problem, we needed to step back and look at the forces acting on the system. What incentive structures and feedback loops are contributing to the situation?

In the case of getting SFN working on Spin, changes were happening frequently and those changes were causing regressions. Some of those changes were within our control (e.g., a change went into SFN that wasn’t configured to work in Spin), and some were less so (e.g., Spin itself changing how it uses certain inputs).

This led us to observe two powerful feedback loops that could be happening when SFN developers are working in Spin:

Figure: Two feedback loops, the Loop of Happy Equilibrium and the Spiral of Struggle.

If it’s painful to use Spin for SFN development, it’s less likely that developers will use Spin the next time they have to validate their work. And if a change hasn’t been developed and tested using Spin, maybe something about that change breaks a particular testing flow, and that causes another SFN developer to become frustrated enough to stop using Spin. And this cycle continues until SFN is no longer usable in Spin.

Alternatively, if it’s a great experience to use and validate work in Spin, developers will likely want to continue using the tool, which will catch any Spin-specific regressions before they make it into the main branch.

As you can imagine, it’s very difficult to move from the Spiral of Struggle into the positive Loop of Happy Equilibrium. Our solution is to try our best to dampen the force acting on the negative spiral while simultaneously propping up the force of the positive feedback loop. 

As the team focused on getting SFN working on Spin, our approach was to be very intentional about where we spent our efforts while asking the other SFN developers to endure a little pain and pitch in as we went through this transition. The SFN-on-Spin team narrowed its focus to getting SFN to a basic level of functionality on Spin so that most developers could use it for the most common validation flows, and we prioritized fixing any bugs that disrupted those areas. This meant explicitly not working to get all SFN functionality running on Spin, but just enough that we could manage upkeep. At the same time, we asked other SFN developers to use Spin for their daily work, even though it was missing functionality they needed or wanted. Where they felt frustration or saw gaps, we encouraged and supported them in adding the functionality they needed.

Breaking the Cycle

Our hypothesis is that this is a temporary stage of transition to cloud development. If we’re successful, we’ll land in the Loop of Happy Equilibrium where regressions are caught before they’re merged, individuals add the missing functionality they need, and everyone ultimately has a fun time developing and feels confident about shipping their code.

Our job seems to be all about code and making computers do what we say. But many of the real-life challenges we face when working on a codebase are not apparent from code or architecture diagrams. Instead they require us to reflect on the forces operating on the humans that are building that software. And once we have an idea of what those forces might be, we can brainstorm how to disrupt or encourage the feedback loops we’ve observed.

Jen is a Staff Software Engineer at Shopify who's spent her career seeking out and building teams that challenge the status quo. In her free time, she loves getting outdoors and spending time with her chocolate lab.

