This blog post discusses the strategies that Slack uses to manage the lifecycle (development, support, and eventual retirement) of infrastructure projects, through the lens of our migration across three successive internal “platform” offerings.
Circa 2020, our Cloud Engineering team (now evolved into multiple teams responsible for narrower aspects) was responsible for managing our Infrastructure-as-a-Service providers, compute environments like Chef and Kubernetes, and a large variety of related systems.
This broad remit, often with multiple systems and technologies serving overlapping purposes, caused several problems for us.
These overlaps meant we would spend valuable time recreating new features in multiple projects, instead of writing new functionality just once.
Retiring older technologies to focus on newer ones was challenging, because these older technologies still served important responsibilities, and we did not have a clear process for migration of those responsibilities to newer technologies.
We also could not easily and consistently communicate with our peers about what they could expect if they chose a specific technology: Would we support them? Address bugs they found? Surprise them with a “deprecation” warning a week after they went live?
Not only did we lack clear answers to these questions, but the unclear answer a peer received would vary depending on whom on the team they asked.
Our peers didn’t know which technologies they should — or could — use.
Goals of a technology lifecycle
We set out to develop a model to help us track and talk about the technologies we supported, with a few goals in mind:
We wanted to standardize the technology status and the expectations that we communicated to our peers, both across the people they talked to and across the technologies in our inventory.
If we had two technologies that were slated for retirement, we wanted to handle incoming requests for them in the same way, and have concrete and executable plans for making forward progress to retire them.
Without a framework to pin our technology status to, when our peers asked for advice about which technology they should select, our answers tended to be based on opinion: whether a choice was “bad” or “good” from the perspective of the answerer.
We wanted instead to be able to concisely communicate important information about a technology — how we support it, what its future is, whether it is production ready or not, and what our peers might expect if they adopt it and then have feature requests or bug reports.
This way, our customers could make their own decisions with factual information that’s important to their own planning, rather than hard-to-weigh value judgments, or scattered and inconsistent factoids.
Autonomy and control
Finally, the lifecycle would be a tool for us to retire technologies that we currently managed but for which there were (or would soon be) better options.
Ultimately, our ability to fully retire a technology depends on the engineering teams that depend on it choosing to migrate off of it. Our peers in Engineering have the same full plates we do and the same planning cycles and constraints. So, by ensuring that the lifecycle is designed with these constraints in mind, we could be certain that as we move a technology through the lifecycle, our peers have ample time and forewarning to plan their own migrations or deprecations.
Our project lifecycle has six stages. We expect most technologies to proceed through all six stages in the fullness of time, though many might skip stages for reasons discussed below.
In each stage, we document what our peers can expect for:
- Acceptance of new customers (“not yet”, “if you’re brave”, “yes, please!”, “please don’t”, “prohibited”)
- Feature requests (do we accept them and how we prioritize them or reserve resources for them)
- Bug reports (do we accept them, and how we prioritize them or reserve resources for them)
- Security and compliance reports (we always accept them, but at some stages, the resolution is to shut the service down rather than fix forward)
- Documentation quality (how up to date and accurate you can expect documentation to be at a given stage)
- Rules about progression through the lifecycle (e.g., Active systems will never go straight to Retirement)
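As a rough illustration, the bug-report and feature-request expectations from the stage descriptions that follow could be recorded as plain data, so that every team member gives the same answer. This is only a sketch: the names `Handling`, `StageExpectations`, `EXPECTATIONS`, and `describe` are hypothetical and not part of any Slack tooling, and Security/Compliance reports are left out because they are always accepted.

```python
from dataclasses import dataclass
from enum import Enum


class Handling(Enum):
    """How a request is treated at a given stage (labels are illustrative)."""
    NORMAL = "prioritized according to normal procedures"
    LOW_PRIORITY = "classified as low priority"
    MAY_DECLINE = "may be summarily declined"
    NOT_ACCEPTED = "not accepted"


@dataclass(frozen=True)
class StageExpectations:
    bug_reports: Handling
    feature_requests: Handling


# Values paraphrase the stage descriptions in this post; the encoding is an assumption.
EXPECTATIONS = {
    "Alpha": StageExpectations(Handling.MAY_DECLINE, Handling.MAY_DECLINE),
    "Beta": StageExpectations(Handling.LOW_PRIORITY, Handling.LOW_PRIORITY),
    "Active": StageExpectations(Handling.NORMAL, Handling.NORMAL),
    "Maintenance": StageExpectations(Handling.NORMAL, Handling.MAY_DECLINE),
    "Deprecated": StageExpectations(Handling.LOW_PRIORITY, Handling.MAY_DECLINE),
    "Retired": StageExpectations(Handling.NOT_ACCEPTED, Handling.NOT_ACCEPTED),
}


def describe(stage: str) -> str:
    """Return a one-line summary of what a peer can expect at `stage`."""
    e = EXPECTATIONS[stage]
    return (
        f"{stage}: bug reports are {e.bug_reports.value}; "
        f"feature requests are {e.feature_requests.value}"
    )


if __name__ == "__main__":
    for name in EXPECTATIONS:
        print(describe(name))
```

Keeping the answers in one table like this is one way to make the response independent of who happens to be asked.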
Technologies are assigned to a stage primarily through team consensus — we agree on: whether a system is feature complete and sufficiently bug free for broad use; whether we have the resources to provide support for it at a corresponding level; or in later stages, whether there is a feature-complete replacement and if we’ve provided sufficient communication about retirement.
Movement into and through later stages includes dialogue with the internal users of a system, for their awareness and to ensure that technologies we think are a viable replacement do in fact meet their needs.
Without further ado, the following stage descriptions are taken directly from our internal documentation, with some minor tweaks to wording or genericization of service names.
Stage 1: Alpha
An experiment. A new project or system that Cloud Engineering is working on. A pre-prototype, something that will very likely be discarded, but the “v2” might be Beta.
General use is strongly discouraged.
No support is provided. We reserve the right to summarily decline any bug reports or feature requests for Alpha projects, other than Security/Compliance Bugs.
Alpha projects will either advance to Beta or Retired, depending on the outcome of the experiment.
Examples of Alpha projects:
- Hack day projects.
- Work undertaken to refine our understanding for future planning.
Stage 2: Beta
A prototype: a technology that Cloud Engineering is building with the intent that it become an Active technology, but that is not yet ready for general consumption.
Self-service production use is strongly discouraged. Limited production use in close consultation with Cloud Engineering. Available for non-production use to brave friends and early adopters.
Supported only for the purpose of gathering feedback to advance the project. Documentation is likely sparse to nonexistent; Slack discussions are likely the best source of information. Any requests (other than Security/Compliance) related to Beta systems will be classified as low priority and/or scheduled on the project roadmap.
Beta systems are on track to become Active. However, it’s always possible that these systems will instead go straight to Maintenance, or, very rarely, Deprecated.
Stage 3: Active
An Active technology is production grade, with staffed support from Cloud Engineering. Active technologies are our flagship technologies and products that we hope all of our friends across engineering will adopt.
Active technologies are available for production use.
Cloud Engineering sets aside resources to handle bug reports and feature requests for Active technologies. Requests will be prioritized according to normal procedures.
Most Active technologies also have reserved resourcing to provide users of the systems with hands-on support and consultation. (Note that Cloud Engineering does not perform operational tasks for users of our services; but when you use Active technologies, we set aside time to work hand-in-hand with you to achieve your objectives.)
Active technologies will eventually transition to Maintenance. Very rarely, an Active technology will transition straight to Deprecated, but never directly to Retired.
Stage 4: Maintenance
Maintenance technologies are older technologies and projects. In general, a technology is moved to Maintenance when its replacement is late in the Beta stage, or very early in the Active stage.
Maintenance technologies are available for production use. New use cases are discouraged but not prohibited — Active (or sometimes Beta) are preferred.
Cloud Engineering sets aside resources to handle bug reports (but not feature requests) for Maintenance technologies. Requests will be prioritized according to normal procedures.
Cloud Engineering may summarily decline feature requests for Maintenance technologies.
Cloud Engineering will not, in general, provide hands-on support and consultation for Maintenance technologies; whether it does depends on the maturity of alternative systems. Cloud Engineering may limit such support to designated office hours.
Maintenance technologies may one day progress to Deprecated, or be revived to Active if resources and prioritization allow.
Stage 5: Deprecated
Deprecated technologies are those for which there is a feature-complete Active alternative.
Frequently, these technologies have hard stop dates (for example, Ubuntu releases) that are external to Slack.
Existing users of Deprecated technologies should migrate to Active technologies as quickly as possible.
New use cases are prohibited from adopting Deprecated technologies.
Support is minimal. Documentation is likely to be outdated or incorrect. Bug reports (other than Security/Compliance) related to Deprecated systems will be classified as low priority.
Cloud Engineering reserves the right to summarily decline feature requests for Deprecated systems.
Deprecated systems will remain so for enough time that reasonable users will be able to migrate without undue haste. This typically means at least two full quarters after Deprecation (e.g., if a system is Deprecated during Q1, it will remain in this lifecycle state throughout the entirety of Q2 and Q3).
After this time, and at Cloud Engineering’s discretion, the technology will be Retired.
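The "two full quarters" rule can be made concrete with a little date arithmetic. The sketch below assumes calendar quarters (Q1 starting in January), which the post does not specify; a fiscal calendar would shift the boundaries. The function name `earliest_retirement` is hypothetical.

```python
from datetime import date


def earliest_retirement(deprecated_on: date, full_quarters: int = 2) -> date:
    """Return the first day on which retirement could happen.

    The quarter in which deprecation occurs does not count; the technology
    then stays Deprecated for `full_quarters` complete quarters.
    Assumes calendar quarters (Jan-Mar, Apr-Jun, Jul-Sep, Oct-Dec).
    """
    quarter_index = (deprecated_on.month - 1) // 3  # 0 for Q1 ... 3 for Q4
    # Count quarters from year 0, skip the current one, then add the full quarters.
    target = deprecated_on.year * 4 + quarter_index + 1 + full_quarters
    year, quarter = divmod(target, 4)
    return date(year, quarter * 3 + 1, 1)


if __name__ == "__main__":
    # Deprecated during Q1: stays Deprecated through all of Q2 and Q3.
    print(earliest_retirement(date(2023, 2, 15)))
    # Deprecated during Q4: the two full quarters fall in the next year.
    print(earliest_retirement(date(2023, 11, 20)))
```

The returned date is only a lower bound, since retirement after that point remains at the team's discretion.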
Stage 6: Retired
Retired systems are no longer supported and are not intended for use. Remnants may still exist, and it’s even possible that entire Retired systems will still function. However, Retired systems may fail at any moment, and if they do, the remnants will be deleted/terminated, not repaired.
Not available for any purpose. Do not use it for new use cases. Migrate existing use cases immediately.
No support is provided. The only response even to Security/Compliance requests will be to delete the related infrastructure. No bug reports or feature requests will be accepted.
End of the line, folks. Get off here or you’re sleeping in the bus yard.
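Taken together, the progression rules in the six stage descriptions behave like a small state machine. The following is a minimal sketch of that idea, assuming that any transition not named in a stage's progression text is disallowed; `Stage`, `ALLOWED`, `can_move`, and `is_valid_history` are hypothetical names, not an internal Slack API.

```python
from enum import Enum


class Stage(Enum):
    ALPHA = 1
    BETA = 2
    ACTIVE = 3
    MAINTENANCE = 4
    DEPRECATED = 5
    RETIRED = 6


# Transitions named in the stage descriptions; anything else is assumed disallowed.
ALLOWED = {
    Stage.ALPHA: {Stage.BETA, Stage.RETIRED},
    Stage.BETA: {Stage.ACTIVE, Stage.MAINTENANCE, Stage.DEPRECATED},
    Stage.ACTIVE: {Stage.MAINTENANCE, Stage.DEPRECATED},  # never straight to Retired
    Stage.MAINTENANCE: {Stage.DEPRECATED, Stage.ACTIVE},  # may be revived
    Stage.DEPRECATED: {Stage.RETIRED},
    Stage.RETIRED: set(),  # end of the line
}


def can_move(current: Stage, target: Stage) -> bool:
    """True if a technology may move directly from `current` to `target`."""
    return target in ALLOWED[current]


def is_valid_history(history: list[Stage]) -> bool:
    """True if every consecutive pair of stages in `history` is an allowed move."""
    return all(can_move(a, b) for a, b in zip(history, history[1:]))


if __name__ == "__main__":
    print(can_move(Stage.ACTIVE, Stage.RETIRED))
    print(is_valid_history([Stage.MAINTENANCE, Stage.DEPRECATED, Stage.RETIRED]))
```

A check like this could sit alongside an inventory of technologies to flag a proposed stage change that skips the notice period peers rely on.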
We use the Technology Lifecycle framework as a communications and planning tool. It helped us migrate through three iterations of a platform-ized Compute offering by making it easier to communicate expectations with current and prospective customers of these platforms.
Comparing our resource allocations to the technology’s stage on the lifecycle also helped us see where we were over-investing or under-investing based on the projected future of a technology.
We developed our first internal platform, BuiltBySlack, or “BBS”. This framework was rudimentary, consisting of a set of parameterized Terraform modules and Chef cookbooks used to deploy and run services bundled in an internally-developed packaging format.
BBS helped us prove out the use case and appetite for a platform offering while also serving some pieces of critical infrastructure.
After realizing some value from the abstraction and standardization BBS provided, we began working on a Kubernetes-based offering, “Bedrock”.
The first iteration, “Bedrock Classic”, used many of the same infrastructure components as our current-generation container platform, but required service owners to take responsibility for many more aspects of the container build and deploy process: writing shell scripts to build images and push them to registries, and writing their own Kubernetes resources to run the images.
We introduce the technology lifecycle. Bedrock Classic is described as Active in the first draft. BuiltBySlack is considered Maintenance. Its eventual replacement, Bedrock YAML, is Beta.
BuiltBySlack is moved to Deprecated, since we and the remaining BBS customers agree that Bedrock is at feature parity.
Bedrock YAML graduates to Active, now that it’s ready for full production use. The Cloud Engineering team makes this determination, based on the experience of our Beta customers and the results of internal production readiness assessments.
All BuiltBySlack customers have migrated off of BBS and onto Bedrock, and so BBS (and the packaging format it used) are moved to Retired.
After many months of maintaining feature parity between Bedrock YAML and Bedrock Classic, that older iteration is moved to Deprecated. We communicate to the customers of Bedrock Classic that they should plan to migrate to Bedrock YAML.
Through all of the stages of platform maturity (and a similar but more regimented process for, e.g., Ubuntu releases), a documented and concrete Lifecycle and Maturity Model for these products made it easy for us to plan for and communicate about our work, and ultimately to stop investing in technologies that did not fit our vision of the future.
Curtailing investments in systems that are slated to be deprecated and retired lets us focus a small and agile team onto high-impact work to support a substantially larger overall engineering team.