A Systematic Approach to Reducing Technical Debt
While technical debt is a recurring issue in software engineering, the case of the Merchant Orders team within Zalando Direct was an outlier: due to the lack of a clearly defined process, technical debt more or less only ever accumulated. When I joined this team in autumn 2020 as its new engineering lead, the technical-debt backlog had entries dating back to 2018. In this article, I describe the process we set up in Q1/2021 in order to regain control of our technical debt. While the situation in your own team may not be quite as dire, you may nonetheless find some aspects of this blog post useful to adopt.
Our backlog of technical-debt tickets used to be in excess of 70, with no end in sight. Since adopting the methodology described in this article, we have shipped more than ten features or improvements over the course of eight weeks, i.e. four sprints. For the first time in three years, i.e. ever since my team started tracking technical debt, we are reducing it.
This article is written from a managerial perspective and has Engineering Leads and Engineering Managers as its target audience, though I hope that engineers of all levels find value in this article. Furthermore, I can only encourage any software engineer reading this article to approach their lead if ever-growing technical debt is an issue in their team. There is a non-zero chance that they will appreciate you raising the issue, considering that all of us are aware that technical debt is a serious problem. If you do not pay it down, you will get more technical debt on top for free, until your only option is a complete rewrite. This is quite similar to compound interest driving debtors into bankruptcy in the real world. Obviously, we would like to avoid such an outcome.
An excerpt from my team’s technical-debt backlog as of April 2021. As you can see, there are items from 2018 and 2019 on it.
Technical Debt, Known and Unknown
Using the vocabulary of the Johari window, you can probably identify plenty of “known known” technical debt in your codebase. However, some technical debt constitutes an “unknown unknown”, i.e. technical debt we do not know that we have. In our case, we had a long backlog of known technical debt, with many dozens of entries. Given that we have over a dozen services to maintain, this is probably not even a particularly frightening number. However, there is also technical debt that you are completely unaware of. This may seem counter-intuitive, in particular if you subscribe to the notion of being able to perfectly design services in advance, as well as once and for all eternity. Yet, this is not a caricature, considering that you can encounter non-technical leads who hold rather similar beliefs. In some circumstances, this could even be a perfectly valid position to hold, for instance in static environments.
There are at least two sources of unknown technical debt. First, there are problems with your services that you simply have not yet identified. This can happen easily because once you agree on a design and subsequently carry out its implementation, you may not question any decisions the team has agreed on. This can of course mean that there are drawbacks in your design or implementation that someone with a fresh pair of eyes, for instance a new joiner, may be able to spot. Second, technology is a fast-moving field. This means that today’s cutting-edge design-patterns, development processes, testing strategies, or even programming languages and paradigms may get superseded. Your current best practices replaced your previous set of best practices one by one, and there are new developments that will one day make you wonder why anybody ever thought that a hitherto valid approach was ever a good idea. Of course, there is also the problem that we sometimes need to deliver features quickly to seize a business opportunity, which may lead to sub-optimal design and implementation decisions.
Not all change is positive, however. As much as we engineers may pride ourselves on our objectivity, our industry is also driven by fads. This is such a big issue that a company like Gartner makes money by selling its analyses of where on the “hype cycle” certain technologies are. Sometimes, we also regress as an industry, for instance by adopting technologies that are popular but less powerful. Yet, if they are being pushed by corporations with an annual marketing budget of many hundreds of millions of dollars, they can get a lot of traction in industry. Any of your services might look much different if it were rewritten today. As a practical consequence, I think you should take the time to re-review your existing services and look for improvements, but, if possible, with a very critical view toward buzzwords du jour. Even TeX, arguably one of the most mature software products in the world, receives fixes to this very day. Its first version was released about four decades ago. Taking this into account, it is probably not an entirely implausible assumption that your services could be improved as well. On a related note, Zalando has formal processes in place for selecting technologies as well as adopting new technologies. This is certainly helpful for engineering leaders, yet it cannot address the problem that some technologies fall out of favor over time due to shortcomings.
As we create software solutions in a highly dynamic environment where both customer requirements and technologies can change, a semi-regular review of any of your services may uncover areas of improvement. All of that should be categorized as (hitherto unknown) technical debt. A very welcome consequence of such an exercise is that your engineers will gain greater familiarity with their services. This is particularly valuable if your services need to be reliable at all times. Preferably, each engineer on your on-call rotation should have very detailed knowledge of your services, so thoroughly studying the source code of your existing services will be very helpful to them.
Motivating your Engineers
In management theory, a popular concept is Theory X/Theory Y. The two theories are always presented as a pair. According to Theory X, people only work because they need money and, if they could get away with it, they would prefer to not work at all. In contrast, Theory Y posits that people are intrinsically motivated, care about their work, and want to advance in their career. Reality is probably somewhere in between. However, as a leader, the problem is how to get people to want to work on technical debt. In our case, the problem was that the backlog had tickets on it that were three years old, which seems to imply a lack of motivation to work on such tickets.
As leaders, we can of course simply tell people what to work on (Theory X). The problem, however, is that people tend to be more productive if they work on tickets they really do want to work on (Theory Y). Furthermore, my experience as an engineer was that work on technical debt can be fulfilling and can also open up new opportunities. Consequently, I use a Theory Y approach with my team, stressing the benefits of this kind of work. Please note that this is not in any way a cynical approach. A good part of my growth as an engineer was due to resolving hairy technical problems, oftentimes with a focus on performance improvements. In one of my internships I was given the task of increasing the performance of an artificial neural network, and this work later led to me getting hired in a very competitive field. I also highlighted to my team that work on technical debt can sometimes be easily quantified. An engineer’s CV certainly looks better with hard data on percentages of performance increases or space reductions. Examples are: “Reduced weekly AWS hosting fees by $500 by evaluating resource requirements” (this is an actual result of our work) or “reduced space requirements of one of our databases by 12% by optimizing data types and removing redundant information.”
The Technical-Debt Rotation
My team already has several rotations in place. Thus, I set up technical debt as another rotation. I aim to give my team autonomy in their work, so my proposal was the following: all engineers take turns in the technical-debt rotation, and one iteration lasts for one week. In practice, this means that on every Monday an engineer should spend some time on identifying technical debt they want to work on. This can either be known technical debt, i.e. one or more tickets from the technical-debt tracker, or unknown technical debt. For the latter, my suggestion is to pick one of our many services, study the source code, and look for improvements. This should lead to a number of additional tickets. Preferably, an engineer identifying possible improvements of an existing service should also do the corresponding work. This is particularly the case when we only have a hypothesis that requires some work to test it.
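A weekly rotation of this kind is easy to generate ahead of time. The following Python sketch is only an illustration and not the tooling my team uses; the function name `rotation_schedule`, the engineer names, and the start date are all hypothetical.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def rotation_schedule(engineers, first_monday, weeks):
    """Return (monday, engineer) pairs, assigning one engineer per week in turn."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    if first_monday.weekday() != 0:  # Monday is 0 in Python's datetime module
        raise ValueError("first_monday must be a Monday")
    # Cycle through the team round-robin for the requested number of weeks.
    assignees = islice(cycle(engineers), weeks)
    return [
        (first_monday + timedelta(weeks=i), name)
        for i, name in enumerate(assignees)
    ]


# Hypothetical team of three, planned four weeks ahead.
for monday, name in rotation_schedule(["Ada", "Grace", "Linus"], date(2021, 4, 5), 4):
    print(monday.isoformat(), name)
```

With three engineers and four weeks, the first engineer comes up again in the fourth week, so everyone takes a turn before anyone repeats.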
I want the engineers on the technical-debt rotation to work on tickets related to technical debt before taking on any tickets from our regular backlog, which is of course considered during the planning meeting. In terms of the time commitment, I am rather flexible. I would like the engineer on the rotation to spend at least one day working on technical debt. However, there are situations where a bigger commitment may be warranted. This is particularly the case with larger subprojects, which are detailed in the next section.
You may have noticed that I have not addressed the issue of urgency as, clearly, not all technical debt is created equal. Pressing issues we tend to address as soon as possible. We commonly do not even classify them as technical debt but instead as a necessary bug fix or an “operations” issue. Nonetheless, some of our accumulated technical debt is merely nice-to-resolve. My advice to fellow leaders would be to keep an eye on what your team is working on by tracking the technical-debt tickets your team closes. There should be a healthy mix of relative importance. If not, you will have to address this, perhaps in a separate session for backlog refinement. I would not advise you to rank all technical-debt tickets by urgency and simply assign them, however, for reasons specified in the previous section.
We also have a simple system in place for categorizing technical debt where we use the two metrics "complexity" and "impact", and rank both on a scale from one to five. In our case, these estimations are initially done by the engineer who adds entries to the tech-debt backlog, but they are reviewed intermittently. I think a good starting point is picking a few items that could be considered low-hanging fruit, i.e. work that pairs relatively low complexity with moderate to high impact. You may want to encourage your engineers to also tackle more complex work with a medium to high impact. You may also find that some of the technical debt is not worth resolving at the current point in time as the impact would be low to non-existent. Those you may want to save for a less busy time, for instance the code freeze before Cyber Week.
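The two-metric categorization can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our actual tooling: the `DebtTicket` class, the category names, the numeric thresholds (what counts as "low" complexity or "moderate" impact), and the example tickets are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DebtTicket:
    title: str
    complexity: int  # 1 (trivial) to 5 (very complex)
    impact: int      # 1 (negligible) to 5 (very high)

    def __post_init__(self):
        # Both metrics are ranked on a scale from one to five.
        for name in ("complexity", "impact"):
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")


def classify(ticket):
    """Map a ticket to a coarse category. The thresholds are assumptions."""
    if ticket.impact <= 1:
        return "defer"              # low to non-existent impact: save for a quiet period
    if ticket.complexity <= 2 and ticket.impact >= 3:
        return "low-hanging fruit"  # low complexity, moderate to high impact
    if ticket.complexity >= 3 and ticket.impact >= 3:
        return "larger effort"      # more complex work with medium to high impact
    return "backlog"


def suggest_order(tickets):
    """Sort by highest impact first, breaking ties by lower complexity."""
    return sorted(tickets, key=lambda t: (-t.impact, t.complexity))


# Hypothetical example tickets.
backlog = [
    DebtTicket("Remove unused database columns", complexity=2, impact=4),
    DebtTicket("Migrate service to new infrastructure", complexity=5, impact=5),
    DebtTicket("Rename legacy config keys", complexity=1, impact=1),
]
for ticket in suggest_order(backlog):
    print(f"{classify(ticket):18} {ticket.title}")
```

The ordering is only a suggestion for the engineer on the rotation. In keeping with the autonomy described above, it is not meant as a mechanism for assigning tickets.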
Capitalizing Technical Debt
One of the duties of software engineering leads is to ensure that the work their team performs is properly capitalized. This means that any software we create that increases our digital assets should also be added to our financial assets. In turn, this reduces our tax liabilities. Maintenance work, however, cannot be capitalized as it is instead considered an expense. A collection of technical debt tickets could constitute a mini-project that can be capitalized, however. One example would be a migration to new infrastructure or a significant rewrite that leads to performance improvements. Admittedly, packaging technical-debt tickets into a project may be an overly idealistic scenario. Yet, it is a possible outcome. In our team’s case, we have recently identified a number of issues with our Scala code base, due to an over-reliance on object-oriented programming constructs. If we resolved them, we would have a more maintainable system; we also predict an improvement in performance as there are many instances where objects are used instead of primitive types. Similarly, you may be able to identify a group of technical-debt tickets, provided your backlog is long enough, that could constitute a small project.
Results
The team has been following the technical-debt rotation as described in this article for about two months. Feedback from the team has been positive. Among other things, the engineers remarked that it adds variety to their work and that they appreciate the increased autonomy. Of course, the latter will only be the case for as long as there is a large enough backlog of technical-debt tickets to choose from. At some point, hopefully, we will have reduced our backlog significantly, and then we will have to rely on the intrinsic motivation of wanting to better understand an existing system by diving deeper into implementation details or the satisfaction of improving the performance or design of a service.
From the perspective of an engineering leader, my end goal is to pay down as much technical debt as possible. In fact, the ideal size of our technical-debt backlog would be zero. This is a distant goal, but we have taken successful steps towards it. First, I wanted to reduce the rate of increase of the backlog. We achieved this within the first two weeks. If you preside over a technical-debt backlog that has only been growing for three years, it is already satisfying to see that it is no longer growing as quickly. The next step was to keep the number of tickets on the backlog steady, which we achieved soon afterwards. Now we are at the point where the total number of tickets on our technical-debt backlog is, possibly for the first time ever, declining. The team is very happy about it. One year from now, I expect us to have drastically reduced our technical-debt backlog.