Uber has been on a multi-year journey to reimagine our infrastructure stack for a hybrid, multi-cloud world. The internal code name for this project is Crane. In this post we’ll examine the original motivation behind Crane, the requirements we needed to satisfy, and some key features of our implementation. Finally, we’ll wrap up with a forward-looking view of Uber’s infrastructure.
In the Beginning…
In 2018, Uber was facing three major challenges with respect to our infrastructure:
- The size of our server fleet was growing rapidly, and our tooling and teams weren’t able to keep up. Many operations for managing servers were still manual, and the automated tooling we did have was constantly breaking down. Both the manual operations and the automated tooling were frequent causes of outages. In addition, operational load was taking a severe toll on teams, leaving them less time to work on fundamental software fixes and creating a vicious cycle.
- Fleet size growth came with the need to expand into more data centers/availability zones. What little tooling existed for turning up new zones was ad hoc, with the vast majority of the work being manual and spread across many different infrastructure teams. Turning up a new zone took multiple months and involved dozens of teams and hundreds of engineers. In addition, circular dependencies between infrastructure components often led to awkward bootstrapping problems that were difficult to solve.
- Our server fleet consisted mostly of on-prem machines, with limited ability to take advantage of additional capacity that was available in the cloud. W...