Operation Based SLOs
Anyone who has been following the topic of Site Reliability Engineering (SRE) has likely heard of Service Level Objectives (SLOs) and Service Level Indicators (SLIs). SLIs and SLOs are at the core of SRE practices. They are fundamental to striking a balance between building new features and shipping fast on the one hand, and working on the reliability of the product on the other. But they are not easy to get right. Zalando has gone through several iterations of defining SLOs, and we’re now in the process of maturing our latest iteration of SLO tooling. With this iteration, we are addressing the fragmentation problems that are inherent to service-based SLOs in highly distributed applications. Instead of defining reliability goals for each microservice (Catalog Service, Cart Service), we are working with SLOs on Critical Business Operations that are directly related to the user experience (e.g. "View Catalog", "Add Item to Cart"). In this blog post we’re going to present our Operation Based SLOs: how we define them, the tooling around them, how they fit into our development process, and how they contributed to a healthier on-call.
The first iterations of defining SLOs
To understand where we are right now, it’s important to understand how we got here. When we introduced SRE at Zalando back in 2016, we also introduced SLOs. At the time, we went with service-based SLOs. Each microservice would have SLOs on whatever SLIs its owners defined (usually availability and latency), and the owners would get a weekly report on those SLOs through a custom tool that was tightly coupled with our homebrew monitoring sys...