Slowing Down to Speed Up: Circuit Breakers for Slack's CI/CD

What happens when your distributed service has challenges with stampeding herds of internal requests? How do you prevent cascading failures between internal services? How might you re-architect your workflows when naive horizontal or vertical scaling reaches its limits?

These were the challenges facing Slack engineers during their day-to-day development workflows in 2020. Multiple internal services that engineers used were stretched to their limits, leading to cascading failures between services. Cascading failures are positive feedback loops in which one part of the system fails under load, causing requests to queue in an adjacent system, which in turn causes another system to fail under load. For several years, internal tooling and services struggled to keep up with 10% month-over-month growth in CI/CD requests, driven by growth in 1) internal headcount and 2) the complexity of services and testing. Development across Slack slowed due to these failures, leaving internal tooling and infrastructure engineers scrambling to restore service. Engineers managed to restore service in the short term by…

  • Scaling appliances like GitHub Enterprise to the largest hardware available at the time in AWS (limiting future vertical scaling).
  • Scaling one service with more nodes to handle a new peak load (only to discover that this led to failures in another service in the infrastructure).

Of course, these solutions would only work until we reached a new peak load in internal services. We needed a new way to think about this problem.
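
To put the 10% month-over-month growth figure in perspective, here is a rough sketch of how quickly compounding load eats through any fixed capacity headroom. The function names and the assumption of a perfectly steady 10% monthly rate are illustrative only; real request volume would be less regular.

```python
import math


def load_after(months: int, monthly_growth: float = 0.10) -> float:
    """Relative load after `months` of steady compound growth (1.0 = today)."""
    return (1.0 + monthly_growth) ** months


def months_to_multiply(factor: float, monthly_growth: float = 0.10) -> float:
    """Months of steady compound growth needed for load to grow by `factor`."""
    return math.log(factor) / math.log(1.0 + monthly_growth)


yearly = load_after(12)
doubling = months_to_multiply(2.0)
print(f"{yearly:.2f}x per year, doubling every {doubling:.1f} months")
# prints "3.14x per year, doubling every 7.3 months"
```

Under this assumption, a one-time scaling step that doubles capacity would be used up in well under a year, which is consistent with each fix lasting only until the next peak.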

This article describes how Sla...
