It Wasn't a Culture Problem: Upleveling Alert Development at Airbnb

Observability as Code (OaC) — defining alerts, dashboards, and SLOs via code rather than UI — is table stakes for large engineering organizations. With OaC, observability adopts software development’s version control, code review, and testing processes, achieving the same level of discipline as a result. At Airbnb’s scale (thousands of engineers and services), this is the foundation that lets teams ship confidently while maintaining the reliability our guests and hosts depend on.

Yet there’s a critical gap in most OaC workflows. While we bring rigor to alert definitions through code review and version control, the actual behavior of those alerts often can’t be validated until they’re live. Production becomes the proving ground. Problems surface either as noise that erodes trust or silence that hides real incidents.

This tolerance of high alert noise might appear to be a culture problem, but we realized it was actually a gap in the developer workflow. We solved it by building accessible, fast feedback loops to preview, validate, and surface actionable insights on alert behavior before PR submission. With these changes, development cycles collapsed from weeks to minutes, and we successfully migrated 300,000 alerts from a vendor to Prometheus, a feat that wouldn’t have been possible otherwise.
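To make the idea of previewing alert behavior before PR submission concrete, here is a minimal sketch, not Airbnb's actual tooling. It assumes an alert can be reduced to a threshold plus a required number of consecutive breaching samples, and it replays a list of historical values to count how often the rule would have started firing. The names `AlertRule`, `count_firings`, and the sample data are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical, simplified alert definition (illustration only)."""
    name: str
    threshold: float   # a sample breaches when it is above this value
    for_samples: int   # consecutive breaching samples required to fire


def count_firings(rule: AlertRule, samples: List[float]) -> int:
    """Replay historical samples and count how many times the rule starts firing."""
    firings = 0
    streak = 0  # length of the current run of breaching samples
    for value in samples:
        if value > rule.threshold:
            streak += 1
            # Count the firing once, at the moment the run reaches the required length.
            if streak == rule.for_samples:
                firings += 1
        else:
            streak = 0
    return firings


# Made-up historical error-rate samples, purely for demonstration.
history = [0.1, 0.6, 0.7, 0.2, 0.9, 0.8, 0.9, 0.1, 0.6]

noisy = AlertRule("HighErrorRate", threshold=0.5, for_samples=1)
tuned = AlertRule("HighErrorRate", threshold=0.5, for_samples=3)

print(count_firings(noisy, history))  # 3
print(count_firings(tuned, history))  # 1
```

Comparing the two counts before merging shows the kind of feedback the workflow aims for: the author can see that the first variant would have paged three times on this history while the second would have paged once, without waiting for production to reveal it.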

Airbnb’s OaC North Star

Our Observability as Code North Star is for product teams to receive out-of-the-box, best-practice monitoring from platform teams. When a product engineer adopts Kubernetes, a service framework, or a database, they shoul...
