Improved Alerting with Atlas Streaming Eval

Ruchir Jha, Brian Harrington, Yingwu Zhao

TL;DR

  • Streaming alert evaluation scales much better than the traditional approach of polling time-series databases.
  • It allows us to overcome high dimensionality/cardinality limitations of the time-series database.
  • It opens doors to support more exciting use-cases.

Engineers want their alerting system to be real-time, reliable, and actionable. While actionability is subjective and may vary by use-case, reliability is non-negotiable. In other words, false positives are bad, but false negatives are the absolute worst!

A few years ago, our SRE team paged us because our Metrics Alerting System was falling behind: critical application health alerts were reaching engineers 45 minutes late! As we investigated the delay, we found that the number of configured alerts had recently increased fivefold. The alerting system queried Atlas, our time-series database, on a cron for each configured alert query, and it was seeing an elevated throttle rate and excessive retries with backoffs. This, in turn, increased the time between two consecutive checks for an alert, causing a global slowdown for all alerts. On further investigation, we discovered that one user had programmatically created tens of thousands of new alerts. This user represented a platform team at Netflix, and their goal was to build alerting automation for their users.
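To see why a jump in alert count slows down every alert in a polling design, here is a rough back-of-the-envelope sketch. It is not Atlas or Netflix code: the function name, the parameters, and the numbers in the usage note are all hypothetical, and the model assumes one query per alert with at most one throttled retry per query.

```python
import math


def sweep_seconds(num_alerts: int, query_seconds: float, concurrency: int,
                  throttle_rate: float = 0.0, backoff_seconds: float = 0.0) -> float:
    """Estimate how long one full polling pass over all alerts takes.

    Hypothetical model: alerts are queried in batches of `concurrency`.
    A fraction `throttle_rate` of queries is throttled once, waits
    `backoff_seconds`, and is then retried successfully.
    """
    # Expected cost of one query, including the expected retry overhead.
    per_query = query_seconds + throttle_rate * (backoff_seconds + query_seconds)
    # Number of sequential batches needed to cover every alert.
    batches = math.ceil(num_alerts / concurrency)
    return batches * per_query


if __name__ == "__main__":
    before = sweep_seconds(10_000, 0.5, 100)
    after = sweep_seconds(50_000, 0.5, 100, throttle_rate=0.2, backoff_seconds=10.0)
    print(f"pass time before: {before:.0f}s, after: {after:.0f}s")
```

In this toy model, the time between two consecutive checks of any single alert is roughly the length of one full pass. Growing the alert count fivefold lengthens the pass fivefold on its own, and throttling with backoff multiplies it further, which matches the global slowdown described above.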

While we were able to put out the immediate fire by disabling the newly created alerts, this incident raised some critical concerns around th...
