Why you should develop a correction of error (COE)

Source: aws.amazon.com

Summary

Application reliability is critical. Service interruptions result in a negative customer experience, which reduces customer trust and business value. One best practice we have learned at Amazon is to have a standard mechanism for post-incident analysis. This lets us analyze a system after an incident so that we can avoid recurrences. These incidents also help us learn more about how our systems and processes work. That knowledge often leads to actions that help in other incident scenarios, not just the prevention of a specific recurrence. The mechanism is called the Correction of Error (COE) process. Although post-event analysis is part of the COE process, it differs from a postmortem because the focus is on corrective actions, not just on documenting failures. This post explains why you should implement the COE mechanism after an incident, and describes its components to help you get started.

Shared by 顶尖义 on 2022-10-25

Related topics: #Amazon
