Why should you develop a Correction of Error (COE)?

Application reliability is critical. Service interruptions result in a negative customer experience, which reduces customer trust and business value. One best practice that we have learned at Amazon is to have a standard mechanism for post-incident analysis. This lets us analyze a system after an incident in order to avoid recurrences in the future. These incidents also help us learn more about how our systems and processes work. That knowledge often leads to actions that help with other incident scenarios, not just the prevention of a specific recurrence. The mechanism is called the Correction of Error (COE) process. Although post-event analysis is part of the COE process, a COE is different from a postmortem, because the focus is on corrective actions, not just documenting failures. This post explains why you should start implementing the COE mechanism after an incident, and describes its components to help you get started.

Why should you do COE?

The COE process consists of a post-event analysis. It is imperative that the negative impact caused by the event be mitigated before the COE process begins. Once that is done, the COE process lets you:

Deep dive into the sequence of events leading up to the incident
Find the root cause of the problem and identify remediation actions
Analyze the impact of the incident on the business and our customers
Identify and track action items that prevent the problem from recurring
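
The four outcomes listed above map naturally onto a structured record: a timeline, a root cause, an impact statement, and a set of tracked action items. The following is a minimal sketch of such a record, not a format prescribed by the COE process itself. The class names `CoeRecord` and `ActionItem`, their fields, and the helper methods are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, datetime


@dataclass
class ActionItem:
    """One follow-up task (hypothetical shape, not an official COE schema)."""
    description: str
    owner: str
    due: date
    done: bool = False


@dataclass
class CoeRecord:
    """Illustrative container for the pieces of a COE named in the list above."""
    title: str
    impact: str                 # effect on the business and on customers
    root_cause: str = ""        # filled in once the analysis has converged
    timeline: list[tuple[datetime, str]] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def add_event(self, when: datetime, what: str) -> None:
        """Record an event and keep the timeline in chronological order."""
        self.timeline.append((when, what))
        self.timeline.sort(key=lambda entry: entry[0])

    def open_action_items(self) -> list[ActionItem]:
        """Return the action items that still need to be completed."""
        return [item for item in self.action_items if not item.done]
```

Keeping action items as first-class entries with an owner and a due date reflects the emphasis on tracking corrective actions rather than only documenting the failure.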

What a COE is not

It is not a process for finding whom to blame for the problem: The purpose of a COE is to facilitate maximum visibility into the areas that are most in need of improvement. Creating a culture that rewards people for surfacin...