Dead End or Data Goldmine? Investment Insights from Two Years of AI-Driven Postmortem Analysis

TL;DR: We adopted LLMs as an intelligent SRE assistant to analyze thousands of postmortems, turning them from "dead ends" into "data goldmines." The approach automates the identification of recurring incident patterns, particularly in our datastores: Postgres, AWS DynamoDB, AWS ElastiCache, AWS S3, and Elasticsearch. AI speeds up analysis and uncovers hidden hotspots and investment opportunities, but human curation remains crucial for ensuring accuracy, building trust, and addressing limitations such as hallucinations and surface attribution errors. Even with these limitations, we see significant potential in combining AI with SRE to give engineering teams a capability that supports rapid decision making.

Introduction

At Zalando, a group of colleagues who look after the datastores in our Tech Radar wanted to explore:

“What if every system outage could make our entire infrastructure smarter?”

We took a Site Reliability Engineering (SRE) perspective to identify what can be learned from failures and postmortems. For us, a critical aspect of SRE is the feedback loop through which systems, teams, and investments evolve. So far, our approach to this feedback loop has been human-centric analysis of incident effects, root cause analysis (RCA), and the corrective measures implemented to prevent recurrence. This technique works well for immediate, reactive learning, but it does not scale to retrospective analysis of years of incident reports across the company.
