Groot: eBay's Event-Graph-Based Approach for Root Cause Analysis

As large-scale distributed microservice systems power more of today's businesses, detecting anomalies in these systems and efficiently diagnosing their root causes has become increasingly important for ensuring high system availability.

To diagnose root causes, existing approaches usually capture information about the state of the system through instrumentation or monitoring metrics. Then, using techniques such as machine learning or heuristics, they abstract the root cause analysis (RCA) problem into logical constraints or a dependency/causality graph. Graph models are popular because they can represent the dependencies and causal relationships between different components in a system. Existing work has also used probabilistic graphical models to describe the states of the system.
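To make the dependency-graph idea concrete, here is a minimal sketch. It is not Groot's algorithm; it only illustrates the general graph abstraction described above. The service names, the `depends_on` mapping, and the `root_cause_candidates` helper are all hypothetical. The heuristic assumed here is that an anomalous service is a root cause candidate when none of the services it depends on are anomalous.

```python
from typing import Dict, List, Set


def root_cause_candidates(
    depends_on: Dict[str, List[str]], anomalous: Set[str]
) -> List[str]:
    """Return anomalous services whose own dependencies all look healthy.

    Assumption (illustrative only): an anomaly propagates from a
    dependency to its callers, so an anomalous service with no
    anomalous dependency is a plausible origin.
    """
    candidates = []
    for service in sorted(anomalous):
        deps = depends_on.get(service, [])
        if not any(dep in anomalous for dep in deps):
            candidates.append(service)
    return candidates


# Hypothetical call graph: each service maps to the services it calls.
depends_on = {
    "checkout": ["payment", "inventory"],
    "payment": ["database"],
    "inventory": [],
    "database": [],
}

# Hypothetical anomaly detector output.
anomalous = {"checkout", "payment", "database"}

print(root_cause_candidates(depends_on, anomalous))  # ['database']
```

In this toy example, `checkout` and `payment` are both anomalous but each depends on another anomalous service, so only `database` remains as a candidate. Real systems need richer signals than this single rule, which is why approaches in this space combine graphs with metrics, events, and learned or hand-written heuristics.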

How Groot Helps Overcome Microservice Architecture Challenges 

Microservice architecture has been proposed and quickly adopted by many large companies to improve the scalability, development agility and reusability of their business systems. However, despite these undeniable benefits, microservice architecture also brings three new challenges in reliability:

  1. Operational Complexity: Large-scale systems typically have two major categories of Site Reliability Engineers (SREs): centralized/infrastructure SREs and embedded/domain SREs. The former focus on keeping the infrastructure reliable, but may be less familiar with specific services and therefore may not adapt quickly to new changes. The latter have domain and product knowledge, but likely spend additional effort on duplicate infrastructure work...