uVitals – An Anomaly Detection and Alerting System

Every day, millions of people rely on Uber to get from place to place and to have food and groceries delivered. Uber depends on the reliability of its internal systems and the accuracy of its data to power its platform. A glitch in these systems can result in a poor user experience, a loss in revenue, or both. Major system issues that affect the reliability of our services are detected and mitigated quickly. Minor issues, however, take longer to detect and mitigate, and collectively they can result in poor user experiences and revenue loss over time. This is where uVitals comes in: it surfaces these issues and anomalies as they begin to occur.

In today’s fast-paced digital world, where businesses depend on uninterrupted services, preventing downtime and disruptions is crucial. 

Figure 1: Failure frequency vs Time to Detect

Let’s take a look at the distribution of outages, where the x-axis represents the time it takes to detect an issue in hours and the y-axis depicts the frequency of failures. What we observe is interesting: issues with higher failure frequencies tend to be detected earlier and resolved within a day, thanks to our reliability and availability systems. 

However, the story takes a different turn when we explore the domain of less frequent issues. Here, the time to detection stretches, often surpassing a day, as these challenges are typically handled through incident response processes. But what about the long tail of issues that lurk in the shadows, sometimes remaining undetected until they cause chaos? For these, traditional strategies may not suffice.
