Enhancing Netflix Reliability with Service-Level Prioritized Load Shedding

Without prioritized load shedding, availability drops for both user-initiated and prefetch requests when latency is injected. With prioritized load shedding enabled, user-initiated requests maintain 100% availability and only prefetch requests are throttled.
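
The behavior described above can be illustrated with a minimal sketch. This is not Netflix's implementation: the `Priority` names, the `PrioritizedLoadShedder` class, and the threshold values are all hypothetical, and the sketch assumes a single utilization figure between 0.0 and 1.0 drives the decision. The idea is only that lower-priority traffic starts being rejected at a lower utilization than higher-priority traffic.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical priority classes; lower value means more important.
    USER_INITIATED = 0
    PREFETCH = 1


@dataclass
class PrioritizedLoadShedder:
    # Assumed mapping: priority -> utilization (0.0 to 1.0) at which
    # requests of that priority start being shed.
    shed_thresholds: dict

    def should_shed(self, priority: Priority, utilization: float) -> bool:
        """Return True if a request of this priority should be rejected."""
        return utilization >= self.shed_thresholds[priority]


# Illustrative thresholds only: prefetch is shed long before user traffic.
shedder = PrioritizedLoadShedder(
    {Priority.USER_INITIATED: 0.95, Priority.PREFETCH: 0.60}
)
```

With these assumed thresholds, a service running at 70% utilization would reject prefetch requests while still serving every user-initiated request.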

We were ready to roll this out to production and see how it performed in the wild!

Real-World Application and Results

Netflix engineers work hard to keep our systems available, and it was a while before we had a production incident that tested the efficacy of our solution. A few months after deploying prioritized load shedding, we had an infrastructure outage at Netflix that impacted streaming for many of our users. Once the outage was fixed, we got a 12x spike in prefetch requests per second from Android devices, presumably because a backlog of queued requests had built up.

Spike in Android pre-fetch RPS

This could have resulted in a second outage as our systems weren’t scaled to handle this traffic spike. Did prioritized load-shedding in PlayAPI help us here?

Yes! While the availability of prefetch requests dropped as low as 20%, the availability of user-initiated requests stayed above 99.4% thanks to prioritized load shedding.

Availability of pre-fetch and user-initiated requests

At one point we were throttling more than 50% of all requests, but the availability of user-initiated requests remained above 99.4%.
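
To see how more than half of all requests can be throttled while user-initiated availability stays high, here is a small worked calculation. The traffic mix below is made up for illustration and is not Netflix data; only the 20% and 99.4% availability figures come from the text. The sketch also assumes that every unavailable request was throttled rather than failing for some other reason.

```python
def throttled_fraction(traffic_rps: dict, availability: dict) -> float:
    """Fraction of all requests rejected, given per-class RPS and availability."""
    total = sum(traffic_rps.values())
    throttled = sum(
        rps * (1.0 - availability[name]) for name, rps in traffic_rps.items()
    )
    return throttled / total


# Hypothetical traffic mix during the spike: prefetch dominates.
traffic_rps = {"user_initiated": 3000, "prefetch": 7000}
# Availability figures quoted in the text.
availability = {"user_initiated": 0.994, "prefetch": 0.20}

fraction = throttled_fraction(traffic_rps, availability)
```

Under this assumed mix, roughly 56% of all requests are throttled, and almost all of them are prefetch requests.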

Based on the success of this approach, we have created an internal library to enable services to perform prioritized load shedding based on pluggable utilization measures, with multiple prior...
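
As a rough illustration of what "pluggable utilization measures" could mean, the sketch below treats each measure as a function returning a value between 0.0 and 1.0 and takes the highest reading as the service's utilization. The internal library's actual API is not shown in the text, so every name here (`UtilizationMeasure`, `combined_utilization`, the example measures) is an assumption made for illustration.

```python
from typing import Callable, Sequence

# Assumed shape of a pluggable measure: a zero-argument callable that
# returns the current utilization as a float between 0.0 and 1.0.
UtilizationMeasure = Callable[[], float]


def combined_utilization(measures: Sequence[UtilizationMeasure]) -> float:
    """Return the highest reading across all measures, clamped to [0.0, 1.0]."""
    if not measures:
        return 0.0
    highest = max(measure() for measure in measures)
    return min(1.0, max(0.0, highest))


# Hypothetical measures with fixed readings; real ones would sample the host.
def cpu_utilization() -> float:
    return 0.40


def inflight_request_ratio() -> float:
    return 0.85


current = combined_utilization([cpu_utilization, inflight_request_ratio])
```

Taking the maximum is one possible design choice: the service is treated as being as loaded as its most constrained resource, so a new measure can be plugged in without changing the shedding logic.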
