Enhancing Netflix Reliability with Service-Level Prioritized Load Shedding
Without prioritized load-shedding, availability drops for both user-initiated and prefetch requests when latency is injected. With prioritized load-shedding enabled, user-initiated requests maintain 100% availability and only prefetch requests are throttled.
We were ready to roll this out to production and see how it performed in the wild!
Real-World Application and Results
Netflix engineers work hard to keep our systems available, so it was a while before a production incident tested the efficacy of our solution. A few months after deploying prioritized load shedding, we had an infrastructure outage at Netflix that impacted streaming for many of our users. Once the outage was fixed, we saw a 12x spike in pre-fetch requests per second from Android devices, presumably because a backlog of queued requests had built up.
Spike in Android pre-fetch RPS
This could have resulted in a second outage as our systems weren’t scaled to handle this traffic spike. Did prioritized load-shedding in PlayAPI help us here?
Yes! While the availability of prefetch requests dropped as low as 20%, the availability of user-initiated requests remained above 99.4% thanks to prioritized load-shedding.
Availability of pre-fetch and user-initiated requests
At one point we were throttling more than 50% of all requests, yet the availability of user-initiated requests remained above 99.4%.
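To make the behavior above concrete, here is a minimal sketch of priority-aware admission control. It is not the PlayAPI implementation: the `Priority` tiers, the `PriorityLoadShedder` class, the per-priority threshold scheme, and the threshold values are all assumptions chosen for illustration. The idea it shows is that lower-priority traffic is rejected at a lower utilization than higher-priority traffic, so prefetch requests absorb the throttling first.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Dict


class Priority(IntEnum):
    """Hypothetical tiers mirroring the two traffic classes discussed above."""
    USER_INITIATED = 0
    PREFETCH = 1


@dataclass
class PriorityLoadShedder:
    # Pluggable utilization measure returning a value in [0, 1]
    # (assumed interface; could be backed by CPU, concurrency, queue depth, ...).
    utilization: Callable[[], float]
    # Utilization at or above which a given priority is rejected
    # (illustrative values, not taken from the article).
    shed_thresholds: Dict[Priority, float]

    def should_admit(self, priority: Priority) -> bool:
        """Admit the request only while utilization is below its tier's threshold."""
        return self.utilization() < self.shed_thresholds[priority]


def demo() -> Dict[str, bool]:
    """Simulate a traffic spike that pushes utilization to 85%."""
    shedder = PriorityLoadShedder(
        utilization=lambda: 0.85,
        shed_thresholds={
            Priority.USER_INITIATED: 0.95,  # shed only as a last resort
            Priority.PREFETCH: 0.60,        # shed early to protect playback
        },
    )
    return {
        "user_initiated_admitted": shedder.should_admit(Priority.USER_INITIATED),
        "prefetch_admitted": shedder.should_admit(Priority.PREFETCH),
    }


if __name__ == "__main__":
    print(demo())
```

In this sketch the utilization measure is just a callable, so swapping in a different signal does not change the admission logic. A production system would likely also need smoothing of the utilization signal and gradual, rather than all-or-nothing, shedding within each tier.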
Based on the success of this approach, we have created an internal library to enable services to perform prioritized load shedding based on pluggable utilization measures, with multiple prior...