Taming the tail utilization of ads inference at Meta scale

  • Tail utilization is a significant system issue and a major factor in overload-related failures and low compute utilization.
  • The tail utilization optimizations at Meta have had a profound impact on model serving capacity footprint and reliability.
  • Failure rates, which are mostly timeout errors, were reduced by two-thirds; the compute footprint delivered 35% more work for the same amount of resources; and p99 latency was cut in half.

The inference platforms that serve the sophisticated machine learning models used by Meta’s ads delivery system require significant infrastructure capacity across CPUs, GPUs, storage, networking, and databases. Improving tail utilization – the utilization level of the top 5% of the servers when ranked by utilization – within our infrastructure is imperative to operate our fleet efficiently and sustainably.
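As a rough illustration of the metric defined above (not Meta's internal tooling), tail utilization can be computed from per-server utilization samples as follows; the 5% cutoff matches the definition in the text, while the `tail_utilization` function and the example fleet data are assumptions made for this sketch:

```python
# Minimal sketch: "tail utilization" as defined above -- the mean
# utilization of the top 5% of servers, ranked by utilization.
# The fleet data and helper below are illustrative, not Meta's internals.

def tail_utilization(server_utilizations: list[float], tail_fraction: float = 0.05) -> float:
    """Mean utilization of the most-utilized `tail_fraction` of servers."""
    if not server_utilizations:
        raise ValueError("need at least one utilization sample")
    ranked = sorted(server_utilizations, reverse=True)
    tail_count = max(1, int(len(ranked) * tail_fraction))
    return sum(ranked[:tail_count]) / tail_count

# Example: a fleet where most servers sit near 40% but a few run hot.
fleet = [0.40] * 95 + [0.85, 0.88, 0.90, 0.92, 0.95]
print(f"fleet mean:      {sum(fleet) / len(fleet):.2f}")  # ~0.43
print(f"tail (top 5%):   {tail_utilization(fleet):.2f}")  # 0.90
```

The gap between the fleet mean and the tail in this toy example is the problem the post is about: capacity and reliability are constrained by the hottest few servers, not by the average one.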


With the growing complexity and computational intensity of these models, as well as the strict latency and throughput requirements to deliver ads, we’ve implemented system optimizations and best practices to address tail utilization. The solutions we’ve implemented for our ads inference service have positively impacted compute utilization in our ads fleet in several ways, including increasing work output by 35 percent without additional resources, decreasing timeout error rates by two-thirds, and reducing tail latency at p99 by half.


How Meta’s ads model inference service works 


When placing an ad, client requests are routed to the inference service to get predictions. A single request from a client typically results in multiple model inferences being…
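The section is cut off in this copy, but the routing pattern it begins to describe (a single client request fanning out into multiple model inferences) can be sketched as below; the model names and the `infer` stub are hypothetical stand-ins, not Meta's actual API:

```python
# Hypothetical sketch of the fan-out described above: one ad request
# triggers inference calls against several models concurrently.
# Model names and the `infer` stub are illustrative, not Meta's API.
import asyncio

MODELS = ["ctr_model", "cvr_model", "quality_model"]  # assumed examples

async def infer(model: str, request: dict) -> float:
    """Stand-in for a call to one model on the inference service."""
    await asyncio.sleep(0.01)  # simulate network + compute latency
    return 0.5  # placeholder prediction score

async def predict_for_ad_request(request: dict) -> dict[str, float]:
    """One client request fans out into multiple model inferences."""
    scores = await asyncio.gather(*(infer(m, request) for m in MODELS))
    return dict(zip(MODELS, scores))

if __name__ == "__main__":
    result = asyncio.run(predict_for_ad_request({"ad_id": 123}))
    print(result)
```

Under this kind of fan-out, end-to-end latency is gated by the slowest model inference in the set, which is one reason tail latency and tail utilization matter so much for a service like this.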
