Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters

By Jalpa Patel, Ankur Singh, Hany Morsy

Once complete, our AI cluster, Prometheus, will deliver 1 gigawatt of capacity to enhance existing AI experiences and enable new ones across Meta products. Prometheus' infrastructure will span several data center buildings within a single larger region, interconnecting tens of thousands of GPUs.

A key part of scaling and connecting this infrastructure is backend aggregation (BAG), which we use to connect GPUs and data centers with robust, high-capacity networking. By combining modular hardware, advanced routing, and resilient topologies, BAG is designed to deliver both performance and reliability at unprecedented scale.

As our AI clusters continue to grow, we expect BAG to play an important role in meeting future demands and driving innovation across Meta’s global network.

What Is Backend Aggregation?

BAG is a centralized Ethernet-based super spine network layer that primarily functions to interconnect multiple spine layer fabrics across various data centers and regions within large clusters. Within Prometheus, for example, the BAG layer serves as the aggregation point between regional networks and Meta’s backbone, enabling the creation of mega AI clusters. BAG is designed to support immense bandwidth needs, with inter-BAG capacities reaching the petabit range (e.g., 16-48 Pbps per region pair).
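To give a sense of what petabit-range capacity implies physically, the rough arithmetic below converts the 16-48 Pbps figure into a count of parallel links. The 800 Gbps port speed and the `links_needed` helper are illustrative assumptions, not details from the article; actual port speeds, oversubscription, and redundancy would change the numbers.

```python
import math

PBPS_IN_GBPS = 1_000_000  # 1 Pbps = 10^6 Gbps (decimal SI units)

# Assumption for illustration only: the article does not state a port speed.
ASSUMED_PORT_SPEED_GBPS = 800


def links_needed(capacity_pbps: float, port_speed_gbps: float) -> int:
    """Return the minimum number of parallel links of the given speed
    required to carry the stated aggregate capacity."""
    return math.ceil(capacity_pbps * PBPS_IN_GBPS / port_speed_gbps)


if __name__ == "__main__":
    # The article's quoted range per region pair: 16-48 Pbps.
    for capacity_pbps in (16, 48):
        count = links_needed(capacity_pbps, ASSUMED_PORT_SPEED_GBPS)
        print(f"{capacity_pbps} Pbps at {ASSUMED_PORT_SPEED_GBPS} Gbps per link: "
              f"at least {count:,} links")
```

Under this assumed port speed, the quoted range works out to tens of thousands of links per region pair, which is why an aggregation layer built from modular hardware matters at this scale.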

We use backend aggregation (BAG) to interconnect data center regions, pooling compute and other resources into large clusters.

How BAG...
