Better MoE model inference with warp decode

Most MoE inference systems organize the token generation path around experts. This mirrors how routing works and has been the standard approach at scale. For small-batch decode on Blackwell GPUs, however, we found that organizing the kernel around outputs rather than experts works better. We call this approach “warp decode.”

We arrived at warp decode by thinking about what the maximum achievable memory bandwidth for MoE decode on Blackwell actually is. That led us to flip the parallelism axis entirely. Instead of assigning warps to experts, we assign each warp to a single output value (neuron).
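The flipped axis can be illustrated with a minimal NumPy sketch. All shapes, routing weights, and names here are illustrative assumptions, not details from the post; each loop iteration in the output-major version stands in for one warp owning one output neuron.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, not the real model's.
d_model, d_ff, n_experts = 64, 128, 8
x = rng.standard_normal(d_model).astype(np.float32)           # one decode token
W = rng.standard_normal((n_experts, d_model, d_ff)).astype(np.float32)
experts = np.array([3, 5])                                    # experts routed for this token
gates = np.array([0.7, 0.3], dtype=np.float32)                # routing weights

# Expert-major (conventional): loop over experts, accumulate whole output vectors.
y_expert_major = np.zeros(d_ff, dtype=np.float32)
for e, g in zip(experts, gates):
    y_expert_major += g * (x @ W[e])

# Output-major ("warp decode" style): one "warp" owns one output neuron j and
# reduces across the token's selected experts for just that neuron.
y_output_major = np.empty(d_ff, dtype=np.float32)
for j in range(d_ff):                        # each iteration stands in for one warp
    acc = np.float32(0.0)
    for e, g in zip(experts, gates):
        acc += g * np.dot(x, W[e, :, j])     # dot over d_model: the warp's reduction
    y_output_major[j] = acc

assert np.allclose(y_expert_major, y_output_major, atol=1e-4)
```

Both orderings compute the same result; the difference is which axis the hardware parallelism follows, which changes the memory-access pattern a GPU kernel sees.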

Kernels that improve both performance and accuracy are rare, and warp decode is one of them. On Blackwell, it delivers a 1.84x throughput improvement while also improving accuracy with outputs 1.4x closer to a full FP32 reference. This speeds up the research and training pipeline for Composer, letting us improve the model faster and ship new versions more often.

The conventional MoE path

Modern MoE models route each token through a subset of specialized expert networks, selecting, for example, 8 out of 128 at a given layer. The standard implementation organizes all computation around those experts by collecting the tokens each expert needs, running the math, and reassembling the results.
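The conventional gather/compute/scatter path can be sketched in NumPy as follows. The sizes and uniform routing weights are hypothetical, chosen only to keep the example small; real systems do the per-expert step with batched GPU kernels.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, d_ff, n_experts, top_k = 4, 16, 32, 8, 2   # toy sizes
X = rng.standard_normal((n_tokens, d_model)).astype(np.float32)
W = rng.standard_normal((n_experts, d_model, d_ff)).astype(np.float32)

# Router picks top_k distinct experts per token (random here for illustration).
topk_idx = np.stack([rng.choice(n_experts, top_k, replace=False)
                     for _ in range(n_tokens)])
gates = np.full((n_tokens, top_k), 1.0 / top_k, dtype=np.float32)

# Expert-major path: gather each expert's tokens, run the math once per expert,
# then scatter the weighted results back into place.
Y = np.zeros((n_tokens, d_ff), dtype=np.float32)
for e in range(n_experts):
    tok, slot = np.nonzero(topk_idx == e)         # which tokens chose expert e
    if tok.size == 0:
        continue
    out = X[tok] @ W[e]                           # shared work for this expert
    Y[tok] += gates[tok, slot][:, None] * out     # reassemble with routing weights

# Reference: accumulate per token instead; must match the expert-major result.
Y_ref = np.zeros_like(Y)
for t in range(n_tokens):
    for k in range(top_k):
        Y_ref[t] += gates[t, k] * (X[t] @ W[topk_idx[t, k]])
assert np.allclose(Y, Y_ref, atol=1e-4)
```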

This works well for prefill and large batches, where the shared work per expert amortizes the overhead of gathering and scattering tokens by expert.
