Unveiling Super Experts in Mixture-of-Experts Large Language Models


1. Unveiling Super Experts in Mixture-of-Experts Large Language Models
Presenter: Zunhai Su, Computing and Intelligence Platform Department, Meituan

2. Contents: 1. Introduction; 2. Super Experts: Discovery and Localization; 3. The Importance of Super Experts; 4. Understanding the Impact of Super Experts Compression; 5. Conclusion and Contributions; 6. Q&A

3. 1. Introduction
• Mixture-of-Experts (MoE) models employ dynamic routing and sparse activation, which has shown potential for enhancing LLMs and has led to state-of-the-art MoE LLMs.
• A significant challenge stems from their large parameter size, which presents obstacles for deployment.
• Expert-level compression methods have been developed that leverage the uneven importance of experts.
• Analyzing expert importance not only facilitates model compression but also provides deeper insight into the inner workings of MoE LLMs.

4. 1. Introduction
• Is there a small subset of distinct experts that plays an exceptionally critical role in the underlying mechanisms of MoE LLMs?
• Through comprehensive analysis of various open-source MoE LLMs, we consistently confirm the existence of such experts.
• As shown in the figure, pruning just three experts from Qwen3-30B-A3B leads to significant degradation, while randomly pruning other experts has a considerably smaller impact.
• We refer to these experts as Super Experts (SEs), and our analysis provides progressively deeper insights into them.

5. 1. Introduction
• Super Experts: Discovery and Localization
• The Importance of Super Experts
• Understanding the Impact of Super Experts Compression

6. 2. Super Experts: Discovery and Localization
• Recent research has explored a distinct class of extreme activation outliers in LLMs, which appear in the hidden states between decoder layers and are known as massive activations (MAs).
• Existing research has yet to clarify how these MAs arise in MoE LLMs. Surprisingly, we find that a small subset of experts consistently produces extreme activation outliers in the output of their down_proj layers. These outliers are then passed on to the hidden states via residual summation, leading to MAs.
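A minimal sketch of the residual-summation step described on this slide, using made-up numbers rather than values from any real model. The helper names `residual_add` and `max_abs` are hypothetical; the point is only that one extreme channel in an expert's down_proj output survives the addition and dominates the hidden state.

```python
def residual_add(hidden, expert_out):
    """Add an expert's output to the residual stream, element by element."""
    return [h + e for h, e in zip(hidden, expert_out)]


def max_abs(values):
    """Largest absolute value in a vector."""
    return max(abs(v) for v in values)


# Toy hidden state with ordinary magnitudes (illustrative values only).
hidden = [0.5, -1.2, 0.8, 0.3]

# Hypothetical down_proj output with a single extreme channel.
outlier_out = [0.1, 0.2, 900.0, -0.1]

# Hypothetical down_proj output of an ordinary expert.
ordinary_out = [0.1, 0.2, -0.4, -0.1]

after_outlier = residual_add(hidden, outlier_out)
after_ordinary = residual_add(hidden, ordinary_out)

# The outlier channel carries over into the hidden state almost unchanged.
print(max_abs(after_outlier), max_abs(after_ordinary))
```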

7. 2. Super Experts: Discovery and Localization
• To validate this mechanism directly, we also perform ablation experiments by dynamically pruning the SEs in Qwen3-30B-A3B.
• As illustrated in the figure, pruning the SEs from a single layer effectively eliminates that layer's contribution to MAs.
• When all SEs are pruned, MAs are completely eliminated, confirming that they are directly generated by SEs.
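The slides do not spell out how dynamic pruning is implemented. The sketch below assumes it means dropping a pruned expert's contribution at inference time, without renormalizing the remaining gate weights. The function `moe_output` and all numbers are hypothetical.

```python
def moe_output(expert_outputs, gate_weights, pruned=frozenset()):
    """Weighted sum of expert outputs for one token, skipping pruned experts.

    expert_outputs: dict expert_id -> list of floats (expert output vector)
    gate_weights:   dict expert_id -> float routing weight for this token
    pruned:         expert ids whose contribution is dropped at run time
    """
    dim = len(next(iter(expert_outputs.values())))
    out = [0.0] * dim
    for expert_id, vec in expert_outputs.items():
        if expert_id in pruned:
            continue  # assumption: a pruned expert simply contributes nothing
        weight = gate_weights[expert_id]
        for i, value in enumerate(vec):
            out[i] += weight * value
    return out


# Illustrative token routed to two experts; expert 7 plays the SE role.
expert_outputs = {0: [0.2, -0.1, 0.3], 7: [0.1, 1200.0, -0.2]}
gate_weights = {0: 0.4, 7: 0.6}

full = moe_output(expert_outputs, gate_weights)
without_se = moe_output(expert_outputs, gate_weights, pruned={7})

peak_full = max(abs(v) for v in full)
peak_pruned = max(abs(v) for v in without_se)
print(peak_full, peak_pruned)
```

In this toy setting the extreme channel disappears once expert 7 is skipped, which mirrors the ablation result reported on the slide.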

8. 2. Super Experts: Discovery and Localization
• The SE identification criterion is motivated by the heavy-tailed distribution of expert output activations.
• Using the proposed SE profiling tool, we identify the SEs across different models and input datasets.
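The exact criterion does not survive in this transcript. As an illustration of how a heavy-tailed distribution can be thresholded, the sketch below flags experts whose peak absolute output exceeds both a high percentile of all peaks and a fixed fraction of the global maximum. The function names and the values `q=99.0` and `ratio=0.1` are placeholders, not thresholds taken from the slides.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def find_outlier_experts(peak_by_expert, q=99.0, ratio=0.1):
    """Return ids of experts whose peak activation sits far in the tail.

    peak_by_expert: dict expert_id -> max |activation| seen during profiling
    q, ratio:       assumed thresholds, chosen only for this example
    """
    peaks = list(peak_by_expert.values())
    cutoff = max(percentile(peaks, q), ratio * max(peaks))
    return sorted(e for e, p in peak_by_expert.items() if p > cutoff)


# Synthetic profile: 200 experts with small peaks, two with extreme peaks.
peak_by_expert = {i: 1.0 + (i % 10) * 0.1 for i in range(200)}
peak_by_expert[17] = 850.0
peak_by_expert[123] = 400.0

flagged = find_outlier_experts(peak_by_expert)
print(flagged)
```

Requiring both conditions keeps the flagged set small: the percentile alone would always flag a fixed share of experts, even in a model with no outliers at all.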

9. 2. Super Experts: Discovery and Localization
• SEs are consistently present across the investigated models, accounting for less than 0.5% of all experts.

10. 2. Super Experts: Discovery and Localization
• After post-training processes, the distribution of SEs remains unchanged compared to the base model.

11. 2. Super Experts: Discovery and Localization
• We also analyze SE distributions across several other datasets. The distribution of SEs remains highly stable regardless of variations in the input data domain.

12. 3. The Importance of Super Experts
• For non-reasoning models, pruning only a few SEs leads to significant degradation across all tasks, with average accuracy dropping by 21.68% to 27.21%.
• On GSM8K in particular, the degradation ranges from 52.71% to 74.15%.
• In contrast, random pruning has a negligible impact, underscoring the crucial role of SEs.

13. 3. The Importance of Super Experts
• To evaluate the importance of SEs in reasoning models, we select DeepSeek-R1 and the thinking mode of Qwen3-30B-A3B.
• The benchmarks cover: (i) General Tasks, (ii) Math & Text Reasoning, (iii) Agent & Coding.
• After SE pruning, the Pass@1 scores for most tasks drop to zero.

14. 3. The Importance of Super Experts
• After pruning the SEs, the model consistently generated repetitive responses in nearly every test, continuing until it reached the maximum output length.
• This behavior suggests that the model entirely loses its ability to reason and solve problems after SE pruning.

15. 4. Understanding the Impact of Super Experts Compression
• SEs constitute the fundamental source of systematic outliers in MoE LLMs.
Super Experts (source) -> Massive Activations (intermediate bridge) -> Attention Sinks (actual effect)

16. 4. Understanding the Impact of Super Experts Compression
• The routing behavior of SEs ensures that the attention sink token is routed to, and activates, the SEs.
• Evidence: the router scores assigned to SEs for the sink token are exceptionally large, whereas for non-sink tokens the scores are distributed more evenly across experts.
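To make the contrast concrete, the sketch below applies a softmax to two sets of made-up router logits and compares how concentrated the resulting scores are. Treating the router scores as a softmax over logits is an assumption here, and the logits are invented for illustration rather than measured on a model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up router logits over four experts; expert 1 plays the SE role.
sink_logits = [0.1, 6.0, -0.3, 0.2]   # hypothetical sink token
other_logits = [0.4, 0.1, 0.3, 0.2]   # hypothetical non-sink token

sink_scores = softmax(sink_logits)
other_scores = softmax(other_logits)

# Top-1 share: close to 1 means one expert dominates the routing.
print(max(sink_scores), max(other_scores))
```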

17. 4. Understanding the Impact of Super Experts Compression
• SEs constitute the fundamental source of systematic outliers in MoE LLMs.
Super Experts (source) -> Massive Activations (intermediate bridge) -> Attention Sinks (actual effect)

18. 4. Understanding the Impact of Super Experts Compression
• The figure visualizes the attention scores of several heads before and after pruning SEs, highlighting the complete disappearance of attention sinks (ASs) following SE pruning.
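One simple way to quantify what such a figure shows is the average attention weight that later query positions place on the sink token. The sketch below assumes the sink sits at position 0 and uses two invented causal attention maps; `sink_share` is a hypothetical helper, not part of the authors' tooling.

```python
def sink_share(attn, sink_index=0):
    """Mean attention weight placed on the sink token.

    attn: list of rows, one per query position, each row summing to 1.
    The first row is skipped because under causal masking the first token
    can only attend to itself.
    """
    rows = attn[1:]
    return sum(row[sink_index] for row in rows) / len(rows)


# Invented causal attention maps for one head over three tokens.
before_pruning = [
    [1.00, 0.00, 0.00],
    [0.90, 0.10, 0.00],
    [0.85, 0.05, 0.10],
]
after_pruning = [
    [1.00, 0.00, 0.00],
    [0.50, 0.50, 0.00],
    [0.30, 0.30, 0.40],
]

share_before = sink_share(before_pruning)
share_after = sink_share(after_pruning)
print(share_before, share_after)
```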

19. 5. Conclusion and Contributions
1. We report, for the first time, the discovery and systematic investigation of Super Experts.
2. SEs are characterized by rare but extreme activation outliers in their output. The distribution of SEs is model-specific, data-agnostic, and unaffected by post-training processes.
3. Pruning SEs has a considerable impact on overall performance.
4. Super Experts (source) -> Massive Activations (bridge) -> Attention Sinks (effect)
5. These findings advance the understanding of the internal dynamics of MoE LLMs, filling an important gap in current knowledge.

20. Q&A