Unveiling Super Experts in Mixture-of-Experts Large Language Models
1. Unveiling Super Experts in Mixture-of-Experts
Large Language Models
Presenter: Zunhai Su
Meituan, Computing and Intelligence Platform Department
2. Contents
1. Introduction
2. Super Experts: Discovery and Localization
3. The Importance of Super Experts
4. Understanding the Impact of Super Experts Compression
5. Conclusion and Contributions
6. Q&A
3. 1. Introduction
• MoE models employ dynamic routing and sparse activation; their potential for enhancing LLMs has led to state-of-the-art MoE LLMs.
• A significant challenge stems from their large parameter size, which presents obstacles for deployment.
• Expert-level compression methods have been developed by leveraging the uneven importance of experts.
• Analyzing expert importance not only facilitates model compression but also provides deeper insights into the inner workings of MoE LLMs.
4. 1. Introduction
• Is there a small subset of distinct experts that plays an exceptionally critical role in the underlying mechanisms of MoE LLMs?
• Through comprehensive analysis of various open-source MoE LLMs, we consistently confirm the existence of such experts.
• As shown in the figure, pruning just three experts from Qwen3-30B-A3B leads to significant degradation, while randomly pruning other experts has a considerably smaller impact.
• We refer to these experts as Super Experts (SEs), and our analysis provides progressively deeper insights into them.
5. 1. Introduction
• Super Experts: Discovery and Localization
• The Importance of Super Experts
• Understanding the Impact of Super Experts Compression
6. 2. Super Experts: Discovery and Localization
• Recent research has explored a distinct class of extreme activation outliers in LLMs, which appear in the hidden states between decoder layers and are known as massive activations (MAs).
• Existing research has yet to clarify how these MAs arise in MoE LLMs. Surprisingly, we find that a small subset of experts consistently produces extreme activation outliers in the output of their down_proj layers. These outliers are then passed on to the hidden states via residual summation, leading to MAs.
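The residual-summation step can be illustrated with a toy example. This is only a sketch with made-up numbers: the function name `moe_layer_output`, the gate weights, and the outlier value 900.0 are assumptions for illustration, not values taken from any model.

```python
# Toy illustration (all numbers are made up): one expert's down_proj output
# carries a single extreme value, and the residual sum copies it into the
# hidden state that the next decoder layer receives.

def moe_layer_output(hidden, expert_outputs, gate_weights):
    """Return hidden + sum_e gate_weights[e] * expert_outputs[e]."""
    mixed = [0.0] * len(hidden)
    for weight, out in zip(gate_weights, expert_outputs):
        for i, value in enumerate(out):
            mixed[i] += weight * value
    # Residual connection: the expert mixture is added to the incoming state.
    return [h + m for h, m in zip(hidden, mixed)]

hidden = [0.5, -0.3, 0.1, 0.2]
ordinary_expert = [0.2, 0.1, -0.4, 0.3]
super_expert = [0.1, 900.0, -0.2, 0.0]  # hypothetical outlier in channel 1

new_hidden = moe_layer_output(hidden, [ordinary_expert, super_expert], [0.5, 0.5])
peak = max(abs(v) for v in new_hidden)
print(new_hidden, peak)
```

Here the outlier in channel 1 dominates the updated hidden state, while the other channels stay at their usual scale.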
7. 2. Super Experts: Discovery and Localization
• To directly validate this mechanism, we also perform ablation experiments by dynamically pruning the SEs in Qwen3-30B-A3B.
• As illustrated in the figure, pruning SEs from a single layer effectively eliminates their contribution to MAs.
• When all SEs are pruned, MAs are completely eliminated, confirming that they are directly generated by SEs.
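One way to picture pruning an expert at inference time is to exclude it from routing, so the top-k selection falls back to the remaining experts. The slides do not specify how the pruning was implemented; the sketch below is an assumption, and `route_top_k`, the logits, and the choice of expert 1 as the stand-in SE are hypothetical.

```python
import math

def route_top_k(router_logits, k, pruned=frozenset()):
    """Pick the top-k experts after excluding pruned ones.

    Returns {expert_index: normalized gate weight}.
    """
    candidates = [(logit, idx) for idx, logit in enumerate(router_logits)
                  if idx not in pruned]
    top = sorted(candidates, reverse=True)[:k]
    peak = max(logit for logit, _ in top)
    # Softmax over the selected experts only (numerically stabilized).
    exps = {idx: math.exp(logit - peak) for logit, idx in top}
    total = sum(exps.values())
    return {idx: e / total for idx, e in exps.items()}

logits = [0.1, 4.0, 0.3, 1.2]  # made up; expert 1 stands in for a Super Expert
before = route_top_k(logits, k=2)
after = route_top_k(logits, k=2, pruned={1})
print(before, after)
```

With expert 1 pruned, its token is served by the next-best experts, so whatever outlier it would have produced never reaches the residual stream.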
8. 2. Super Experts: Discovery and Localization
• This criterion is motivated by the heavy-tailed distribution of expert output activations.
• Using the proposed SE profiling tool, we identify the SEs across different models and input datasets.
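A profiling pass of this kind could be sketched as follows. The exact criterion is not stated in the slide text, so the rule below (a high percentile cutoff combined with a fraction of the global maximum) and the thresholds `percentile=0.995` and `max_fraction=0.1` are illustrative assumptions, as are the function name and the synthetic profile.

```python
def find_outlier_experts(peak_by_expert, percentile=0.995, max_fraction=0.1):
    """Flag experts whose peak |activation| lies far into the tail.

    peak_by_expert maps (layer, expert) to the largest |down_proj output|
    seen during calibration. Both thresholds are illustrative assumptions.
    """
    peaks = sorted(peak_by_expert.values())
    cutoff_index = min(len(peaks) - 1, int(percentile * len(peaks)))
    tail_cutoff = peaks[cutoff_index]
    global_max = peaks[-1]
    return sorted(
        key for key, peak in peak_by_expert.items()
        if peak >= tail_cutoff and peak >= max_fraction * global_max
    )

# Made-up profile: 4 layers x 100 experts with ordinary peaks below 2.0,
# then two entries overwritten with extreme values.
profile = {(layer, expert): 1.0 + 0.01 * expert
           for layer in range(4) for expert in range(100)}
profile[(1, 68)] = 1500.0
profile[(2, 92)] = 900.0

flagged = find_outlier_experts(profile)
print(flagged)
```

Combining a percentile with a fraction of the maximum is one way to handle a heavy-tailed distribution: the percentile alone would always flag some experts, even in a model without extreme outliers.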
9. 2. Super Experts: Discovery and Localization
• SEs are consistently present across the investigated models, accounting for less than 0.5% of all experts.
10. 2. Super Experts: Discovery and Localization
• After post-training, the distribution of SEs remains unchanged compared to the base model.
11. 2. Super Experts: Discovery and Localization
• We also analyze SE distributions across several other datasets. The distribution of SEs remains highly stable, regardless of variations in the input data domain.
12. 3. The Importance of Super Experts
• For non-reasoning models, pruning only a few SEs leads to significant degradation across all tasks, with average accuracy dropping by 21.68% to 27.21%.
• In particular, for GSM8K, the degradation ranges from 52.71% to 74.15%.
• In contrast, random pruning has a negligible impact, underscoring the crucial role of SEs.
13. 3. The Importance of Super Experts
• To evaluate the importance of SEs in reasoning models, we select DeepSeek-R1 and the thinking mode of Qwen3-30B-A3B.
• The benchmarks cover: (i) General Tasks, (ii) Math & Text Reasoning, (iii) Agent & Coding.
• After SE pruning, the Pass@1 scores for most tasks drop to zero.
14. 3. The Importance of Super Experts
• After pruning the SEs, the model consistently generated repetitive responses in nearly every test, continuing until it reached the maximum output length.
• This behavior suggests that the model entirely loses its ability to reason and solve problems after SE pruning.
15. 4. Understanding the Impact of Super Experts Compression
• SEs constitute the fundamental source of systematic outliers in MoE LLMs.
Super Experts (source) -> Massive Activations (intermediate bridge) -> Attention Sinks (practical effect)
16. 4. Understanding the Impact of Super Experts Compression
• This routing behavior of SEs ensures that the attention sink token is activated at the SEs.
• Evidence: the router scores assigned to SEs for the sink token are exceptionally large, whereas for non-sink tokens the scores are more evenly distributed across experts.
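The contrast between the two score distributions can be seen in a small softmax example. The logits below are made up for illustration, and expert 1 is only a stand-in for an SE; the sketch assumes the router scores are a softmax over per-expert logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

sink_token_logits = [0.2, 9.0, 0.1, -0.3]   # made up; expert 1 plays the SE
other_token_logits = [0.4, 0.6, 0.5, 0.3]   # made up

sink_scores = softmax(sink_token_logits)
other_scores = softmax(other_token_logits)
print(max(sink_scores), max(other_scores))
```

For the sink token nearly all router probability lands on one expert, while for the ordinary token no expert receives much more than a uniform share.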
17. 4. Understanding the Impact of Super Experts Compression
• SEs constitute the fundamental source of systematic outliers in MoE LLMs.
Super Experts (source) -> Massive Activations (intermediate bridge) -> Attention Sinks (practical effect)
18. 4. Understanding the Impact of Super Experts Compression
• The figure visualizes the attention scores for several heads before and after pruning SEs, highlighting the complete disappearance of attention sinks following SE pruning.
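A simple way to quantify such a change is the average attention weight that later queries place on the sink position. The sketch below uses hand-written attention maps, not measurements from any model; `sink_mass` and the choice of position 0 as the sink are assumptions.

```python
def sink_mass(attention_rows, sink_position=0):
    """Average attention weight placed on the sink position.

    The first query row is skipped: under causal masking it can only
    attend to position 0, so it would inflate the average.
    """
    rows = attention_rows[1:]
    return sum(row[sink_position] for row in rows) / len(rows)

# Made-up causal attention maps for one head (each row sums to 1).
before_pruning = [
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.85, 0.05, 0.10, 0.00],
    [0.80, 0.05, 0.05, 0.10],
]
after_pruning = [
    [1.00, 0.00, 0.00, 0.00],
    [0.50, 0.50, 0.00, 0.00],
    [0.30, 0.30, 0.40, 0.00],
    [0.25, 0.25, 0.25, 0.25],
]

print(sink_mass(before_pruning), sink_mass(after_pruning))
```

A head with a strong sink gives a value close to 1, whereas a head without one gives roughly the share expected from spreading attention across the visible tokens.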
19. 5. Conclusion and Contributions
1. We report, for the first time, the discovery and systematic investigation of Super Experts.
2. SEs are characterized by rare but extreme activation outliers in their down_proj output. The distribution of SEs is model-specific, data-agnostic, and unaffected by post-training.
3. Pruning SEs has a considerable impact on overall performance.
4. Super Experts (source) -> Massive Activations (bridge) -> Attention Sinks (effect)
5. These findings advance the understanding of the internal dynamics of MoE LLMs, filling an important gap in current knowledge.
20. Q&A