SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training

1. SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training
Presenter: Qi Zhang, Meituan Search and Recommendation Platform Department
2. Post-training in Large Language Models
Principles of pre-training: self-supervised learning on massive unlabeled data produces foundation models.
Principles of post-training: supervised fine-tuning and reinforcement learning elicit long, reflective reasoning.
3. Transferability of Post-training
• Improvements on a target task often come at the expense of performance in other domains.
• Transferability across domains is difficult to predict.
4. Transferability of Post-training
There exist several post-hoc analyses of the transferability of post-training.
5. How to predict the transferability of post-training?
6. Sparse Autoencoders (SAE)
$\mathcal{L}_{\mathrm{SAE}}(W_e, W_d; x) = \mathbb{E}_x\big[\,\| x - \hat{x} \|_2^2\,\big] = \mathbb{E}_x\big[\,\| x - W_d\,\sigma(W_e x) \|_2^2\,\big]$
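A minimal sketch of this objective in PyTorch, assuming a ReLU SAE trained on a model's hidden states with an L1 sparsity penalty; the class name, dimensions, and `l1_coeff` value are illustrative, not the exact recipe from the talk:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Encode a hidden state x into a wide, sparse feature vector h,
    then reconstruct x from h."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_e = nn.Linear(d_model, d_sae)  # encoder W_e
        self.W_d = nn.Linear(d_sae, d_model)  # decoder W_d
        self.sigma = nn.ReLU()                # sparsifying nonlinearity

    def forward(self, x: torch.Tensor):
        h = self.sigma(self.W_e(x))           # h = sigma(W_e x)
        x_hat = self.W_d(h)                   # x_hat = W_d h
        return x_hat, h

def sae_loss(x, x_hat, h, l1_coeff: float = 1e-3):
    # Reconstruction term E[||x - x_hat||^2] plus an L1 penalty
    # that pushes most entries of h to zero.
    recon = (x - x_hat).pow(2).sum(-1).mean()
    sparsity = h.abs().sum(-1).mean()
    return recon + l1_coeff * sparsity
```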
7. What Can We Do with SAEs?
Each SAE dimension is activated only by a certain natural concept, such as a mathematical definition, a physical property, or a linguistic pattern.
Cunningham, Hoagy, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. "Sparse autoencoders find highly interpretable features in language models." arXiv preprint arXiv:2309.08600 (2023).
8. How Do Representations Shift during SFT?
• The SFT process primarily affects only a limited portion of SAE features.
• This small subset of SAE features is closely associated with specific model capabilities.
9. Can We Predict the Shifts before Training?
10. Can We Predict the Shifts before Training?
To solve this challenge, we draw on the connection between supervised fine-tuning (SFT) and in-context learning (ICL).
In-context learning (ICL): given a context of demonstrations $C = \{(x_1, y_1), (x_2, y_2), \dots, (x_k, y_k)\}$, the model predicts $y_{k+1} \sim p_\theta(y \mid x_{k+1}, C)$.
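As a concrete illustration, the sketch below builds a prompt with and without the demonstration context $C$ and reads out a hidden state for the query; the checkpoint, layer index, and prompt template are arbitrary stand-ins, not choices from the talk:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in model for illustration.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def last_token_state(prompt: str, layer: int) -> torch.Tensor:
    """Hidden state of the final prompt token at the given layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]  # shape: (d_model,)

# Demonstrations (x_i, y_i) and a query x_{k+1}.
demos = [("2 + 2 = ?", "4"), ("3 * 5 = ?", "15")]
query = "7 - 4 = ?"
context = "".join(f"Q: {x}\nA: {y}\n" for x, y in demos)

h_plain = last_token_state(f"Q: {query}\nA:", layer=12)
h_icl = last_token_state(context + f"Q: {query}\nA:", layer=12)
```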
11. How to Predict the Downstream Performance?
SAE-based Transferability Score (STS): representation shifts with ICL.
Step 1: Identify shifted dimensions.
$D_s = \operatorname{TopN}_d\; \mathbb{E}_{x_q}\big[\,\big| h_d(x_q; \Theta) - h_d(x_1, y_1, \dots, x_k, y_k, x_q; \Theta) \big|\,\big]$
Step 2: Find correlations with downstream domains.
$\mathrm{STS}_{\mathrm{act}} = \mathbb{E}_{x_q}\big[\, \textstyle\sum_{d \in D_s} h_d^2(x_q; \Theta) \,\big]$ (SAE activations)
$\mathrm{STS}_{\mathrm{shift}} = \mathbb{E}_{x_q}\big[\, \textstyle\sum_{d \in D_s} \big( h_d(x_q; \Theta) - h_d(x_1, y_1, \dots, x_k, y_k, x_q; \Theta) \big)^2 \,\big]$ (SAE activation shifts)
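A condensed sketch of the two-step computation, assuming an `sae` module like the one above that returns `(x_hat, h)`, batches of query hidden states with and without the ICL context, and an illustrative `top_n`. Note that in the setup above, Step 1 uses the fine-tuning-style data while Step 2's expectation runs over downstream-domain queries; this sketch does not separate the two batches:

```python
import torch

def sts_scores(sae, h_plain, h_icl, top_n: int = 64):
    """h_plain: (B, d_model) states of x_q alone;
    h_icl: (B, d_model) states of x_q preceded by the ICL context."""
    with torch.no_grad():
        _, f_plain = sae(h_plain)  # (B, d_sae) SAE activations
        _, f_icl = sae(h_icl)

    # Step 1: D_s = dimensions with the largest mean absolute shift.
    mean_shift = (f_plain - f_icl).abs().mean(dim=0)  # (d_sae,)
    shifted = mean_shift.topk(top_n).indices

    # Step 2: score a domain by its activation mass (STS_act)
    # and its activation shift (STS_shift) on D_s.
    sts_act = f_plain[:, shifted].pow(2).sum(-1).mean()
    sts_shift = (f_plain[:, shifted] - f_icl[:, shifted]).pow(2).sum(-1).mean()
    return sts_act.item(), sts_shift.item(), shifted
```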
12. The Performance of STS
The Pearson correlation between STS and actual performance shifts exceeds 0.75 across different settings!
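A correlation check of this kind is straightforward to reproduce; the per-domain numbers below are placeholders, not results from the talk:

```python
from scipy.stats import pearsonr

# Hypothetical per-domain values: predicted STS vs. measured
# change in accuracy after SFT (placeholders only).
predicted_sts = [0.12, 0.45, 0.08, 0.31, 0.27]
measured_shift = [-0.8, 3.1, -0.2, 2.0, 1.5]

r, p = pearsonr(predicted_sts, measured_shift)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
```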
13. Ablation Study
• SAE Sparsity is Crucial
• Robust to Layer Choice
• Beats Probe Baseline
14. What can we do with STS? STS-Guided Data Mixture
• Use STS to predict the domains most affected by SFT.
• Domains with larger predicted shifts need more additional data (e.g., Engineering).
• An STS-proportional data mixture leads to more balanced performance (see the sketch below).
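One simple way to turn per-domain STS scores into a data mixture is proportional allocation, sketched below; the normalization rule, domain names, and scores are assumptions for illustration:

```python
def mixture_weights(sts_by_domain: dict) -> dict:
    """Allocate the extra-data budget in proportion to each domain's
    predicted shift: larger predicted shift -> larger data share."""
    total = sum(sts_by_domain.values())
    return {d: s / total for d, s in sts_by_domain.items()}

# Hypothetical STS scores per domain.
weights = mixture_weights(
    {"math": 0.45, "engineering": 0.30, "law": 0.10, "health": 0.15}
)
budget = 100_000  # total extra examples to mix in
counts = {d: round(w * budget) for d, w in weights.items()}
print(counts)  # e.g. {'math': 45000, 'engineering': 30000, ...}
```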
15. What can we do with STS? Selective Fine-tuning
• Fine-tune only the linear layers on the top-K shifted SAE dimensions, with all other parameters frozen (see the sketch after this list).
• Improves math performance with very few trainable parameters.
• Works across different models and datasets.
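One plausible reading of this recipe, sketched below: mask gradients so that only the encoder rows and decoder columns tied to the top-K shifted SAE dimensions receive updates. The hook-based mechanism is an assumption for illustration, not necessarily the talk's exact implementation:

```python
import torch

def train_only_shifted(sae, shifted_dims: torch.Tensor):
    """Zero out gradients for all SAE parameters except those aligned
    with `shifted_dims` (e.g., the Top-K indices from the STS step)."""
    mask_e = torch.zeros_like(sae.W_e.weight)
    mask_e[shifted_dims, :] = 1.0  # encoder rows map to SAE dims
    mask_d = torch.zeros_like(sae.W_d.weight)
    mask_d[:, shifted_dims] = 1.0  # decoder columns map to SAE dims

    sae.W_e.weight.register_hook(lambda g: g * mask_e)
    sae.W_d.weight.register_hook(lambda g: g * mask_d)

    if sae.W_e.bias is not None:   # encoder bias is also per-dimension
        mask_b = torch.zeros_like(sae.W_e.bias)
        mask_b[shifted_dims] = 1.0
        sae.W_e.bias.register_hook(lambda g: g * mask_b)
```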
16. Explorations on RL
• Directly applying STS in RL shows weak correlation with performance shifts.
• Reason: RL lacks ground-truth answers, making shifted-feature estimation inaccurate.
• Using the true shifted dimensions restores a strong correlation.
17. Takeaways
• Post-training causes domain-dependent performance shifts. Understanding how these shifts transfer across domains is critical but was previously unclear.
• STS predicts transferability before training. By analyzing shifted SAE features, STS estimates performance changes without running post-training.
• Empirical validation: STS shows high correlation (>0.7) with actual performance shifts across models and domains.
• Practical implications: STS can guide post-training strategies, such as data-mixture design, to mitigate uneven domain shifts.
18. Q&A
