1. Scaling and Transferability of Annealing Strategies
in Large Language Model Training
Siqi Wang, Zhengyu Chen, Teng Xiao, Zheqi Lv, Jinluan Yang,
Xunliang Cai, Jingang Wang, Xiaomeng Li
AAAI 2026 · Main Technical Track
Meituan · HKUST · AI2
Meituan LongCat Team
2. Outline
• Motivation
• Related Work
• Forward-Momentum Scaling Law
• Modeling Annealing Momentum
• Training Loss Curve Fitting
• Transfer of Optimal Annealing Ratio
• Conclusion
3. Motivation
Motivation: Why Annealing Matters Beyond Tokens
Key observation:
Same model size + same token budget with different LR schedules / batch sizes
→ very different loss trajectories.
Problem:
1) Existing scaling laws predict only the final loss;
2) The training dynamics along the way are under-modeled.
4. Motivation
Empirical Observation: Steps vs Tokens
1) Loss curves diverge when plotted vs tokens
2) Loss curves align when plotted vs training steps (for batch size ≥ optimal threshold)
Implication
1) Training steps are a more stable tracker of optimization progress
5. Motivation
What Determines Training Dynamics?
We study three key factors:
1) Forward effect of training steps
2) Annealing effect of LR decay
3) Model size dependence
Goal:
1) build a unified, predictive model of loss curves
2) enable transferable annealing strategies
6. Related Work
From Scaling Laws to Full Training Curves
Classic scaling law:
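The formula itself did not survive extraction; the classic Chinchilla-style form from the cited Hoffmann et al. paper predicts final loss from parameter count N and token count D alone:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Because it depends only on N and D, two runs with the same budget but different LR schedules receive identical predictions.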
Limitation:
1) Cannot distinguish different schedulers
2) Cannot model annealing behavior
➡ Need a dynamics-aware formulation
Scaling Laws for Neural Language Models. Kaplan et al. (arXiv:2001.08361)
Training Compute-Optimal Large Language Models. Hoffmann et al. (arXiv:2203.15556)
7. Related Work
• LLM training scaling laws can be reliably reproduced using a simple constant learning-rate plus cooldown schedule (WSD scheduler).
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations. Hägele et al. (arXiv:2405.18392)
8. Related Work
• A scaling law that models the entire loss trajectory of LLM training by integrating learning-rate annealing into loss prediction.
Scaling Law with Learning Rate Annealing. Tissue et al. (arXiv:2408.11029)
9. Related Work
• Common learning-rate schedules for large model training (like cosine and WSD) closely match performance bounds from non-smooth convex optimization theory.
Convex Theory & LR Scheduling for Large Models. Schaipp et al. (arXiv:2501.18965)
10. Forward-Momentum Scaling Law
We propose:
Where:
S = integral of learning rate over steps
M = accumulated annealing momentum
N = model size
Key idea:
Forward progress + annealing refinement
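The proposed equation was lost in extraction. As a reconstruction only (not the paper's verified formula), a form consistent with the variables defined above and with the annealing-aware law of Tissue et al. (arXiv:2408.11029) would combine a power-law forward term in S, an annealing-momentum reduction, and a power-law model-size term; all coefficients and exponents here are placeholders:

```latex
% Hypothetical sketch; L_0, A, C, B, \alpha, \gamma are placeholder constants.
L(S, M, N) \approx L_0 + A \cdot S^{-\alpha} - C \cdot M + B \cdot N^{-\gamma}
```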
11. Modeling Annealing Momentum
Instead of multiplicative accumulation, we use Adam-style momentum integration.
Firstly, the momentum m_t and second moment v_t are updated at each step.
Then bias correction is applied to both moments.
The cumulative momentum M_t is updated from the bias-corrected moments.
Benefits:
1) More stable numerically
2) Robust across batch sizes
3) Works with irregular step counts
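The update equations themselves were lost in extraction, so here is a minimal sketch of what Adam-style accumulation of an annealing signal could look like. The function name `annealing_momentum` and the choice of the per-step LR drop as the integrated signal are illustrative assumptions, not the paper's exact definitions:

```python
import math

def annealing_momentum(lrs, beta1=0.9, beta2=0.999, eps=1e-8):
    """Accumulate an Adam-style annealing momentum M_t over an LR schedule.

    lrs: learning rate at each step. The per-step signal is the LR drop
    g_t = lrs[t-1] - lrs[t] (an assumption for illustration).
    """
    m = v = M = 0.0
    history = []
    for t in range(1, len(lrs)):
        g = lrs[t - 1] - lrs[t]              # LR decay at this step
        m = beta1 * m + (1 - beta1) * g      # first moment update
        v = beta2 * v + (1 - beta2) * g * g  # second moment update
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        M += m_hat / (math.sqrt(v_hat) + eps)  # cumulative momentum M_t
        history.append(M)
    return history
```

With a constant LR the accumulated momentum stays at zero; once the LR starts decaying, M grows, matching the intuition that annealing contributes an extra loss reduction on top of forward progress.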
12. Training Loss Curve Fitting
Across Model Sizes
Dense & MoE models
1) 50M → 1B Dense
2) 100M → 1.5B MoE
Results:
1) Mean prediction error < 2%
2) Works across architectures
Observation 1. Empirically, for model size dependence, we verified that loss curves
follow a power-law trend.
13. Training Loss Curve Fitting
Across Batch Sizes
Batch size: For B ≥ B_opt, step-based loss curves converge.
Observation 2. Empirically, for the same model, a wide range of batch sizes (greater than B_opt) with the same learning rate schedule produce similar training curves vs training steps.
An Empirical Model of Large-Batch Training, McCandlish et al. (arXiv:1812.06162)
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. Hu et al. (arXiv:2404.06395)
14. Training Loss Curve Fitting
Across Schedulers
Schedulers: Cosine ↔ WSD loss curves can predict each other
Observation 3. Empirically, different schedulers affect the loss curve in a predictable manner, even though they produce distinct curve patterns.
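To make the comparison concrete, here is a sketch of the two schedule shapes being compared; the exact parameterization (e.g. `anneal_ratio`, linear cooldown) is illustrative, not taken from the paper:

```python
import math

def cosine_lr(t, T, lr_max, lr_min=0.0):
    """Cosine decay from lr_max at step 0 to lr_min at step T."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def wsd_lr(t, T, lr_max, anneal_ratio=0.1, warmup=0):
    """Warmup-Stable-Decay: warmup, constant plateau, then a linear
    cooldown over the final anneal_ratio fraction of training."""
    decay_start = int(T * (1 - anneal_ratio))
    if t < warmup:
        return lr_max * t / max(warmup, 1)       # linear warmup
    if t < decay_start:
        return lr_max                            # stable plateau
    return lr_max * (T - t) / max(T - decay_start, 1)  # linear decay
```

A cosine run decays continuously from step 0, while WSD holds lr_max and compresses all annealing into the final fraction of steps, which is why the two loss curves look different yet remain mutually predictable under a dynamics-aware law.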
15. Transfer of Optimal Annealing Ratio
Annealing Ratio
Definition: The fraction of training spent in the decay phase of the learning rate
schedule.
For WSD scheduler:
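The slide's formula was lost in extraction; from the definition above, the annealing ratio of a WSD run is presumably the cooldown fraction (the symbols T_decay and T_total are named here for illustration):

```latex
R = \frac{T_{\text{decay}}}{T_{\text{total}}}
```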
Question:
How should R_opt depend on:
max LR? model size? dataset? training steps?
16. Transfer of Optimal Annealing Ratio
Transferability Across LR_max
Holds across Dense & MoE.
Observation 1. Empirically, the optimal annealing ratio follows a power-law
relationship with maximum learning rates for both Dense and MoE models.
17. Transfer of Optimal Annealing Ratio
Transferability Across Models
We verify R_opt scaling across model sizes.
Observation 2. Empirically, the optimal annealing ratio remains stable and transferable
across different model sizes for Dense and MoE.
18. Transfer of Optimal Annealing Ratio
Transferability Across Datasets
We verify R_opt scaling across datasets.
Observation 3. Empirically, the optimal annealing ratio remains consistent across
training and validation sets.
19. Transfer of Optimal Annealing Ratio
Transferability Across Steps
We verify R_opt scaling across training steps T.
Observation 4. Empirically, the relationship between the optimal annealing ratio and
training steps follows the power-law form.
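Observation 4 can be sketched numerically: a power law R_opt = a · T^b becomes a line in log-log space, so it can be fit by ordinary least squares on the logs. The function name and the data below are synthetic illustrations, not the paper's measurements:

```python
import math

def fit_power_law(steps, ratios):
    """Least-squares fit of R = a * T^b in log-log coordinates.

    steps: training-step counts T; ratios: measured optimal annealing
    ratios R_opt at those step counts (hypothetical inputs).
    Returns the fitted (a, b).
    """
    xs = [math.log(t) for t in steps]
    ys = [math.log(r) for r in ratios]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope of the log-log regression line is the exponent b
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)  # intercept recovers the prefactor a
    return a, b
```

On noise-free synthetic power-law data the fit recovers (a, b) exactly, and on real measurements it gives the transferable exponent.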
20. Conclusion
Practical Takeaways
1) Annealing strategies follow predictable scaling laws (forward + annealing effects).
2) Training dynamics are transferable across models, schedulers, steps, and datasets.
3) Small models suffice to tune large models, enabling efficient, principled LLM training.
Code: https://github.com/xmed-lab/fm-annealing
21. Q&A
22. More Technical Content
Follow the "Meituan Tech Team" account for more technical articles.
LLM team positions are open — welcome to apply.