1. Scaling and Transferability of Annealing Strategies
in Large Language Model Training
Siqi Wang, Zhengyu Chen, Teng Xiao, Zheqi Lv, Jinluan Yang,
Xunliang Cai, Jingang Wang, Xiaomeng Li
AAAI 2026 · Main Technical Track
Meituan · HKUST · AI2
Meituan LongCat Team
2. Outline
• Motivation
• Related Work
• Forward-Momentum Scaling Law
• Modeling Annealing Momentum
• Training Loss Curve Fitting
• Transfer of Optimal Annealing Ratio
• Conclusion
3. Motivation
Motivation: Why Annealing Matters Beyond Tokens
Key observation:
Same model size + same token budget with different LR schedules / batch sizes
→ very different loss trajectories.
Problem:
1) Existing scaling laws predict only the final loss;
2) The training dynamics along the way are under-modeled.
4. Motivation
Empirical Observation: Steps vs Tokens
1) Loss curves diverge when plotted vs tokens
2) Loss curves align when plotted vs training steps (for batch size ≥ optimal threshold)
Implication
1) Training steps are a more stable tracker of optimization progress
5. Motivation
What Determines Training Dynamics?
We study three key factors:
1) Forward effect of training steps
2) Annealing effect of LR decay
3) Model size dependence
Goal:
1) build a unified, predictive model of loss curves
2) enable transferable annealing strategies
6. Related Work
From Scaling Laws to Full Training Curves
Classic scaling law:
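The formula itself did not survive extraction; the classic Chinchilla-style form from the cited Hoffmann et al. paper predicts final loss from parameter count N and token count D alone:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Because it depends only on N and D, two runs with the same budget but different LR schedules receive identical predictions.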
Limitation:
1) Cannot distinguish different schedulers
2) Cannot model annealing behavior
➡ Need a dynamics-aware formulation
Scaling Laws for Neural Language Models. Kaplan et al. (arXiv:2001.08361)
Training Compute-Optimal Large Language Models. Hoffmann et al. (arXiv:2203.15556)
7. Related Work
• LLM training scaling laws can be reliably reproduced using a simple constant learning-rate plus cooldown schedule (WSD scheduler).
Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations. Hägele et al. (arXiv:2405.18392)
8. Related Work
• A scaling law that models the entire loss trajectory of LLM training by integrating learning-rate annealing into loss prediction.
Scaling Law with Learning Rate Annealing. Tissue et al. (arXiv:2408.11029)
9. Related Work
• Common learning-rate schedules for large model training (like cosine and WSD) closely match performance bounds from non-smooth convex optimization theory.
Convex Theory & LR Scheduling for Large Models. Schaipp et al. (arXiv:2501.18965)
10. Forward-Momentum Scaling Law
We propose:
Where:
S = integral of learning rate over steps
M = accumulated annealing momentum
N = model size
Key idea:
Forward progress + annealing refinement
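The proposed equation was lost in extraction. As a reconstruction only (not the paper's verified formula), a form consistent with the variables defined above and with the annealing-aware law of Tissue et al. (arXiv:2408.11029) would combine a power-law forward term in S, an annealing-momentum reduction, and a power-law model-size term; all coefficients and exponents here are placeholders:

```latex
% Hypothetical sketch; L_0, A, C, B, \alpha, \gamma are placeholder constants.
L(S, M, N) \approx L_0 + A \cdot S^{-\alpha} - C \cdot M + B \cdot N^{-\gamma}
```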
11. Modeling Annealing Momentum
Instead of multiplicative accumulation, we use Adam-style momentum integration.
Firstly, the momentum m_t and second moment v_t are updated at each step.
Then bias correction is applied to both moments.
The cumulative momentum M_t is updated from the bias-corrected moments.
Benefits:
1) More stable numerically
2) Robust across batch sizes
3) Works with irregular step counts
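The update equations themselves were lost in extraction, so here is a minimal sketch of what Adam-style accumulation of an annealing signal could look like. The function name `annealing_momentum` and the choice of the per-step LR drop as the integrated signal are illustrative assumptions, not the paper's exact definitions:

```python
import math

def annealing_momentum(lrs, beta1=0.9, beta2=0.999, eps=1e-8):
    """Accumulate an Adam-style annealing momentum M_t over an LR schedule.

    lrs: learning rate at each step. The per-step signal is the LR drop
    g_t = lrs[t-1] - lrs[t] (an assumption for illustration).
    """
    m = v = M = 0.0
    history = []
    for t in range(1, len(lrs)):
        g = lrs[t - 1] - lrs[t]              # LR decay at this step
        m = beta1 * m + (1 - beta1) * g      # first moment update
        v = beta2 * v + (1 - beta2) * g * g  # second moment update
        m_hat = m / (1 - beta1 ** t)         # bias correction
        v_hat = v / (1 - beta2 ** t)
        M += m_hat / (math.sqrt(v_hat) + eps)  # cumulative momentum M_t
        history.append(M)
    return history
```

With a constant LR the accumulated momentum stays at zero; once the LR starts decaying, M grows, matching the intuition that annealing contributes an extra loss reduction on top of forward progress.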
12. Training Loss Curve Fitting
Across Model Sizes
Dense & MoE models
1) 50M → 1B Dense
2) 100M → 1.5B MoE
Results:
1) Mean prediction error < 2%
2) Works across architectures
Observation 1. Empirically, for model size dependence, we verified that loss curves
follow a power-law trend.
13. Training Loss Curve Fitting
Across Batch Sizes
Batch size: For B ≥ B_opt, step-based loss curves converge.
Observation 2. Empirically, for the same model, a wide range of batch sizes (greater than B_opt) with the same learning rate schedule produce similar training curves vs training steps.
An Empirical Model of Large-Batch Training, McCandlish et al. (arXiv:1812.06162)
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. Hu et al. (arXiv:2404.06395)
14. Training Loss Curve Fitting
Across Schedulers
Schedulers: Cosine ↔ WSD loss curves can predict each other
Observation 3. Empirically, different schedulers affect the loss curve in a predictable manner, even though they produce distinct curve patterns.
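To make the comparison concrete, here is a sketch of the two schedule shapes being compared; the exact parameterization (e.g. `anneal_ratio`, linear cooldown) is illustrative, not taken from the paper:

```python
import math

def cosine_lr(t, T, lr_max, lr_min=0.0):
    """Cosine decay from lr_max at step 0 to lr_min at step T."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def wsd_lr(t, T, lr_max, anneal_ratio=0.1, warmup=0):
    """Warmup-Stable-Decay: warmup, constant plateau, then a linear
    cooldown over the final anneal_ratio fraction of training."""
    decay_start = int(T * (1 - anneal_ratio))
    if t < warmup:
        return lr_max * t / max(warmup, 1)       # linear warmup
    if t < decay_start:
        return lr_max                            # stable plateau
    return lr_max * (T - t) / max(T - decay_start, 1)  # linear decay
```

A cosine run decays continuously from step 0, while WSD holds lr_max and compresses all annealing into the final fraction of steps, which is why the two loss curves look different yet remain mutually predictable under a dynamics-aware law.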
15. Transfer of Optimal Annealing Ratio
Annealing Ratio
Definition: The fraction of training spent in the decay phase of the learning rate
schedule.
For WSD scheduler:
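The slide's formula was lost in extraction; from the definition above, the annealing ratio of a WSD run is presumably the cooldown fraction (the symbols T_decay and T_total are named here for illustration):

```latex
R = \frac{T_{\text{decay}}}{T_{\text{total}}}
```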
Question:
How should R_opt depend on:
max LR? model size? dataset? training steps?
16. Transfer of Optimal Annealing Ratio
Transferability Across LR_max
Holds across Dense & MoE.
Observation 1. Empirically, the optimal annealing ratio follows a power-law
relationship with maximum learning rates for both Dense and MoE models.
17. Transfer of Optimal Annealing Ratio
Transferability Across Models
We verify R_opt scaling across model sizes.
Observation 2. Empirically, the optimal annealing ratio remains stable and transferable
across different model sizes for Dense and MoE.
18. Transfer of Optimal Annealing Ratio
Transferability Across Datasets
We verify R_opt scaling across datasets.
Observation 3. Empirically, the optimal annealing ratio remains consistent across
training and validation sets.
19. Transfer of Optimal Annealing Ratio
Transferability Across Steps
We verify R_opt scaling across training steps T.
Observation 4. Empirically, the relationship between the optimal annealing ratio and
training steps follows the power-law form.
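Observation 4 can be sketched numerically: a power law R_opt = a · T^b becomes a line in log-log space, so it can be fit by ordinary least squares on the logs. The function name and the data below are synthetic illustrations, not the paper's measurements:

```python
import math

def fit_power_law(steps, ratios):
    """Least-squares fit of R = a * T^b in log-log coordinates.

    steps: training-step counts T; ratios: measured optimal annealing
    ratios R_opt at those step counts (hypothetical inputs).
    Returns the fitted (a, b).
    """
    xs = [math.log(t) for t in steps]
    ys = [math.log(r) for r in ratios]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope of the log-log regression line is the exponent b
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)  # intercept recovers the prefactor a
    return a, b
```

On noise-free synthetic power-law data the fit recovers (a, b) exactly, and on real measurements it gives the transferable exponent.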
20. Conclusion
Practical Takeaways
1) Annealing strategies follow predictable scaling laws (forward + annealing effects).
2) Training dynamics are transferable across models, schedulers, steps, and datasets.
3) Small models suffice to tune large models, enabling efficient, principled LLM training.
Code: https://github.com/xmed-lab/fm-annealing
21. Q&A
22. More Technical Content
Follow the "Meituan Tech Team" account for more technical articles.
LLM team positions are open — welcome to apply.