Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
1. Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning:
A Competence-Difficulty Alignment Perspective
AAAI 2026
Meituan Computing and Intelligence Platform Department
2. Background - RL for LLM Reasoning
Recent LLMs (e.g., DeepSeek-R1, OpenAI o1) achieve strong reasoning ability via Reinforcement
Learning.
RL methods such as GRPO amplify reasoning without human annotation.
However, RL training is expensive and hard to scale.
A major bottleneck is low sample efficiency during the rollout phase.
Inefficient sampling leads to:
• Excessive zero-gradient samples
• Wasted GPU computation
• Slow convergence
3. Motivation
Unstable and Biased Estimations of Problem Difficulty
Existing strategies are inspired by Curriculum Learning:
•Curriculum Sampling (offline difficulty labels)
•Prioritized Sampling (pass-rate based)
Core assumption: pass rate ≈ problem difficulty.
However, this assumption is flawed:
•Pass rate is highly unstable
•Single-step observations are noisy
Difficulty estimation becomes biased.
Pass rate fluctuation curves for two problems
4. Motivation
Misalignment Between Model Competence and Problem Difficulty
Problem difficulty is not absolute. Difficulty is relative to model competence
Two failure modes in RL sampling:
•Too-easy problems → pass rate 1 → zero gradients
•Too-hard problems → pass rate 0 → zero gradients
Optimal training requires: Problems that match the model’s current capability
When a group of samples consists entirely of correct or entirely of incorrect solutions, the calculated advantage becomes
zero. As a result, the gradients for these samples also become zero, contributing nothing to the model's training.
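This failure mode can be seen directly in GRPO's group-relative advantage. Below is a minimal sketch (not the authors' implementation; the function name is illustrative) showing that a uniform group of rewards yields all-zero advantages:

```python
import statistics

def group_advantages(rewards):
    """GRPO-style group-relative advantage: each reward is normalized
    by the mean and standard deviation of its rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All-correct or all-incorrect group: every advantage is zero,
        # so the group contributes no gradient signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

print(group_advantages([1, 1, 0, 0]))  # mixed group -> [1.0, 1.0, -1.0, -1.0]
print(group_advantages([1, 1, 1, 1]))  # uniform group -> [0.0, 0.0, 0.0, 0.0]
```

Only groups with mixed outcomes produce non-zero advantages, which is why sampling problems of appropriate difficulty matters.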
5. Methodology - CDAS Framework
Competence-Difficulty Alignment Sampling (CDAS)
Core ideas:
•Stable difficulty estimation via historical aggregation
•Explicit modeling of model competence
•Sampling based on competence–difficulty alignment
CDAS is integrated into RL as a fixed-point system
6. Methodology - Defining Model Competence
Model competence represents the overall solving ability at a given step.
Defined as the negative average difficulty over the dataset:
•Competence increases as problems become easier for the model
Intuition:
•A single scalar summarizes the model’s learning stage
•Enables global alignment decisions
Formula: $c_t = -\frac{1}{|D|} \sum_{x \in D} d_t(x)$
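In code, this scalar competence could be computed as follows (a sketch; the function name and the example difficulty values are illustrative):

```python
def model_competence(difficulties):
    """Competence at the current step: the negative average of the
    per-problem difficulty values over the dataset."""
    return -sum(difficulties.values()) / len(difficulties)

# As problems become easier (difficulties shrink), competence rises:
early = {"p1": 0.8, "p2": 0.6}  # hypothetical difficulty values
late = {"p1": 0.2, "p2": 0.1}
print(model_competence(early))
print(model_competence(late))   # higher than at the earlier step
```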
7. Methodology - Defining Problem Difficulty
Instantaneous difficulty is noisy
CDAS distinguishes:
•Instantaneous difficulty: performance gap at a step
•Stable difficulty: historical aggregation over time
Difficulty is computed as:
•Expected performance − Actual performance
•Historical averaging smooths fluctuations
Leads to robust and unbiased difficulty estimation
Pass Rate vs Step
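The two quantities above can be sketched as follows (class and method names are illustrative, assuming a simple running mean over instantaneous gaps):

```python
class StableDifficulty:
    """Stable difficulty for one problem: a running mean over the
    history of instantaneous gaps (expected - actual performance)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, expected, actual):
        # Instantaneous difficulty at this step: the performance gap.
        gap = expected - actual
        # Historical averaging smooths single-step noise.
        self.total += gap
        self.count += 1
        return self.total / self.count

d = StableDifficulty()
for expected, actual in [(0.5, 0.0), (0.5, 1.0), (0.5, 0.0), (0.5, 1.0)]:
    stable = d.update(expected, actual)
print(stable)  # noisy pass/fail outcomes average out over the history
```

A single step would swing between +0.5 and -0.5 here, while the historical average settles near zero.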
8. Methodology - Alignment-Based Sampling
Alignment defined as: Distance between competence and problem difficulty.
Sampling strategy:
•Select problems closest to the competence frontier
Use symmetric sampling:
•Slightly easier problems
•Slightly harder problems
The system forms a fixed-point iteration with guaranteed convergence under mild conditions.
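The selection rule can be sketched as below. Since competence is defined as the negated average difficulty, the distance is taken here as |d(x) + c| (an assumption about the exact form); sorting by this distance naturally yields a symmetric mix of slightly easier and slightly harder problems around the frontier:

```python
def cdas_sample(difficulties, competence, k):
    """Select the k problems whose difficulty lies closest to the
    competence frontier, i.e. with minimal |d(x) + c|."""
    return sorted(difficulties, key=lambda x: abs(difficulties[x] + competence))[:k]

# Illustrative difficulty values; competence is the negated average difficulty.
difficulties = {"easy": 0.1, "frontier": 0.5, "hard": 0.9}
competence = -0.5
print(cdas_sample(difficulties, competence, 2))  # "frontier" ranks first
```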
9. Methodology - RL with CDAS
Note that since |B| is usually much smaller than the size of the training set, performing a full update of
problem difficulties at each step will lead to heavy computational overhead.
Instead, for each problem x, we record the number of times it has been sampled as tn(x) and update its difficulty only when it is sampled.
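This lazy update could look roughly like the following (a sketch; the incremental-mean form is an assumption consistent with historical averaging, and all names are illustrative):

```python
from collections import defaultdict

class LazyDifficultyTable:
    """Per-problem difficulty table that is updated only when a problem
    is sampled, with tn(x) counting how often x has been drawn."""

    def __init__(self):
        self.difficulty = {}
        self.tn = defaultdict(int)  # tn(x): times problem x was sampled

    def observe(self, x, gap):
        """Fold one new instantaneous gap into x's running average,
        instead of sweeping the whole training set every step."""
        n = self.tn[x]
        prev = self.difficulty.get(x, 0.0)
        self.difficulty[x] = (prev * n + gap) / (n + 1)
        self.tn[x] = n + 1

table = LazyDifficultyTable()
table.observe("x1", 0.4)
table.observe("x1", 0.8)
print(table.difficulty["x1"], table.tn["x1"])
```

Each update costs O(1) per sampled problem, avoiding a full pass over the training set at every step.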
10. Experiment
Task: Mathematical reasoning RL
Dataset: MATH
Model: Qwen2.5-7B
Algorithm: GRPO
Baselines:
•Random Sampling
•Curriculum Sampling
•Prioritized Sampling
•Dynamic Sampling
Performance comparison across different sampling methods on various math benchmarks.
Metrics are Avg@32 for AIME and standard accuracy for the others. Best results are shown in bold and second-best are underlined.
11. Experiment
CDAS achieves the best average accuracy across benchmarks, outperforming strong baselines including Dynamic Sampling.
Achieves comparable or better performance with:
•~50% fewer training steps
•~57% reduction in training overhead vs Dynamic Sampling
CDAS achieves the best performance while demonstrating significant efficiency advantages
compared to the strong Dynamic Sampling baseline.
12. Analysis: Sample Utility
CDAS implicitly reduces zero-gradient samples.
Compared to baselines:
• Fewer problems with pass rate = 0 or 1
Leads to:
• Higher effective gradient signal
• Faster learning
The proportion of zero-gradient problems in the sampled batch.
13. Analysis: Difficulty vs Pass rate
Pass rate alone cannot distinguish learning trajectories.
CDAS difficulty incorporates:
• Historical performance
• Learning dynamics
Problems with identical final pass rates can have very different difficulties, and CDAS captures this nuance.
Problem difficulty vs. pass rate in CDAS.
14. Generalization
Generalizes to:
•Code generation tasks
•Larger models (Qwen2.5-14B)
•Different architectures (OctoThinker)
Accuracy comparison on LiveCodeBench v5
Achieves:
•Consistent gains over Random Sampling
•Comparable performance to Dynamic Sampling at lower cost
Generalization performance across different architectures and model sizes
15. Conclusion
We rethink RL sampling from a competence–difficulty alignment perspective
CDAS provides:
•Stable difficulty estimation
•Dynamic alignment with model capability
•A principled fixed-point formulation
Achieves:
•Higher accuracy
•Better efficiency
•Strong generalization
🔑 Key takeaway: Efficient RL requires matching the right problems to the right model at the right time.
16. Q&A
17. More Technical Content
Follow the "Meituan Tech Team" (美团技术团队) official account