Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning: A Competence-Difficulty Alignment Perspective
1. Rethinking the Sampling Criteria in Reinforcement Learning for LLM Reasoning:
A Competence-Difficulty Alignment Perspective
AAAI 2026
Meituan Computing and Intelligence Platform Department
2. Background - RL for LLM Reasoning
Recent LLMs (e.g., DeepSeek-R1, OpenAI o1) achieve strong reasoning ability via Reinforcement
Learning.
RL methods such as GRPO amplify reasoning without human annotation.
However, RL training is expensive and hard to scale.
A major bottleneck is low sample efficiency during the rollout phase.
Inefficient sampling leads to:
• Excessive zero-gradient samples
• Wasted GPU computation
• Slow convergence
3. Motivation
Unstable and Biased Estimations of Problem Difficulty
Existing strategies are inspired by Curriculum Learning:
•Curriculum Sampling (offline difficulty labels)
•Prioritized Sampling (pass-rate based)
Core assumption: pass rate ≈ problem difficulty.
However, this assumption is flawed:
•Pass rate is highly unstable
•Single-step observations are noisy
Difficulty estimation becomes biased.
Pass rate fluctuation curves for two problems
4. Motivation
Misalignment Between Model Competence and Problem Difficulty
Problem difficulty is not absolute. Difficulty is relative to model competence
Two failure modes in RL sampling:
•Too-easy problems → pass rate 1 → zero gradients
•Too-hard problems → pass rate 0 → zero gradients
Optimal training requires: Problems that match the model’s current capability
When a group of samples consists entirely of correct or entirely of incorrect solutions, the calculated advantage becomes
zero. As a result, the gradients for these samples also become zero, contributing nothing to the model's training.
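This failure mode can be seen directly in GRPO's group-relative advantage. Below is a minimal sketch (not the authors' implementation; the function name is illustrative) showing that a uniform group of rewards yields all-zero advantages:

```python
import statistics

def group_advantages(rewards):
    """GRPO-style group-relative advantage: each reward is normalized
    by the mean and standard deviation of its rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All-correct or all-incorrect group: every advantage is zero,
        # so the group contributes no gradient signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

print(group_advantages([1, 1, 0, 0]))  # mixed group -> [1.0, 1.0, -1.0, -1.0]
print(group_advantages([1, 1, 1, 1]))  # uniform group -> [0.0, 0.0, 0.0, 0.0]
```

Only groups with mixed outcomes produce non-zero advantages, which is why sampling problems of appropriate difficulty matters.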
5. Methodology - CDAS Framework
Competence-Difficulty Alignment Sampling (CDAS)
Core ideas:
•Stable difficulty estimation via historical aggregation
•Explicit modeling of model competence
•Sampling based on competence–difficulty alignment
CDAS is integrated into RL as a fixed-point system
6. Methodology - Defining Model Competence
Model competence represents the overall solving ability at a given step.
Defined as the negative average difficulty over the dataset:
•Competence increases as problems become easier for the model
Intuition:
•A single scalar summarizes the model’s learning stage
•Enables global alignment decisions
Formula: $c_t = -\frac{1}{|D|} \sum_{x \in D} d_t(x)$
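In code, this scalar competence could be computed as follows (a sketch; the function name and the example difficulty values are illustrative):

```python
def model_competence(difficulties):
    """Competence at the current step: the negative average of the
    per-problem difficulty values over the dataset."""
    return -sum(difficulties.values()) / len(difficulties)

# As problems become easier (difficulties shrink), competence rises:
early = {"p1": 0.8, "p2": 0.6}  # hypothetical difficulty values
late = {"p1": 0.2, "p2": 0.1}
print(model_competence(early))
print(model_competence(late))   # higher than at the earlier step
```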
7. Methodology - Defining Problem Difficulty
Instantaneous difficulty is noisy
CDAS distinguishes:
•Instantaneous difficulty: performance gap at a step
•Stable difficulty: historical aggregation over time
Difficulty is computed as:
•Expected performance − Actual performance
•Historical averaging smooths fluctuations
Leads to robust and unbiased difficulty estimation
Pass Rate vs Step
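The two quantities above can be sketched as follows (class and method names are illustrative, assuming a simple running mean over instantaneous gaps):

```python
class StableDifficulty:
    """Stable difficulty for one problem: a running mean over the
    history of instantaneous gaps (expected - actual performance)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, expected, actual):
        # Instantaneous difficulty at this step: the performance gap.
        gap = expected - actual
        # Historical averaging smooths single-step noise.
        self.total += gap
        self.count += 1
        return self.total / self.count

d = StableDifficulty()
for expected, actual in [(0.5, 0.0), (0.5, 1.0), (0.5, 0.0), (0.5, 1.0)]:
    stable = d.update(expected, actual)
print(stable)  # noisy pass/fail outcomes average out over the history
```

A single step would swing between +0.5 and -0.5 here, while the historical average settles near zero.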
8. Methodology - Alignment-Based Sampling
Alignment defined as: Distance between competence and problem difficulty.
Sampling strategy:
•Select problems closest to the competence frontier
Use symmetric sampling:
•Slightly easier problems
•Slightly harder problems
The system forms a fixed-point iteration with guaranteed convergence under mild conditions.
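The selection rule can be sketched as below. Since competence is defined as the negated average difficulty, the distance is taken here as |d(x) + c| (an assumption about the exact form); sorting by this distance naturally yields a symmetric mix of slightly easier and slightly harder problems around the frontier:

```python
def cdas_sample(difficulties, competence, k):
    """Select the k problems whose difficulty lies closest to the
    competence frontier, i.e. with minimal |d(x) + c|."""
    return sorted(difficulties, key=lambda x: abs(difficulties[x] + competence))[:k]

# Illustrative difficulty values; competence is the negated average difficulty.
difficulties = {"easy": 0.1, "frontier": 0.5, "hard": 0.9}
competence = -0.5
print(cdas_sample(difficulties, competence, 2))  # "frontier" ranks first
```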
9. Methodology - RL with CDAS
Note that since |B| is usually much smaller than the size of the training set, performing a full update of
problem difficulties at each step will lead to heavy computational overhead.
Instead, for each problem x, we record the number of times it has been sampled as tn(x) and update its difficulty only when it is sampled.
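This lazy update could look roughly like the following (a sketch; the incremental-mean form is an assumption consistent with historical averaging, and all names are illustrative):

```python
from collections import defaultdict

class LazyDifficultyTable:
    """Per-problem difficulty table that is updated only when a problem
    is sampled, with tn(x) counting how often x has been drawn."""

    def __init__(self):
        self.difficulty = {}
        self.tn = defaultdict(int)  # tn(x): times problem x was sampled

    def observe(self, x, gap):
        """Fold one new instantaneous gap into x's running average,
        instead of sweeping the whole training set every step."""
        n = self.tn[x]
        prev = self.difficulty.get(x, 0.0)
        self.difficulty[x] = (prev * n + gap) / (n + 1)
        self.tn[x] = n + 1

table = LazyDifficultyTable()
table.observe("x1", 0.4)
table.observe("x1", 0.8)
print(table.difficulty["x1"], table.tn["x1"])
```

Each update costs O(1) per sampled problem, avoiding a full pass over the training set at every step.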
10. Experiment
Task: Mathematical reasoning RL
Dataset: MATH
Model: Qwen2.5-7B
Algorithm: GRPO
Baselines:
•Random Sampling
•Curriculum Sampling
•Prioritized Sampling
•Dynamic Sampling
Performance comparison across different sampling methods on various math benchmarks.
Metrics are Avg@32 for AIME and standard accuracy for the others. Best results are shown in bold and second-best are underlined.
11. Experiment
CDAS achieves the best average accuracy across benchmarks, outperforming strong baselines including Dynamic Sampling.
Achieves comparable or better performance with:
•~50% fewer training steps
•~57% reduction in training overhead vs Dynamic Sampling
CDAS achieves the best performance while demonstrating significant efficiency advantages
compared to the strong Dynamic Sampling baseline.
12. Analysis: Sample Utility
CDAS implicitly reduces zero-gradient samples.
Compared to baselines:
• Fewer problems with pass rate = 0 or 1
Leads to:
• Higher effective gradient signal
• Faster learning
The proportion of zero-gradient problems in the sampled batch.
13. Analysis: Difficulty vs Pass rate
Pass rate alone cannot distinguish learning trajectories.
CDAS difficulty incorporates:
• Historical performance
• Learning dynamics
Problems with identical final pass rates can have very different difficulties, and CDAS captures this nuance.
Problem difficulty vs. pass rate in CDAS.
14. Generalization
Generalizes to:
•Code generation tasks
•Larger models (Qwen2.5-14B)
•Different architectures (OctoThinker)
Accuracy comparison on LiveCodeBench v5
Achieves:
•Consistent gains over Random Sampling
•Comparable performance to Dynamic Sampling at lower cost
Generalization performance across different architectures and model sizes
15. Conclusion
We rethink RL sampling from a competence–difficulty alignment perspective
CDAS provides:
•Stable difficulty estimation
•Dynamic alignment with model capability
•A principled fixed-point formulation
Achieves:
•Higher accuracy
•Better efficiency
•Strong generalization
🔑 Key takeaway: Efficient RL requires matching the right problems to the right model at the right time.
16. Q&A
17. More Technical Content
Follow the "Meituan Tech Team" (美团技术团队) official account