Promoting Efficient Reasoning with a Verifiable Stepwise Reward Mechanism
1. 论文分享——
Promoting Efficient Reasoning with
Verifiable Stepwise Reward
Presenter: Chuhuai Yue
Meituan Business R&D Platform
2. Outline
01 Background and Motivation
02 The Essence of Overthinking
03 Proposed Methodology
04 Experimental Analysis
05 Conclusions and Future Work
3. Background
Current LRMs
1. Longer CoT
2. Step-by-step answers
3. Strong reasoning performance
However...
1. Empty talk
2. Longer latency
3. Wrong answers
4. Repeated reflection
Overthinking!!
Generated tokens for the question "what is the answer of 2+3?" [1]
[1] Chen, X.; Xu, J.; Liang, T.; He, Z.; Pang, J.; Yu, D.; Song, L.; Liu, Q.; Zhou, M.; Zhang, Z.; Wang, R.; Tu, Z.; Mi, H.; and Yu, D. 2025. Do NOT Think That Much for 2+3=? On the Overthinking of Long Reasoning Models. In Forty-second International
Conference on Machine Learning
4. Background
An example of the overthinking issue in the QwQ-32B-Preview model's output response [1]
5. Background
RLVR's binary outcome reward fosters a "better redundant than wrong" tendency, which leads to the overthinking issue.
Existing methods can be broadly divided into two categories:
• predefining a token budget.
• “difficulty-adaptive reasoning”, smarter, yet still budget-driven.
Both require pre-evaluating the task, which not only depends heavily on the accuracy of difficulty
prediction but also lacks the flexibility to handle real-world problems.
6. Motivation
Rethink the Essence of Overthinking Issue
Reasoning problems, especially
mathematical and programming tasks,
are naturally suited to step-by-step
answering.
We dive deeper and rethink the
essence of the "overthinking" issue at the
step level.
7. Motivation
Rethink the Essence of Overthinking Issue
3 stages of reasoning process:
1. Problem restatement
2. Step-by-step solution (worst-hit zone)
3. Answer summarization
Many obviously ineffective steps neither improve
accuracy nor help the model approach the correct
answer, and such cases are not isolated incidents.
8. Motivation
Rethink the Essence of Overthinking Issue
We prompt DeepSeek-R1 to analyze all 500 responses and calculate the frequency of
ineffective steps. The predictions are not perfect, but they are accurate enough for identification.
The numerous ineffective steps are the main cause of overthinking.
The key is to accurately distinguish effective from ineffective steps,
encouraging the former and penalizing the latter.
9. Methodology
Make Stepwise Reward Verifiable
Outcome-only RLVR can’t
reward single steps
Introduce stepwise rewards:
encourage the good steps,
penalize the bad.
Applying a PRM is the usual
trick, but brittle at scale
Propose a Verifiable Stepwise
Reward Mechanism (VSRM)
Combine the flexibility of
stepwise rewards with the
reliability of outcome-based
rewards
10. Methodology
Step Separation
Previous research has attempted to use additional models to identify different solutions or correct answers within the
chain of thought (CoT), but introducing new models brings extra uncertainty. Therefore, we designed a rule-based
step segmentation algorithm.
11. Methodology
Step Separation
Specific tokens such as "however," "thus,"
"so," "but," and "wait" can be used for
segmentation.
Complete process:
• use regular expressions to extract the reasoning content
• skip the initial tokens to avoid segmenting the problem-restatement part
• segment T using the predefined list of special tokens
To ensure readability, we introduce two additional rules:
• at least I_min tokens between two adjacent cuts
• cut only when a special token appears at the start of a sentence
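The rule-based segmentation described above can be sketched roughly as follows. This is a minimal re-implementation under assumptions: the exact token list, the minimum interval `MIN_INTERVAL` (standing in for I_min), and the restatement-skip length are illustrative values, not the paper's settings.

```python
import re

# Illustrative constants -- the paper's actual values may differ.
SPLIT_TOKENS = ("however", "thus", "so", "but", "wait")
MIN_INTERVAL = 20   # minimum number of (whitespace) tokens between two cuts
SKIP_PREFIX = 30    # skip the initial tokens (problem restatement)

# Cut only where a marker token starts a sentence: the split point is
# whitespace preceded by ., !, ? or a newline and followed by a marker.
_PATTERN = re.compile(
    r"(?<=[.!?\n])\s+(?=(?:%s)\b)" % "|".join(SPLIT_TOKENS),
    flags=re.IGNORECASE,
)

def segment_steps(reasoning: str) -> list[str]:
    """Split a chain of thought into steps at sentence-initial markers."""
    steps: list[str] = []
    seen = 0  # tokens emitted so far (crude whitespace tokenization)
    for piece in _PATTERN.split(reasoning):
        # Merge a piece back if the cut came too early: still inside the
        # restatement prefix, or the previous step is shorter than I_min.
        if steps and (seen < SKIP_PREFIX or len(steps[-1].split()) < MIN_INTERVAL):
            steps[-1] += " " + piece
        else:
            steps.append(piece)
        seen += len(piece.split())
    return steps
```

Splitting on a fixed lexical list keeps the pipeline deterministic, which is the point of avoiding an extra segmentation model.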
12. Methodology
Assigning Reward to Intermediate Stage
Based on the segmentation points, we obtain subrollouts and use them as new queries for the model to generate answers.
In this way, the reward at each segmentation point is modeled as the correctness of the answer to the corresponding
subrollout.
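A hypothetical sketch of this step: each segmentation point yields a "subrollout" (the query plus the reasoning prefix up to that point), which is fed back as a new prompt; its reward is the correctness of the answers sampled from it. `generate` and `is_correct` are stand-ins for the actual model call and answer verifier, not real APIs.

```python
def make_subrollouts(query: str, steps: list[str]) -> list[str]:
    """Prefix queries: query + steps[:1], query + steps[:2], ..."""
    return [query + "\n" + " ".join(steps[:k])
            for k in range(1, len(steps) + 1)]

def step_accuracy(subrollout, generate, is_correct, n_samples=8):
    """Average correctness of sampled completions of one subrollout."""
    hits = sum(is_correct(generate(subrollout)) for _ in range(n_samples))
    return hits / n_samples
```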
13. Methodology
Assigning Reward to Intermediate Stage
We sample multiple subsequent answers for each
subrollout and use the average accuracy as the reward.
Taking it a step further, we use the difference in
accuracy between adjacent steps as the reward,
encouraging the main rollout to evolve in the direction
of continuously improving accuracy.
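The difference-based reward can be sketched as below: each step is scored by how much it changes the estimated accuracy relative to the previous step, so steps that make progress get positive reward while steps that stall or hurt get zero or negative reward. The `base` term (accuracy before any step) is an assumption about the boundary case.

```python
def stepwise_rewards(accs: list[float], base: float = 0.0) -> list[float]:
    """accs[i] = average accuracy when answering from subrollout i.

    Returns r[i] = accs[i] - accs[i-1], with accs[-1] taken as `base`.
    """
    rewards = []
    prev = base  # accuracy of the bare query, before any reasoning step
    for acc in accs:
        rewards.append(acc - prev)
        prev = acc
    return rewards
```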
14. Methodology
Assigning Reward to Intermediate Stage
To enrich reward signals, we additionally
introduce a lookahead window mechanism that
propagates subsequent improvements to the
current step, alleviating the problem of overly
sparse reward signals and accelerating training.
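One plausible reading of the lookahead window, sketched below: a step also receives credit for accuracy gains realized within the next W steps, averaged into its own reward, which densifies the signal. The window size and the averaging rule are assumptions, not the paper's exact formula.

```python
def lookahead_rewards(deltas: list[float], window: int = 2) -> list[float]:
    """Propagate near-future accuracy gains back to the current step.

    deltas[i] is the per-step accuracy difference; each output reward
    averages the current delta with the next `window` deltas.
    """
    out = []
    for i in range(len(deltas)):
        future = deltas[i : i + window + 1]  # current step + next `window`
        out.append(sum(future) / len(future))
    return out
```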
15. Methodology
Reinforcement Learning with VSRM
Combining traditional binary outcome
rewards, format rewards, and stepwise
rewards, VSRM is fully compatible with RL
algorithms that support stepwise rewards,
such as PPO and Reinforce++, and can be
seamlessly integrated.
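Assembling the full training signal might look like the sketch below: the verifiable stepwise rewards are combined with the usual binary outcome reward and format reward into a per-step reward vector that stepwise-capable RL algorithms (PPO, Reinforce++) can consume. The weights and the choice to credit outcome/format at the final step are illustrative assumptions.

```python
def vsrm_rewards(step_rewards, outcome_ok, format_ok,
                 w_step=1.0, w_outcome=1.0, w_format=0.1):
    """Combine stepwise, outcome, and format rewards (weights assumed)."""
    rewards = [w_step * r for r in step_rewards]
    # Outcome and format rewards are credited at the final step, as in
    # standard outcome-based RLVR setups.
    rewards[-1] += w_outcome * (1.0 if outcome_ok else 0.0)
    rewards[-1] += w_format * (1.0 if format_ok else 0.0)
    return rewards
```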
16. Experiments
Setup
To comprehensively validate the effectiveness of our method, we selected three mainstream LRMs as the foundation:
•DS-Distill-1.5B
•DS-Distill-7B
•DeepScaleR (all with publicly available weights)
The training framework uses VeRL, and we run both PPO and Reinforce++ algorithms.
For training, we directly reuse the official collection from DeepScaleR, covering difficulty levels from elementary to olympiad.
For evaluation, we selected classic mathematical benchmarks:
•MATH-500
•AIME24 / AIME25
•AMC23
•Minerva
•OlympiadBench
17. Experiments
Results
VSRM achieves an excellent balance between performance and efficiency, demonstrating
its superiority in fundamentally addressing the overthinking issue.
18. Experiments
Results
Ablation experiments illustrate the relationships and roles of each component of our proposed
method, and demonstrate the rationality and effectiveness of VSRM.
19. Experiments
Results
VSRM does not hinder the model’s ability to explore valuable reasoning paths;
on the contrary, it encourages the exploration of effective steps
20. Experiments
Results
VSRM effectively reduces the occurrence of ineffective steps, thereby decreasing output
length and fundamentally alleviating the overthinking problem.
21. Conclusion
Addresses the overthinking problem in LRMs by setting the optimization objective to encourage effective
intermediate steps and penalize ineffective ones.
Proposes a verifiable stepwise reward mechanism (VSRM).
Significantly reduces overthinking while maintaining or slightly improving reasoning performance.
Paper: https://arxiv.org/pdf/2508.10293
Code: https://github.com/1benwu1/VSRM-Efficient-LRMs
See more work from our team (AsX) on Google Scholar
22. Q&A
23. More Technical Content
Follow the "Meituan Technical Team" official account