Promoting Efficient Reasoning with a Verifiable Stepwise Reward Mechanism
1. 论文分享——
Promoting Efficient Reasoning with
Verifiable Stepwise Reward
Presenter: Chuhuai Yue
Meituan Business R&D Platform
2. Outline
01 Background and Motivation
02 The Essence of Overthinking
03 Proposed Methodology
04 Experimental Analysis
05 Conclusions and Future Work
3. Background
Current LRMs
1. Longer CoT
2. Step-by-step answers
3. Strong reasoning performance
However...
1. Empty talk
2. Longer latency
3. Wrong answers
4. Repeated reflection
Overthinking!!
Generated tokens for the question "what is the answer of 2+3?" [1]
[1] Chen, X.; Xu, J.; Liang, T.; He, Z.; Pang, J.; Yu, D.; Song, L.; Liu, Q.; Zhou, M.; Zhang, Z.; Wang, R.; Tu, Z.; Mi, H.; and Yu, D. 2025. Do NOT Think That Much for 2+3=? On the Overthinking of Long Reasoning Models. In Forty-second International
Conference on Machine Learning
4. Background
An example of the overthinking issue in the QwQ-32B-Preview model's output response [1]
5. Background
RLVR's binary outcome reward fosters a "better redundant than wrong" tendency, which leads to the overthinking issue.
Existing methods can be broadly divided into two categories:
• predefining a token budget.
• “difficulty-adaptive reasoning”, smarter, yet still budget-driven.
Both require pre-evaluating the task, which not only depends heavily on the accuracy of difficulty
prediction but also lacks the flexibility to handle real-world problems.
6. Motivation
Rethink the Essence of Overthinking Issue
Reasoning problems, especially
mathematical and programming tasks,
are naturally suited to step-by-step
answering.
We dive deeper and rethink the
essence of the "overthinking" issue at the
step level.
7. Motivation
Rethink the Essence of Overthinking Issue
3 stages of reasoning process:
1. Problem restatement
2. Step-by-step solution (worst-hit zone)
3. Answer summarization
Many obviously ineffective steps neither improve
accuracy nor help the model approach the correct
answer, and such cases are not isolated incidents.
8. Motivation
Rethink the Essence of Overthinking Issue
We prompt DeepSeek-R1 to analyze all 500 responses and calculate the frequency of
ineffective steps. The predictions are not perfect, but they are accurate enough for identification.
The numerous ineffective steps are the main cause of overthinking.
The key is to accurately distinguish effective from ineffective steps,
encouraging the former and penalizing the latter.
9. Methodology
Make Stepwise Reward Verifiable
Outcome-only RLVR can’t
reward single steps
Introduce stepwise rewards:
encourage the good steps,
penalize the bad.
Applying a PRM is the usual
trick, but brittle at scale
Propose a Verifiable Stepwise
Reward Mechanism (VSRM)
Combine the flexibility of
stepwise rewards with the
reliability of outcome-based
rewards
10. Methodology
Step Separation
Previous research has attempted to use additional models to identify different solutions or correct answers within the
chain of thought (CoT), but introducing new models brings extra uncertainty. Therefore, we designed a rule-based
step segmentation algorithm.
11. Methodology
Step Separation
Specific tokens such as "however," "thus,"
"so," "but," and "wait" can be used for
segmentation.
Complete process:
• use regular expressions to extract the reasoning content
• skip the initial tokens to avoid segmenting the problem-restatement part
• segment T using the predefined list of special tokens
To ensure readability, we introduce two additional rules:
• at least I_min tokens between two adjacent cuts
• cut only when a special token appears at the start of a sentence
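The rule-based segmentation described above can be sketched roughly as follows. This is a minimal re-implementation under assumptions: the exact token list, the minimum interval `MIN_INTERVAL` (standing in for I_min), and the restatement-skip length are illustrative values, not the paper's settings.

```python
import re

# Illustrative constants -- the paper's actual values may differ.
SPLIT_TOKENS = ("however", "thus", "so", "but", "wait")
MIN_INTERVAL = 20   # minimum number of (whitespace) tokens between two cuts
SKIP_PREFIX = 30    # skip the initial tokens (problem restatement)

# Cut only where a marker token starts a sentence: the split point is
# whitespace preceded by ., !, ? or a newline and followed by a marker.
_PATTERN = re.compile(
    r"(?<=[.!?\n])\s+(?=(?:%s)\b)" % "|".join(SPLIT_TOKENS),
    flags=re.IGNORECASE,
)

def segment_steps(reasoning: str) -> list[str]:
    """Split a chain of thought into steps at sentence-initial markers."""
    steps: list[str] = []
    seen = 0  # tokens emitted so far (crude whitespace tokenization)
    for piece in _PATTERN.split(reasoning):
        # Merge a piece back if the cut came too early: still inside the
        # restatement prefix, or the previous step is shorter than I_min.
        if steps and (seen < SKIP_PREFIX or len(steps[-1].split()) < MIN_INTERVAL):
            steps[-1] += " " + piece
        else:
            steps.append(piece)
        seen += len(piece.split())
    return steps
```

Splitting on a fixed lexical list keeps the pipeline deterministic, which is the point of avoiding an extra segmentation model.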
12. Methodology
Assigning Reward to Intermediate Stage
Based on the segmentation points, we obtain subrollouts and use them as new queries for the model to generate answers.
In this way, the reward at each segmentation point is modeled as the correctness of the answer to the corresponding
subrollout.
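A hypothetical sketch of this step: each segmentation point yields a "subrollout" (the query plus the reasoning prefix up to that point), which is fed back as a new prompt; its reward is the correctness of the answers sampled from it. `generate` and `is_correct` are stand-ins for the actual model call and answer verifier, not real APIs.

```python
def make_subrollouts(query: str, steps: list[str]) -> list[str]:
    """Prefix queries: query + steps[:1], query + steps[:2], ..."""
    return [query + "\n" + " ".join(steps[:k])
            for k in range(1, len(steps) + 1)]

def step_accuracy(subrollout, generate, is_correct, n_samples=8):
    """Average correctness of sampled completions of one subrollout."""
    hits = sum(is_correct(generate(subrollout)) for _ in range(n_samples))
    return hits / n_samples
```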
13. Methodology
Assigning Reward to Intermediate Stage
We sample multiple subsequent answers for each
subrollout and use the average accuracy as the reward.
Taking it a step further, we use the difference in
accuracy between adjacent steps as the reward,
encouraging the main rollout to evolve in the direction
of continuously improving accuracy.
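The difference-based reward can be sketched as below: each step is scored by how much it changes the estimated accuracy relative to the previous step, so steps that make progress get positive reward while steps that stall or hurt get zero or negative reward. The `base` term (accuracy before any step) is an assumption about the boundary case.

```python
def stepwise_rewards(accs: list[float], base: float = 0.0) -> list[float]:
    """accs[i] = average accuracy when answering from subrollout i.

    Returns r[i] = accs[i] - accs[i-1], with accs[-1] taken as `base`.
    """
    rewards = []
    prev = base  # accuracy of the bare query, before any reasoning step
    for acc in accs:
        rewards.append(acc - prev)
        prev = acc
    return rewards
```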
14. Methodology
Assigning Reward to Intermediate Stage
To enrich reward signals, we additionally
introduce a lookahead window mechanism that
propagates subsequent improvements to the
current step, alleviating the problem of overly
sparse reward signals and accelerating training.
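One plausible reading of the lookahead window, sketched below: a step also receives credit for accuracy gains realized within the next W steps, averaged into its own reward, which densifies the signal. The window size and the averaging rule are assumptions, not the paper's exact formula.

```python
def lookahead_rewards(deltas: list[float], window: int = 2) -> list[float]:
    """Propagate near-future accuracy gains back to the current step.

    deltas[i] is the per-step accuracy difference; each output reward
    averages the current delta with the next `window` deltas.
    """
    out = []
    for i in range(len(deltas)):
        future = deltas[i : i + window + 1]  # current step + next `window`
        out.append(sum(future) / len(future))
    return out
```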
15. Methodology
Reinforcement Learning with VSRM
Combining traditional binary outcome
rewards, format rewards, and stepwise
rewards, VSRM is fully compatible with RL
algorithms that support stepwise rewards,
such as PPO and Reinforce++, and can be
seamlessly integrated.
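Assembling the full training signal might look like the sketch below: the verifiable stepwise rewards are combined with the usual binary outcome reward and format reward into a per-step reward vector that stepwise-capable RL algorithms (PPO, Reinforce++) can consume. The weights and the choice to credit outcome/format at the final step are illustrative assumptions.

```python
def vsrm_rewards(step_rewards, outcome_ok, format_ok,
                 w_step=1.0, w_outcome=1.0, w_format=0.1):
    """Combine stepwise, outcome, and format rewards (weights assumed)."""
    rewards = [w_step * r for r in step_rewards]
    # Outcome and format rewards are credited at the final step, as in
    # standard outcome-based RLVR setups.
    rewards[-1] += w_outcome * (1.0 if outcome_ok else 0.0)
    rewards[-1] += w_format * (1.0 if format_ok else 0.0)
    return rewards
```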
16. Experiments
Setup
To comprehensively validate the effectiveness of our method, we selected three mainstream LRMs as the foundation:
•DS-Distill-1.5B
•DS-Distill-7B
•DeepScaleR (all with publicly available weights)
The training framework uses VeRL, and we run both PPO and Reinforce++ algorithms.
For training, we directly reuse the official collection from DeepScaleR, covering difficulty levels from elementary to olympiad.
For evaluation, we selected classic mathematical benchmarks:
•MATH-500
•AIME24 / AIME25
•AMC23
•Minerva
•OlympiadBench
17. Experiments
Results
VSRM achieves an excellent balance between performance and efficiency, demonstrating
its superiority in fundamentally addressing the overthinking issue.
18. Experiments
Results
Ablation experiments illustrate the relationships and roles of each component of our proposed
method, and demonstrate the rationality and effectiveness of VSRM.
19. Experiments
Results
VSRM does not hinder the model’s ability to explore valuable reasoning paths;
on the contrary, it encourages the exploration of effective steps
20. Experiments
Results
VSRM effectively reduces the occurrence of ineffective steps, thereby decreasing output
length and fundamentally alleviating the overthinking problem.
21. Conclusion
Addresses the overthinking problem in LRMs by setting the optimization objective to encourage effective
intermediate steps and penalize ineffective ones.
Proposes a verifiable stepwise reward mechanism (VSRM).
Significantly reduces overthinking while maintaining or slightly improving reasoning performance.
Paper: https://arxiv.org/pdf/2508.10293
Code: https://github.com/1benwu1/VSRM-Efficient-LRMs
See more work from our team (AsX) on Google Scholar
22. Q&A
23. More Technical Content
Follow the "Meituan Technical Team" official account