Process Reward Models: Math to Code
1. Process Reward Models:
Math to Code
Zhengyu Chen, Yudong Wang, Teng Xiao, Ruochen Zhou, Xuesheng Yang,
Wei Wang, Zhifang Sui, Jingang Wang
Meituan LongCat Team
2. CONTENTS
01 Introduction & Motivation
02 Related Works
03 Training Pipeline
04 Scaling Laws
05 Test-Time Strategies
06 Cross-Domain Generalization
07 Key Takeaways & Future Work
3. 01
Introduction & Motivation
4. Why Process Reward Models Matter
Granular Step Feedback
Unlike outcome-only verifiers, PRMs score every reasoning step, catching errors early, reducing
hallucinations, and enabling iterative self-correction during generation.
Cross-Domain Promise
Proven in math, PRMs now raise the question: can
step-level supervision transfer to code, and how
should we scale training vs. test-time compute?
5. Problem Scope & Core Questions
Compute Scaling
01
How do pre-training and reward-training FLOPs trade off against downstream
accuracy on complex reasoning tasks?
02
Test-Time Search
Which strategy—Best-of-N, Beam, MCTS, Majority Vote—maximizes
correctness per token or per second at inference?
03
Domain Transfer
Does a PRM trained exclusively on math datasets rival or surpass one trained on
code data when evaluated on HumanEval+, MBPP+, and LiveCodeBench?
6. 02
Training Pipeline
7. Process Reward Model
Core Mechanism: Enhance reasoning capabilities by providing intermediate feedback.
Let’s Verify Step by Step (Lightman et al. 2023).
MathShepherd: Minimizes manual effort via automated annotation (Wang et al. 2024).
8. Test Time Scaling
Compute Efficiency: Efficient allocation of test-time compute can significantly
outperform larger models.
9. Scaling of RL
Distill the gains from test-time exploration back into the model itself.
10. 03
Training Pipeline
11. Setup
Data Collection
1.2M multi-domain reasoning chains harvested from PRM800K, Math-Shepherd,
TACO, and APPS, ensuring varied difficulty and topic coverage.
LLM Simulation
Rollouts are generated with Qwen2.5; Monte-Carlo estimation plus binary
search pinpoints the first error step, creating silver labels at scale.
Consensus Filtering
An ensemble of three LLMs retains only steps where all agree on
correctness, cutting label noise by 18% versus single-model annotation.
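Under the monotonicity assumption that a chain stays wrong once it goes wrong, the binary-search localization of the first error step can be sketched as follows; `prefix_ok` is a hypothetical oracle standing in for the Monte-Carlo rollout estimate of prefix correctness:

```python
def first_error_step(n_steps, prefix_ok):
    """Binary-search the first failing step in a chain of n_steps.

    prefix_ok(k) reports whether the first k steps are judged correct
    (in practice estimated via Monte-Carlo rollouts). Correctness is
    assumed monotone: once a prefix fails, every longer prefix fails.
    Returns the 1-based index of the first bad step, or None if all pass.
    """
    if prefix_ok(n_steps):
        return None  # whole chain verified correct
    lo, hi = 0, n_steps  # invariant: prefix_ok(lo) True, prefix_ok(hi) False
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_ok(mid):
            lo = mid
        else:
            hi = mid
    return hi
```

Steps before the returned index receive positive silver labels; the step at the index receives a negative label, using O(log n) verifier calls instead of n.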
Search Strategies
Include Best of N, Beam Search, Monte-Carlo Tree Search, and Majority
Voting
Scalar Value Head
We initialize PRMs from Qwen2.5 (0.5B–72B), replace the LM head with a
scalar value head, and train with binary cross-entropy on step labels y,
yielding a probability p that step x is correct and enabling real-time
credit assignment during generation.
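As a minimal sketch in plain Python (rather than the actual training stack), the per-step binary cross-entropy objective on value-head logits looks like:

```python
import math

def step_bce_loss(logits, labels):
    """Binary cross-entropy over per-step correctness labels.

    logits: scalar value-head outputs, one per reasoning step.
    labels: 1 if the step is judged correct, else 0.
    p = sigmoid(logit) is the PRM's probability the step is correct.
    """
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)
```

At inference, the same sigmoid output serves directly as the step score consumed by the search strategies.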
12. 04
Scaling Laws
13. Compute vs Accuracy Trade-off
Diminishing Returns
Accuracy jumps from 0.5B to 7B, then plateaus; 72B adds <0.5 pp
for 10× the FLOPs, signaling 7B as the optimal stopping point.
14. Training Data Diversity
Diminishing Returns
The choice and diversity of training datasets significantly impact the performance of Process
Reward Models.
15. 05
Test-Time Strategies
16. Strategies
Best Of N (BON)
Generate N candidates and select the highest-scoring solution according to a preference reward model. This
approach improves solution quality by exploring diverse reasoning paths.
Beam Search
Maintain K highest-scoring partial solutions at each step. For each path, calculate its cumulative scores. This
efficiently balances exploration and computational resource allocation.
Monte-Carlo Tree Search (MCTS)
Represent the reasoning process as a tree where nodes are states and edges are actions. Use a
selection policy, such as Upper Confidence Bound for Trees (UCT), to choose the next node. Perform
rollouts to simulate the outcome of following a particular path, and update the value estimates of
nodes based on the rollout results.
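The UCT rule balances a node's estimated value against an exploration bonus; a minimal scoring function (parameter names illustrative) might look like:

```python
import math

def uct_score(q, n_parent, n_action, c=1.41):
    """UCT: exploitation (mean action value q) plus an exploration bonus
    that grows with parent visits and shrinks as the action is revisited."""
    return q + c * math.sqrt(math.log(n_parent) / n_action)
```

Selection descends the tree by picking, at each node, the child with the highest `uct_score`.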
Majority Voting
Generate multiple candidate solutions and aggregate their final answers, selecting the most
frequently occurring one. Majority voting leverages the collective insight of multiple solutions,
enhancing the robustness of the final answer.
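A toy sketch of these selection strategies, with `reward` and `score` standing in for PRM-derived scoring functions (all names hypothetical):

```python
from collections import Counter

def best_of_n(candidates, reward):
    """Best-of-N: generate N full candidates, keep the highest-reward one."""
    return max(candidates, key=reward)

def beam_prune(partials, score, k):
    """One beam-search step: keep the K highest-scoring partial solutions."""
    return sorted(partials, key=score, reverse=True)[:k]

def majority_vote(answers):
    """Majority voting: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]
```

In a real pipeline, `reward` would aggregate the PRM's step probabilities over a full solution (e.g. the minimum or product across steps), and `score` the cumulative score of a partial path.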
17. Search Strategy Comparison in Token View
High Budget Winner
MCTS with UCT policy dominates when tokens are plentiful, delivering 1.8 pp
higher accuracy than Beam and 3.2 pp over Best-of-N on MATH-500.
18. Search Strategy Comparison in Time View
Low Budget Winner
Best-of-N offers the highest accuracy under strict latency constraints. In contrast,
MCTS dominates the performance frontier, making it the superior choice when
compute is abundant.
19. 06
Cross-Domain Generalization
20. Math-Trained PRMs on Code
Math-to-Code Generalization
PRMs trained solely on math data demonstrate superior cross-domain generalization,
frequently outperforming domain-specific code PRMs.
21. Math-Trained PRMs on Code
"Math" refers to all responses; "Math_PRM" refers to those selected by the PRM.
Specific Patterns
PRMs prefer specific patterns such as self-critique, and these patterns are more
likely to produce a correct answer.
22. 07
Key Takeaways & Future Work
23. Practical Guidelines
Right-Size Models
Stop at 14B parameters; beyond this, accuracy gains are marginal while GPU costs double.
Pick Search to Budget
Cloud GPUs → MCTS; edge CPUs → Best-of-N; avoid Majority Vote unless diversity is critical.
Prepare Diverse Training data
More diverse training data leads to more useful reasoning patterns.
24. Limitations & Next Steps
Current Limits
Experiments confined to math and code; wider domain evaluation and larger model scales await more GPU
resources.
Future Works
1. Extend ASLAF to science & logic domains, integrate PRM signals into RL fine-tuning, and build
adaptive search routers that switch strategies on the fly based on runtime compute budgets.
2. Agentic tasks are easier to verify at the process level, so PRMs may perform even better on them.
25. More Technical Content
Follow the "Meituan Tech Team" official account
LLM team job openings