Process Reward Models: Math to Code
1. Process Reward Models:
Math to Code
Zhengyu Chen, Yudong Wang, Teng Xiao, Ruochen Zhou, Xuesheng Yang,
Wei Wang, Zhifang Sui, Jingang Wang
Meituan LongCat Team
2. CONTENTS
01 Introduction & Motivation
02 Related Works
03 Training Pipeline
04 Scaling Laws
05 Test-Time Strategies
06 Cross-Domain Generalization
07 Key Takeaways & Future Work
3. 01
Introduction & Motivation
4. Why Process Reward Models Matter
Granular Step Feedback
Unlike outcome-only verifiers, PRMs score every reasoning step, catching errors early, reducing
hallucinations, and enabling iterative self-correction during generation.
Cross-Domain Promise
Proven in math, PRMs now raise the question: can
step-level supervision transfer to code, and how
should we scale training vs. test-time compute?
5. Problem Scope & Core Questions
Compute Scaling
01
How do pre-training and reward-training FLOPs trade off against downstream
accuracy on complex reasoning tasks?
02
Test-Time Search
Which strategy—Best-of-N, Beam, MCTS, Majority Vote—maximizes
correctness per token or per second at inference?
03
Domain Transfer
Does a PRM trained exclusively on math datasets rival or surpass one trained on
code data when evaluated on HumanEval+, MBPP+, and LiveCodeBench?
6. 02
Training Pipeline
7. Process Reward Model
Core Mechanism: Enhance reasoning capabilities by providing intermediate feedback.
Let’s Verify Step by Step (Lightman et al. 2023).
MathShepherd: Minimizes manual effort via automated annotation (Wang et al. 2024).
8. Test Time Scaling
Compute Efficiency: Efficient allocation of test-time compute can significantly
outperform larger models.
9. Scaling of RL
Distill the gains from test-time exploration back into the model itself.
10. 03
Training Pipeline
11. Setup
Data Collection
1.2M multi-domain reasoning chains harvested from PRM800K, Math-Shepherd,
TACO, and APPS, ensuring varied difficulty and topic coverage.
LLM Simulation
Rollouts are generated with Qwen2.5; Monte-Carlo estimation plus binary
search pinpoints the first error step, creating silver labels at scale.
Consensus Filtering
An ensemble of three LLMs retains only steps where all agree on
correctness, cutting label noise by 18% versus single-model annotation.
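Under the monotonicity assumption that a chain stays wrong once it goes wrong, the binary-search localization of the first error step can be sketched as follows; `prefix_ok` is a hypothetical oracle standing in for the Monte-Carlo rollout estimate of prefix correctness:

```python
def first_error_step(n_steps, prefix_ok):
    """Binary-search the first failing step in a chain of n_steps.

    prefix_ok(k) reports whether the first k steps are judged correct
    (in practice estimated via Monte-Carlo rollouts). Correctness is
    assumed monotone: once a prefix fails, every longer prefix fails.
    Returns the 1-based index of the first bad step, or None if all pass.
    """
    if prefix_ok(n_steps):
        return None  # whole chain verified correct
    lo, hi = 0, n_steps  # invariant: prefix_ok(lo) True, prefix_ok(hi) False
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_ok(mid):
            lo = mid
        else:
            hi = mid
    return hi
```

Steps before the returned index receive positive silver labels; the step at the index receives a negative label, using O(log n) verifier calls instead of n.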
Search Strategies
Include Best of N, Beam Search, Monte-Carlo Tree Search, and Majority
Voting
Scalar Value Head
We initialize PRMs from Qwen2.5 (0.5B–72B), replace the LM head with a
scalar value head, and train with binary cross-entropy on step labels y,
yielding a probability p that step x is correct and enabling real-time
credit assignment during generation.
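As a minimal sketch in plain Python (rather than the actual training stack), the per-step binary cross-entropy objective on value-head logits looks like:

```python
import math

def step_bce_loss(logits, labels):
    """Binary cross-entropy over per-step correctness labels.

    logits: scalar value-head outputs, one per reasoning step.
    labels: 1 if the step is judged correct, else 0.
    p = sigmoid(logit) is the PRM's probability the step is correct.
    """
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)
```

At inference, the same sigmoid output serves directly as the step score consumed by the search strategies.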
12. 04
Scaling Laws
13. Compute vs Accuracy Trade-off
Diminishing Returns
Accuracy jumps from 0.5B to 7B, then plateaus; 72B adds <0.5 pp
for 10× the FLOPs, signaling 7B as the optimal stopping point.
14. Training Data Diversity
Diminishing Returns
The choice and diversity of training datasets significantly impact the performance of Process
Reward Models.
15. 05
Test-Time Strategies
16. Strategies
Best Of N (BON)
Generate N candidates and select the highest-scoring solution according to a preference reward model. This
approach improves solution quality by exploring diverse reasoning paths.
Beam Search
Maintain K highest-scoring partial solutions at each step. For each path, calculate its cumulative scores. This
efficiently balances exploration and computational resource allocation.
Monte-Carlo Tree Search (MCTS)
Represent the reasoning process as a tree where nodes are states and edges are actions. Use a
selection policy, such as Upper Confidence Bound for Trees (UCT), to choose the next node. Perform
rollouts to simulate the outcome of following a particular path, and update the value estimates of
nodes based on the rollout results.
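The UCT rule balances a node's estimated value against an exploration bonus; a minimal scoring function (parameter names illustrative) might look like:

```python
import math

def uct_score(q, n_parent, n_action, c=1.41):
    """UCT: exploitation (mean action value q) plus an exploration bonus
    that grows with parent visits and shrinks as the action is revisited."""
    return q + c * math.sqrt(math.log(n_parent) / n_action)
```

Selection descends the tree by picking, at each node, the child with the highest `uct_score`.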
Majority Voting
Generate multiple candidate solutions and aggregate their final answers, selecting the most
frequently occurring one. Majority voting leverages the collective insight of multiple solutions,
enhancing the robustness of the final answer.
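A toy sketch of these selection strategies, with `reward` and `score` standing in for PRM-derived scoring functions (all names hypothetical):

```python
from collections import Counter

def best_of_n(candidates, reward):
    """Best-of-N: generate N full candidates, keep the highest-reward one."""
    return max(candidates, key=reward)

def beam_prune(partials, score, k):
    """One beam-search step: keep the K highest-scoring partial solutions."""
    return sorted(partials, key=score, reverse=True)[:k]

def majority_vote(answers):
    """Majority voting: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]
```

In a real pipeline, `reward` would aggregate the PRM's step probabilities over a full solution (e.g. the minimum or product across steps), and `score` the cumulative score of a partial path.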
17. Search Strategy Comparison in Token View
High Budget Winner
MCTS with UCT policy dominates when tokens are plentiful, delivering 1.8 pp
higher accuracy than Beam and 3.2 pp over Best-of-N on MATH-500.
18. Search Strategy Comparison in Time View
Low Budget Winner
Best-of-N offers the highest accuracy under strict latency constraints. In contrast,
MCTS dominates the performance frontier, making it the superior choice when
compute is abundant.
19. 06
Cross-Domain Generalization
20. Math-Trained PRMs on Code
Math-to-Code Generalization
PRMs trained solely on math data demonstrate superior cross-domain generalization,
frequently outperforming domain-specific code PRMs.
21. Math-Trained PRMs on Code
"Math" refers to all responses; "Math_PRM" refers to those selected by the PRM.
Specific Patterns
PRMs prefer specific patterns such as self-critique, and these patterns are more
likely to produce a correct answer.
22. 07
Key Takeaways & Future Work
23. Practical Guidelines
Right-Size Models
Stop at 14B parameters; beyond this, accuracy gains are marginal while GPU costs double.
Pick Search to Budget
Cloud GPUs → MCTS; edge CPUs → Best-of-N; avoid Majority Vote unless diversity is critical.
Prepare Diverse Training data
More diverse training data leads to more useful reasoning patterns.
24. Limitations & Next Steps
Current Limits
Experiments confined to math and code; wider domain evaluation and larger model scales await more GPU
resources.
Future Works
1. Extend ASLAF to science & logic domains, integrate PRM signals into RL fine-tuning, and build
adaptive search routers that switch strategies on the fly based on runtime compute budgets.
2. Agentic tasks are easier to verify at the process level, so PRMs may perform even better on them.
25. More Technical Content
Follow the "Meituan Tech Team" official account
LLM team job openings