1. ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
A Token-Level Policy Gradient Method Optimized for Tool-Use Models
Presenter: Xiaohan Wang, Meituan SA Post-training Algorithm Team
Meituan Business R&D Platform
2. Outline
1. Background and Motivation
2. Theory Analysis
3. Proposed Method
4. Experimental Analysis
5. Conclusions and Future Work
3. Background
Tool-use LLMs
• Search engine
• Calculator
• Code interpreter
• APIs
SFT or RL?
Tool-use requires:
• exploration
• decision making
• multi-step interaction
SFT: fixed trajectories
RL: interactive exploration
4. Background
Reinforcement Learning for Tool-use LLMs
LLMs interact with tool environments and learn tool-use policies through reward feedback.
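The interaction loop described on this slide can be sketched as a toy example. Everything below is invented for illustration — the environment, the `calc` tool, and the reward are placeholders, not part of the talk's actual training stack (which, as described later, is built on veRL):

```python
class ToyToolEnv:
    """Toy tool environment: the 'tool' is a calculator, and the reward
    is 1.0 when the policy emits the correct tool call. Illustrative only."""
    def __init__(self, target="2+2"):
        self.target = target

    def reset(self):
        return f"Question: what is {self.target}?"

    def step(self, action):
        # Reward feedback: did the model call the tool correctly?
        reward = 1.0 if action == f"calc({self.target})" else 0.0
        return "done", reward, True


def run_episode(policy, env, max_turns=4):
    """Minimal multi-turn loop: observe, act (emit a tool call), get reward."""
    obs, trajectory = env.reset(), []
    for _ in range(max_turns):
        action = policy(obs)                  # e.g. a generated tool call
        obs, reward, done = env.step(action)
        trajectory.append((action, reward))
        if done:
            break
    return trajectory


env = ToyToolEnv()
traj = run_episode(lambda obs: "calc(2+2)", env)
print(traj)  # [('calc(2+2)', 1.0)]
```

The collected (action, reward) trajectories are what a policy-gradient method such as GRPO would then optimize over.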
5. Background
Problem: Reward Allocation in GRPO for Tool-use LLMs
Tool-use tasks violate GRPO's assumption of uniform token credit:
• Correctness is determined by tool-call tokens
• Many reasoning tokens are irrelevant to the reward
• The reward signal is misaligned with token importance
Token-level advantage reshaping is needed!
6. Method Motivation
• Standard GRPO has coarse credit assignment
• Token entropy is linked to training stability
Design principle
• Upweight structural low-entropy tokens first
• Gradually increase reasoning-token weights later
7. Theory analysis
Trajectory-level credit assignment in standard GRPO
Variance is the key bottleneck for stable optimization
Entropy directly relates to gradient variance
Key Insight:
Structured tokens usually have lower entropy and are more directly related to rewards, while open-ended reasoning tokens have higher entropy and introduce larger gradient variance.
8. Theory analysis
Reweighted token-level policy gradient
Optimal weighting rule
Practical entropy-based surrogate: the token entropy, or the average entropy of different token regions, such as format tags, tool names, key parameters, and chain-of-thought tokens.
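The weighting formulas themselves are on the slide graphics and did not survive into this transcript. The following is one plausible shape of a region-entropy-based weighting, consistent with the description above; the softmax form and the temperature $\tau$ are assumptions, not the paper's exact equations:

```latex
% Assumed form, for illustration only.
% Let \bar{H}_r be the average token entropy of region r
% (format tags, tool names, key parameters, chain-of-thought):
w_r = \frac{\exp(-\bar{H}_r/\tau)}{\sum_{r'} \exp(-\bar{H}_{r'}/\tau)},
\qquad
\nabla_\theta J
  = \mathbb{E}\!\left[\sum_t w_{r(t)}\,\hat{A}_t\,
    \nabla_\theta \log \pi_\theta\!\left(y_t \mid y_{<t}, x\right)\right]
```

Under this form, low-entropy structural regions receive larger weights, which matches the design principle of upweighting structural tokens first.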
9. Proposed Method
Step 1: Single-turn decomposition
Decompose multi-turn tool-use trajectories into
step-level training instances, enabling each step to
receive a dedicated reward signal.
Step 2: Region-aware token partition
Partition generated responses into format tags,
tool names, parameters, and reasoning tokens for
fine-grained optimization.
Step 3: Entropy-aware policy optimization
Reweight token-level policy gradients using
region-level entropy, gradually shifting focus
from structural correctness to semantic
reasoning.
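The three steps above can be sketched in a few lines. This is a hedged sketch, not the paper's implementation: the function names, the region set, and the softmax-over-negative-entropy weighting are all illustrative assumptions.

```python
import math

# Illustrative token regions from Step 2 (region-aware partition).
REGIONS = ("format", "tool_name", "parameter", "reasoning")


def region_weights(region_entropy, tau=1.0):
    """Map average per-region entropy to gradient weights (Step 3).

    Lower-entropy regions (format tags, tool names) get larger weights
    via a softmax over negative entropy; `tau` controls sharpness.
    """
    scores = {r: math.exp(-h / tau) for r, h in region_entropy.items()}
    total = sum(scores.values())
    return {r: s / total for r, s in scores.items()}


def reweighted_advantages(token_regions, advantages, region_entropy, tau=1.0):
    """Scale each token's advantage by the weight of its region."""
    w = region_weights(region_entropy, tau)
    return [a * w[r] for r, a in zip(token_regions, advantages)]


# Example: structural low-entropy tokens are upweighted relative to
# high-entropy reasoning tokens.
entropy = {"format": 0.2, "tool_name": 0.5, "parameter": 1.0, "reasoning": 2.5}
regions = ["format", "tool_name", "parameter", "reasoning"]
print(reweighted_advantages(regions, [1.0, 1.0, 1.0, 1.0], entropy))
```

The reweighted advantages would then replace the uniform per-token advantage inside a GRPO-style update.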
10. Reward Design
Format score
A binary score that checks whether all required fields are complete and appear in the correct order.
Tool-calling correctness score
• tool name matching
•parameter name matching
•parameter value matching
Dynamic scaling reward
Step-level rewards provide dense supervision for each tool-use decision, instead of relying only on final
multi-turn outcomes.
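A minimal sketch of the two scores described above. The field names, tag strings, and the even three-way split of the tool-calling score are placeholder assumptions; the paper's exact scoring may differ.

```python
def format_score(response, required_fields=("<think>", "<tool_call>")):
    """Binary format score: all required fields present and in order.

    The tag names here are placeholders, not the paper's exact format."""
    pos = -1
    for field in required_fields:
        nxt = response.find(field)
        if nxt <= pos:          # missing, or out of order
            return 0.0
        pos = nxt
    return 1.0


def tool_call_score(pred, gold):
    """Graded correctness: tool name, then parameter names, then values.

    Assumed even split (1/3 each); value credit requires name match first."""
    if pred.get("name") != gold.get("name"):
        return 0.0
    score = 1.0 / 3
    p, g = pred.get("args", {}), gold.get("args", {})
    if set(p) == set(g):
        score += 1.0 / 3
        if all(p[k] == g[k] for k in g):
            score += 1.0 / 3
    return score


gold = {"name": "search", "args": {"query": "best noodles"}}
print(tool_call_score({"name": "search", "args": {"query": "best noodles"}}, gold))  # 1.0
```

Computed per step, such scores provide the dense, step-level supervision the slide describes.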
11. Proposed Method
Region-wise Curriculum Learning
• training is scheduled to move from format validity toward semantic reasoning
• format tokens receive less emphasis over time
• parameter and reasoning tokens receive more emphasis as training progresses
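The schedule above might look like the following sketch; the linear interpolation and the endpoint values per region are illustrative assumptions, not the paper's actual schedule.

```python
def curriculum_weight(step, total_steps, w_start, w_end):
    """Linearly interpolate a region's gradient weight over training."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return (1 - t) * w_start + t * w_end


# Format emphasis decays; parameter/reasoning emphasis grows.
# Endpoint values are made up for illustration.
schedule = {
    "format":    lambda s, n: curriculum_weight(s, n, 1.0, 0.2),
    "parameter": lambda s, n: curriculum_weight(s, n, 0.5, 1.0),
    "reasoning": lambda s, n: curriculum_weight(s, n, 0.2, 1.0),
}

print({r: f(0, 100) for r, f in schedule.items()})    # early: format dominates
print({r: f(100, 100) for r, f in schedule.items()})  # late: reasoning dominates
```

These per-region weights would multiply the entropy-based token weights from the previous slide.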
12. Experiments
Setup
To comprehensively evaluate our method, we conduct experiments on the Qwen3 family and Llama-3.2-3B-Instruct.
We implement all experiments with veRL 0.5.0 and compare against strong baselines, including SFT,
TSFT, RSFT, GRPO, SFT+GRPO, and Dr.GRPO.
For evaluation, we report results on BFCL and API-Bank.
13. Experiments
BFCL:
A comprehensive function-calling benchmark that tests single-turn, multi-turn, live execution, irrelevant tool rejection, and multi-tool usage for tool-use large language models.
Metric:
Overall accuracy and category-wise accuracy on multi-turn BFCL
ResT outperforms standard GRPO at all model scales.
14. Experiments
API-Bank
A multi-turn tool-use benchmark with 73 APIs and 3 difficulty levels, designed to evaluate tool selection and argument generation in natural dialogue settings.
15. Experiments
Ablation Takeaways
• Removing dynamic reward, CoT gradients, or curriculum learning degrades performance.
• Curriculum learning is the most critical component, with drops of up to 4.86 points.
• The full ResT achieves the best overall accuracy across all tested model scales.
16. Experiments
Motivation
SFT warm-starting changes the entropy distribution of different token regions, which may make the
original entropy-aware weighting less aligned with the model’s post-SFT state.
Key finding
A tuned curriculum is necessary after SFT warm-starting.
It achieves the best BFCL performance and shows a clear synergistic effect between SFT initialization
and curriculum alignment.
17. Experiments
• ResT achieves the best performance without extra KL loss or entropy reward, validating the effectiveness of the proposed design.
• The gain remains consistent on Qwen3-32B, showing that ResT scales beyond smaller models and standard GRPO.
18. Real-world Results
• Offline Meituan Benchmark:
30B model after SFT w/o ResT: 1.58 / 1.71
200B model after SFT w/o ResT: 1.63 / 1.84
• Online A/B test:
Stage I: Online Core Metric +5.59%
Stage II: Online Core Metric +7.79%
Stage III: Online Core Metric +5.89%
The deployed model has already reached hundreds of thousands of real-world users through the "问小团" application.
To help people eat better, live better
19. Conclusion
Reveal that tool-use rewards concentrate on structured, low-entropy tokens,
while uniform token updates lead to coarse credit assignment and higher optimization variance.
Propose ResT:
an entropy-aware token-level gradient reshaping method with curriculum learning.
Achieve state-of-the-art results on BFCL and API-Bank,
with gains of up to 8.76%, and outperform GPT-4o in several tool-use settings.
Paper:
https://openreview.net/pdf?id=gNZlaKRWki
Code:
https://github.com/1229095296/ResT_Tool_use_LLM.git
20. Q&A
21. More Technical Deep Dives
Follow "美团技术团队" (Meituan Tech Team) for more.