1. ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models
A Token-Level Policy Gradient Method Optimized for Tool-Use Models
Presenter: Xiaohan Wang, Meituan SA Post-training Algorithm Team
Meituan Business R&D Platform
2. Outline
1. Background and Motivation
2. Theory Analysis
3. Proposed Method
4. Experimental Analysis
5. Conclusions and Future Work
3. Background
Tool-use LLMs
• Search engine
• Calculator
• Code interpreter
• APIs
SFT or RL?
Tool-use requires:
• exploration
• decision making
• multi-step interaction
SFT: fixed trajectories
RL: interactive exploration
4. Background
Reinforcement Learning for Tool-use LLMs
LLMs interact with tool environments and learn tool-use policies through reward feedback.
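The interaction loop described on this slide can be sketched as a toy example. Everything below is invented for illustration — the environment, the `calc` tool, and the reward are placeholders, not part of the talk's actual training stack (which, as described later, is built on veRL):

```python
class ToyToolEnv:
    """Toy tool environment: the 'tool' is a calculator, and the reward
    is 1.0 when the policy emits the correct tool call. Illustrative only."""
    def __init__(self, target="2+2"):
        self.target = target

    def reset(self):
        return f"Question: what is {self.target}?"

    def step(self, action):
        # Reward feedback: did the model call the tool correctly?
        reward = 1.0 if action == f"calc({self.target})" else 0.0
        return "done", reward, True


def run_episode(policy, env, max_turns=4):
    """Minimal multi-turn loop: observe, act (emit a tool call), get reward."""
    obs, trajectory = env.reset(), []
    for _ in range(max_turns):
        action = policy(obs)                  # e.g. a generated tool call
        obs, reward, done = env.step(action)
        trajectory.append((action, reward))
        if done:
            break
    return trajectory


env = ToyToolEnv()
traj = run_episode(lambda obs: "calc(2+2)", env)
print(traj)  # [('calc(2+2)', 1.0)]
```

The collected (action, reward) trajectories are what a policy-gradient method such as GRPO would then optimize over.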
5. Background
Problem: Reward Allocation in GRPO for Tool-use LLMs
Tool-use tasks violate GRPO's assumption of uniform token credit:
• Correctness is determined by tool-call tokens
• Many reasoning tokens are irrelevant to the reward
• The reward signal is misaligned with token importance
Token-level advantage reshaping is needed!
6. Method Motivation
• Standard GRPO has coarse credit assignment
• Token entropy is linked to training stability
Design principle
• Upweight structural low-entropy tokens first
• Gradually increase reasoning-token weights later
7. Theory analysis
Trajectory-level credit assignment in standard GRPO
Variance is the key bottleneck for stable optimization
Entropy directly relates to gradient variance
Key Insight:
Structured tokens usually have lower entropy and are more directly related to rewards, while open-ended reasoning tokens have higher entropy and introduce larger gradient variance.
8. Theory analysis
Reweighted token-level policy gradient
Optimal weighting rule
Practical entropy-based surrogate: the token entropy, or the average entropy of different token regions, such as format tags, tool names, key parameters, and chain-of-thought tokens.
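The weighting formulas themselves are on the slide graphics and did not survive into this transcript. The following is one plausible shape of a region-entropy-based weighting, consistent with the description above; the softmax form and the temperature $\tau$ are assumptions, not the paper's exact equations:

```latex
% Assumed form, for illustration only.
% Let \bar{H}_r be the average token entropy of region r
% (format tags, tool names, key parameters, chain-of-thought):
w_r = \frac{\exp(-\bar{H}_r/\tau)}{\sum_{r'} \exp(-\bar{H}_{r'}/\tau)},
\qquad
\nabla_\theta J
  = \mathbb{E}\!\left[\sum_t w_{r(t)}\,\hat{A}_t\,
    \nabla_\theta \log \pi_\theta\!\left(y_t \mid y_{<t}, x\right)\right]
```

Under this form, low-entropy structural regions receive larger weights, which matches the design principle of upweighting structural tokens first.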
9. Proposed Method
Step 1: Single-turn decomposition
Decompose multi-turn tool-use trajectories into
step-level training instances, enabling each step to
receive a dedicated reward signal.
Step 2: Region-aware token partition
Partition generated responses into format tags,
tool names, parameters, and reasoning tokens for
fine-grained optimization.
Step 3: Entropy-aware policy optimization
Reweight token-level policy gradients using
region-level entropy, gradually shifting focus
from structural correctness to semantic
reasoning.
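The three steps above can be sketched in a few lines. This is a hedged sketch, not the paper's implementation: the function names, the region set, and the softmax-over-negative-entropy weighting are all illustrative assumptions.

```python
import math

# Illustrative token regions from Step 2 (region-aware partition).
REGIONS = ("format", "tool_name", "parameter", "reasoning")


def region_weights(region_entropy, tau=1.0):
    """Map average per-region entropy to gradient weights (Step 3).

    Lower-entropy regions (format tags, tool names) get larger weights
    via a softmax over negative entropy; `tau` controls sharpness.
    """
    scores = {r: math.exp(-h / tau) for r, h in region_entropy.items()}
    total = sum(scores.values())
    return {r: s / total for r, s in scores.items()}


def reweighted_advantages(token_regions, advantages, region_entropy, tau=1.0):
    """Scale each token's advantage by the weight of its region."""
    w = region_weights(region_entropy, tau)
    return [a * w[r] for r, a in zip(token_regions, advantages)]


# Example: structural low-entropy tokens are upweighted relative to
# high-entropy reasoning tokens.
entropy = {"format": 0.2, "tool_name": 0.5, "parameter": 1.0, "reasoning": 2.5}
regions = ["format", "tool_name", "parameter", "reasoning"]
print(reweighted_advantages(regions, [1.0, 1.0, 1.0, 1.0], entropy))
```

The reweighted advantages would then replace the uniform per-token advantage inside a GRPO-style update.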
10. Reward Design
Format score
A binary score that checks whether all required fields are complete and appear in the correct order.
Tool-calling correctness score
• tool name matching
•parameter name matching
•parameter value matching
Dynamic scaling reward
Step-level rewards provide dense supervision for each tool-use decision, instead of relying only on final
multi-turn outcomes.
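A minimal sketch of the two scores described above. The field names, tag strings, and the even three-way split of the tool-calling score are placeholder assumptions; the paper's exact scoring may differ.

```python
def format_score(response, required_fields=("<think>", "<tool_call>")):
    """Binary format score: all required fields present and in order.

    The tag names here are placeholders, not the paper's exact format."""
    pos = -1
    for field in required_fields:
        nxt = response.find(field)
        if nxt <= pos:          # missing, or out of order
            return 0.0
        pos = nxt
    return 1.0


def tool_call_score(pred, gold):
    """Graded correctness: tool name, then parameter names, then values.

    Assumed even split (1/3 each); value credit requires name match first."""
    if pred.get("name") != gold.get("name"):
        return 0.0
    score = 1.0 / 3
    p, g = pred.get("args", {}), gold.get("args", {})
    if set(p) == set(g):
        score += 1.0 / 3
        if all(p[k] == g[k] for k in g):
            score += 1.0 / 3
    return score


gold = {"name": "search", "args": {"query": "best noodles"}}
print(tool_call_score({"name": "search", "args": {"query": "best noodles"}}, gold))  # 1.0
```

Computed per step, such scores provide the dense, step-level supervision the slide describes.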
11. Proposed Method
Region-wise Curriculum Learning
• training is scheduled to move from format validity toward semantic reasoning
• format tokens receive less emphasis over time
• parameter and reasoning tokens receive more emphasis as training progresses
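The schedule above might look like the following sketch; the linear interpolation and the endpoint values per region are illustrative assumptions, not the paper's actual schedule.

```python
def curriculum_weight(step, total_steps, w_start, w_end):
    """Linearly interpolate a region's gradient weight over training."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return (1 - t) * w_start + t * w_end


# Format emphasis decays; parameter/reasoning emphasis grows.
# Endpoint values are made up for illustration.
schedule = {
    "format":    lambda s, n: curriculum_weight(s, n, 1.0, 0.2),
    "parameter": lambda s, n: curriculum_weight(s, n, 0.5, 1.0),
    "reasoning": lambda s, n: curriculum_weight(s, n, 0.2, 1.0),
}

print({r: f(0, 100) for r, f in schedule.items()})    # early: format dominates
print({r: f(100, 100) for r, f in schedule.items()})  # late: reasoning dominates
```

These per-region weights would multiply the entropy-based token weights from the previous slide.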
12. Experiments
Setup
To comprehensively evaluate our method, we conduct experiments on the Qwen3 family and Llama-3.2-3B-Instruct.
We implement all experiments with veRL 0.5.0 and compare against strong baselines, including SFT,
TSFT, RSFT, GRPO, SFT+GRPO, and Dr.GRPO.
For evaluation, we report results on BFCL and API-Bank.
13. Experiments
BFCL:
A comprehensive function-calling benchmark that tests single-turn, multi-turn, live execution, irrelevant tool rejection, and multi-tool usage for tool-use large language models.
Metric:
Overall accuracy and category-wise accuracy on multi-turn BFCL
ResT outperforms standard GRPO at all model scales.
14. Experiments
API-Bank
A multi-turn tool-use benchmark with 73 APIs and 3 difficulty levels, designed to evaluate tool selection and argument generation in natural dialogue settings.
15. Experiments
Ablation Takeaways
• Removing dynamic reward, CoT gradients, or curriculum learning degrades performance.
• Curriculum learning is the most critical component, with drops of up to 4.86 points.
• The full ResT achieves the best overall accuracy across all tested model scales.
16. Experiments
Motivation
SFT warm-starting changes the entropy distribution of different token regions, which may make the
original entropy-aware weighting less aligned with the model’s post-SFT state.
Key finding
A tuned curriculum is necessary after SFT warm-starting.
It achieves the best BFCL performance and shows a clear synergistic effect between SFT initialization
and curriculum alignment.
17. Experiments
• ResT achieves the best performance without extra KL loss or entropy reward, validating the effectiveness of the proposed design.
• The gain remains consistent on Qwen3-32B, showing that ResT scales beyond smaller models and standard GRPO.
18. Real-world Results
• Offline Meituan Benchmark:
30B model after SFT w/o ResT: 1.58 / 1.71
200B model after SFT w/o ResT: 1.63 / 1.84
• Online A/B test:
Stage I: Online Core Metric +5.59%
Stage II: Online Core Metric +7.79%
Stage III: Online Core Metric +5.89%
The deployed model has already reached hundreds of thousands of real-world users through the "问小团" application.
To help people eat better, live better
19. Conclusion
Reveal that tool-use rewards concentrate on structured, low-entropy tokens,
while uniform token updates lead to coarse credit assignment and higher optimization variance.
Propose ResT:
an entropy-aware token-level gradient reshaping method with curriculum learning.
Achieve state-of-the-art results on BFCL and API-Bank,
with gains of up to 8.76%, and outperform GPT-4o in several tool-use settings.
Paper:
https://openreview.net/pdf?id=gNZlaKRWki
Code:
https://github.com/1229095296/ResT_Tool_use_LLM.git
20. Q&A
21. More Technical Deep Dives
Follow "美团技术团队" (Meituan Tech Team) for more.