veRL for Training Coding Agent
如果无法正常显示,请先停止浏览器的去广告插件。
        
                相关话题:
                                    #AI Agent
                            
                        
                1. veRL for Training Coding Agent
姓名
张驰
verl项 发起
巫锡斌 verl核
maintainer            
                        
                2.             
                        
                3. 01
Introduction to verl            
                        
                4. verl in 2025
- Same front-end code
- Auto-backend selection            
                        
                5. Reinforcement Learning is important
Reinforcement learning (RL) at Bytedance
• Classic Alignment with Human values
• Reasoning: O1/Claude-3.7 performance
on math benchmarks
• Image/video/music generation
• Agentic LLM tool using
• Desktop operator, coding assistant,
Gaming…            
                        
                6. Introduction to Reinforcement Learning
Supervised fine-tuning
• Learning from labeled examples
• Optimize a single model
Reinforcement Learning
• Maximize the reward
• Play with multiple models in a single
system            
                        
                7. HybridFlow Programming Abstraction
Make LLM training/rollout as a service
• Classic Alignment with Human values
• Reasoning: O1/Claude-3.7 performance
on math benchmarks
• Image/video/music generation
• Agentic LLM tool using
• Desktop operator, coding assistant,
Gaming…            
                        
                8. Agents
Agent: software systems that use AI to reasoning, planning, and
memory and autonomy to make decisions, learn, and adapt.
- Tool calling: Allowing the LLM to select and use various tools
as needed.
- Memory: Enabling the agent to retain and use information
from previous steps.
- Planning: Empowering the LLM to create and follow multi-
step plans to achieve goals.
Agent RL: training LLM to make better decisions in complex,
dynamic, real world.            
                        
                9. Reinforcement Learning for Agents - AgentLoop
Rollout
AgentLoop: given a user prompt, execute user defined loop, output
multi-turn chat history as trajectory.
- Search: online web search
- MCP tools: image, video edit, …
- Code sandbox: execute code, python, java, …
- Virtual machine: operate browser, ppt, excel, …
- Android emulator: operate app
-…            
                        
                10. Reinforcement Learning for Agents - System
Highlight
- Server mode: vllm/sglang AsyncLLM engine
- Parallel running: asyncio loop run multiple prompts in
parallel
- Load balance and sticky session: better kv cache utilization            
                        
                11. Pitfalls in Agent RL
Qwen3 tokenizer
- decode([35946, 20412, 105165]) -> 我是中国
) -> [104198, 105165]
- encode(我是中国            
                        
                12. Pitfalls in Agent RL (cont’)            
                        
                13. 02
Code Agent RL Training
with verl            
                        
                14. How To Evaluate LLM Code Ability?
Real world software engineering is more
complicated than solving a few lines of code:
• Search
• Navigate
• View
• Edit
• Run
We need more comprehensive benchmark!
SWE-bench
SWE-bench, Carlos E. Jimenez, et al, 2023            
                        
                15. SWE-bench
Scrape Github PRs from 12 popular repositories:
• problem_statement: The issue title and body
• patch: The gold patch, the patch generated by
the PR
• test_patch: A test-file patch that was
contributed by the solution PR
•
…            
                        
                16. SWE-bench
SWE-bench has been widely adopted to evaluate LLM code ability.
SWE-bench Leaderboards            
                        
                17. Build Large Scale SWE-bench Infrastructure
SWE-bench Infrastructure
• SWE-agent: enable LLM to autonomously use tools to fix issues.
• SWE-Rex: runtime interface for interacting with sandboxed shell environments.
• veFaaS: Volcano Engine Function as a Service, provides image cache, sandbox
isolation, fast deployment.            
                        
                18. veFaaS
• Kata container with strong isolation
• Image warmup with nydus and P2P distribution
• Image affinity scheduling            
                        
                19. SWE-agent
Step 1~5: setup container, install tools, and initialize shell session
Step 6: setup agent with tool config yaml
Step 7~11: agent query model, parse action and execute shell command
SWE-agent architecture            
                        
                20. SWE-agent
SWE-agent implementation
Tool definition            
                        
                21. Train SWE-agent with verl: rollout
1. AgentLoop Abstraction
AgentLoop: given a user prompt, execute user defined loop,
output multi-turn chat history as trajectory.
• Search: online web search
• MCP: image, video edit, ...
• Code sandbox: execute code, python, java, ...
• Virtual machine: operate browser, ppt, excel, ...
• Android emulator: operate app
• …            
                        
                22. Train SWE-agent with verl: rollout
2. SWEAgentLoop
SWE-agent instance
Run specific task: image, commit, problem, …
Run tests with patch to get reward score
Convert messages to AgentLoopOutput            
                        
                23. Train SWE-agent with verl: rollout
3. Bridge the gap between chat completion and token-in-token-out
ChatModel
• Encode messages and tools to tokens in request
• Decode tokens and extract tool call in response
• Keep chat history in tokens and dump as trajectory            
                        
                24. Train SWE-agent with verl: training
Reinforcement Learning with Verifiable Reward(RLVR)
• reward: model generated patch pass all test cases
• algorithm: PPO or GRPO/DAPO/GSPO
Group relative advantage estimation:
We will release the training recipe soon!
DeepSeekMath, Zhihong Shao, et al, 2024            
                        
                25. Limitation and Future Work
Limitation: AgentLoop chat history append only
• How to compress context?
• How to handle extreme long-context?
Some awesome works based on verl!
microsoft/agent-lightning
BytedTsinghua-SIA/MemAgent            
                        
                26.             
                        
                27. THANKS
模型正在重新定义软件
Large Language Model Is Redefining The Software