veRL for Training Coding Agent

1. veRL for Training Coding Agent 姓名张驰 verl项发起巫锡斌 verl核 maintainer

2.

3. 01 Introduction to verl

4. verl in 2025 - Same front-end code - Auto-backend selection

5. Reinforcement Learning is important Reinforcement learning (RL) at Bytedance • Classic Alignment with Human values • Reasoning: O1/Claude-3.7 performance on math benchmarks • Image/video/music generation • Agentic LLM tool using • Desktop operator, coding assistant, Gaming…

6. Introduction to Reinforcement Learning Supervised fine-tuning • Learning from labeled examples • Optimize a single model Reinforcement Learning • Maximize the reward • Play with multiple models in a single system

7. HybridFlow Programming Abstraction Make LLM training/rollout as a service • Classic Alignment with Human values • Reasoning: O1/Claude-3.7 performance on math benchmarks • Image/video/music generation • Agentic LLM tool using • Desktop operator, coding assistant, Gaming…

8. Agents Agent: software systems that use AI to reasoning, planning, and memory and autonomy to make decisions, learn, and adapt. - Tool calling: Allowing the LLM to select and use various tools as needed. - Memory: Enabling the agent to retain and use information from previous steps. - Planning: Empowering the LLM to create and follow multi- step plans to achieve goals. Agent RL: training LLM to make better decisions in complex, dynamic, real world.

9. Reinforcement Learning for Agents - AgentLoop Rollout AgentLoop: given a user prompt, execute user defined loop, output multi-turn chat history as trajectory. - Search: online web search - MCP tools: image, video edit, … - Code sandbox: execute code, python, java, … - Virtual machine: operate browser, ppt, excel, … - Android emulator: operate app -…

10. Reinforcement Learning for Agents - System Highlight - Server mode: vllm/sglang AsyncLLM engine - Parallel running: asyncio loop run multiple prompts in parallel - Load balance and sticky session: better kv cache utilization

11. Pitfalls in Agent RL Qwen3 tokenizer - decode([35946, 20412, 105165]) -> 我是中国 ) -> [104198, 105165] - encode(我是中国

12. Pitfalls in Agent RL (cont’)

13. 02 Code Agent RL Training with verl

14. How To Evaluate LLM Code Ability? Real world software engineering is more complicated than solving a few lines of code: • Search • Navigate • View • Edit • Run We need more comprehensive benchmark! SWE-bench SWE-bench, Carlos E. Jimenez, et al, 2023

15. SWE-bench Scrape Github PRs from 12 popular repositories: • problem_statement: The issue title and body • patch: The gold patch, the patch generated by the PR • test_patch: A test-file patch that was contributed by the solution PR • …

16. SWE-bench SWE-bench has been widely adopted to evaluate LLM code ability. SWE-bench Leaderboards

17. Build Large Scale SWE-bench Infrastructure SWE-bench Infrastructure • SWE-agent: enable LLM to autonomously use tools to fix issues. • SWE-Rex: runtime interface for interacting with sandboxed shell environments. • veFaaS: Volcano Engine Function as a Service, provides image cache, sandbox isolation, fast deployment.

18. veFaaS • Kata container with strong isolation • Image warmup with nydus and P2P distribution • Image affinity scheduling

19. SWE-agent Step 1~5: setup container, install tools, and initialize shell session Step 6: setup agent with tool config yaml Step 7~11: agent query model, parse action and execute shell command SWE-agent architecture

20. SWE-agent SWE-agent implementation Tool definition

21. Train SWE-agent with verl: rollout 1. AgentLoop Abstraction AgentLoop: given a user prompt, execute user defined loop, output multi-turn chat history as trajectory. • Search: online web search • MCP: image, video edit, ... • Code sandbox: execute code, python, java, ... • Virtual machine: operate browser, ppt, excel, ... • Android emulator: operate app • …

22. Train SWE-agent with verl: rollout 2. SWEAgentLoop SWE-agent instance Run specific task: image, commit, problem, … Run tests with patch to get reward score Convert messages to AgentLoopOutput

23. Train SWE-agent with verl: rollout 3. Bridge the gap between chat completion and token-in-token-out ChatModel • Encode messages and tools to tokens in request • Decode tokens and extract tool call in response • Keep chat history in tokens and dump as trajectory

24. Train SWE-agent with verl: training Reinforcement Learning with Verifiable Reward(RLVR) • reward: model generated patch pass all test cases • algorithm: PPO or GRPO/DAPO/GSPO Group relative advantage estimation: We will release the training recipe soon! DeepSeekMath, Zhihong Shao, et al, 2024

25. Limitation and Future Work Limitation: AgentLoop chat history append only • How to compress context? • How to handle extreme long-context? Some awesome works based on verl! microsoft/agent-lightning BytedTsinghua-SIA/MemAgent

26.

27. THANKS 模型正在重新定义软件 Large Language Model Is Redefining The Software