How Kimi, Cursor, and Chroma Use RL to Train Agentic Models

I read three recent technical reports: Moonshot AI's [Kimi K2.5 paper](https://arxiv.org/html/2602.02276v1), Cursor's [Composer 2 report](https://arxiv.org/html/2603.24477v2) and [blog post](https://cursor.com/blog/real-time-rl-for-composer), and Chroma's [Context-1 write-up](https://www.trychroma.com/research/context-1). Each report introduces something distinct. Kimi K2.5 trains an **Agent Swarm**, where the model learns through RL to decompose tasks into parallel sub-agents. Cursor's Composer 2 uses **self-summarization** to handle long coding sessions and runs **real-time RL** on production traffic. Chroma's Context-1 teaches the model **self-editing context**: actively pruning retrieved documents to free up space for further search.

All three use reinforcement learning with similar methodology:

1. **Start from a strong base model.** None of them train from scratch. Moonshot extends Kimi K2 with multimodal pre-training. Cursor starts from Kimi K2.5 (1T parameters / 32B active MoE). Chroma starts from gpt-oss-20B.
2. **Train inside the production harness.** Each team runs RL rollouts through the same tools, prompts, and execution environments that its model encounters in production.
3. **Outcome-based rewards.** All three rely on verifiable outcome signals, supplemented by Generative Reward Models (GRMs) for open-ended tasks, style, and constitutions.
4. **Asynchronous, large-scale rollouts.** Each system generates many parallel trajectories per training step. Agent rollouts are expensive, so all three teams invested in infrastructure to run them at scale.

## Kimi K2.5: Agent Swarm and Parallel Agent Orchestration Through RL

**Paper:** [Kimi K2.5: Visual Agentic Intelligence](https://arxiv.org/html/2602.02276v1)

Kimi K2.5 is Moonshot AI's multimodal model with a 1T parameter / 32B active MoE architecture. Its most distinctive feature is **Agent Swarm**, a framework where the model learns to dynamically decompose tasks into parallel subtasks and dispatch them to sub-agents.
The parallelization strategy is learned through reinforcement learning.
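To make the decompose-dispatch-aggregate loop concrete, here is a minimal sketch of the control flow in Python. The paper does not publish an API, so every name here (`decompose`, `run_subagent`, `agent_swarm`) is hypothetical; the stand-in functions only mark where the learned model calls and tool use would go.

```python
import asyncio

async def run_subagent(subtask: str) -> str:
    """Stand-in for one sub-agent rollout (model call plus tool use)."""
    await asyncio.sleep(0)  # placeholder for real async model/tool calls
    return f"result for: {subtask}"

def decompose(task: str) -> list[str]:
    """Stand-in for the learned decomposition step (here: a fixed 3-way split)."""
    return [f"{task} / part {i}" for i in range(3)]

async def agent_swarm(task: str) -> str:
    subtasks = decompose(task)
    # Dispatch all sub-agents concurrently; gather preserves subtask order.
    results = await asyncio.gather(*(run_subagent(s) for s in subtasks))
    # Stand-in for the planner aggregating sub-agent outputs into one answer.
    return "\n".join(results)

if __name__ == "__main__":
    print(asyncio.run(agent_swarm("fix the failing test suite")))
```

The point of the sketch is the shape of the rollout: the decomposition is a model decision, the sub-agents run in parallel rather than sequentially, and only the aggregated outcome is scored, which is what makes the parallelization strategy itself trainable with outcome rewards.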

