I trained a Language Model to schedule events with GRPO!
Community Article Published April 29, 2025
It's 2025 and, after the DeepSeek boom, everyone wants to train their own reasoning model using GRPO.
As a practitioner at heart, I wanted to do the same: it's fascinating to make a Language Model learn from just prompts and rewards - no completions, unlike Supervised Fine-Tuning.
Most examples you can find online train models on GSM8K or the Countdown Game. I wanted to try something original and get my hands dirty.
So I thought: can I make a model create a schedule from a list of events and priorities?
My first experiments showed that ChatGPT can generally solve this type of problem, while Small Language Models (under 14B) struggle. A good challenge!
What I did not realize is that picking an original problem would force me to think about the problem setting, generate data, choose the base model, design reward functions, and run multiple rounds of training, hoping that my model would learn something.
A lot of things to learn, and that's exactly what I want to share with you in this article.
You can find all the code in the 👑 🗓️ Qwen Scheduler GRPO repository.
Follow along!
This article is mostly about my hands-on experience. Having some theoretical knowledge of GRPO may help. You can find several resources online, like the DeepSeekMath paper and the Hugging Face Reasoning Course.
Problem definition
Let's describe the problem we want our Language Model to solve.
We give the model a list of events (with start/end times) and tell it which ones are high priority....