Fine-tuning LFM2.5-1.2B-Instruct with GRPO

In this notebook, we will explore the core concepts of GRPO (Group Relative Policy Optimization) by fine-tuning LFM2.5-1.2B-Instruct using Unsloth.

GRPO is a reinforcement learning algorithm designed for training language models with reward signals instead of labeled examples. In contrast to supervised fine-tuning (SFT), where you tell the model the exact right answer, GRPO lets the model explore different outputs and reinforces the ones that score higher on the reward functions. That’s why GRPO is ideal for verifiable tasks where you can programmatically evaluate the correctness of model outputs, such as math problems, code generation, and structured data tasks.

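To make the "group relative" part concrete: GRPO samples a group of completions for each prompt, scores each one with the reward functions, and reinforces completions that score above the group average. A minimal sketch of that normalization step (illustrative only, not Unsloth's actual implementation):

```python
import statistics

def group_relative_advantages(rewards):
    """Score each completion in a group relative to the others:
    subtract the group-mean reward and divide by the group's
    standard deviation (the "group relative" part of GRPO)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-spread group
    return [(r - mean) / std for r in rewards]

# Rewards for four sampled completions of the same prompt:
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(advantages)
```

Completions with above-average reward get a positive advantage (their tokens are made more likely), below-average ones get a negative advantage, and the advantages of each group sum to zero.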

LFM2.5-1.2B-Instruct is a general-purpose instruction-tuned model. Since it is quite small, it is suitable for agentic tasks, data extraction, and RAG but less so for knowledge-intensive tasks and programming.

In this example, LFM2.5-1.2B-Instruct learns to extract structured invoice fields from noisy OCR text. The outputs are easy to verify programmatically for GRPO: we can reward the model for producing valid JSON, using the right schema, and recovering the correct values.

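Those three reward signals (valid JSON, right schema, correct values) could be sketched roughly as follows. This is a hypothetical example: the field names, point values, and function shape are assumptions for illustration, not the notebook's actual reward functions.

```python
import json

# Assumed invoice schema for this sketch
EXPECTED_FIELDS = {"invoice_number", "date", "total"}

def invoice_reward(completion: str, gold: dict) -> float:
    """Reward a completion for valid JSON, the expected schema,
    and correctly recovered field values."""
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0                      # invalid JSON earns nothing
    if not isinstance(data, dict):
        return 0.0                      # valid JSON but not an object
    reward = 1.0                        # parses as JSON
    if set(data) == EXPECTED_FIELDS:
        reward += 1.0                   # exact schema match
    # one point per correctly recovered value
    reward += sum(1.0 for k in EXPECTED_FIELDS if data.get(k) == gold.get(k))
    return reward

gold = {"invoice_number": "INV-42", "date": "2024-01-05", "total": "99.00"}
print(invoice_reward(json.dumps(gold), gold))   # fully correct output
print(invoice_reward("not json", gold))         # unparseable output
```

Because each criterion is checked programmatically, no human labels are needed during training: the model's sampled outputs are scored on the fly and GRPO pushes probability mass toward the higher-scoring ones.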

Prerequisites

To run this tutorial as a notebook, you will need:

  • GPU for fine-tuning. If you don’t have one locally, you can run this notebook for free on Google Colab using a free NVIDIA T4 GPU instance, or on Kaggle.
  • Optional: HF_TOKEN for faster downloads of the training dataset from Hugging Face.
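If you do use a token, one way to provide it when running from a local shell is as an environment variable before launching the notebook (Colab has its own secrets mechanism for this; the token value below is a placeholder):

```shell
# Optional: authenticate Hugging Face downloads with an access token.
# Replace the placeholder with your own token.
export HF_TOKEN="hf_your_token_here"
```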
