The LLM Training You Don't Know: Principles, Pipelines, and New Practices

TL;DR

After writing "The Claude Code You Don't Know" and "The AI Agents You Don't Know," I wanted to tackle a third installment. This time I pushed myself to work through how large model training actually works, and tried to write something that a non-specialist reader could follow.

Looking at 2026, what actually separates frontier models is no longer pretraining itself. The gap increasingly lives in everything after it: post-training, evaluation, reward design, agent training, and distillation. Each step shapes what users feel. When a model suddenly seems much stronger, it's usually several of these improving together, not any single factor.

The rest of this piece follows the LLM training pipeline in order, focusing on how the back half of the training stack drives the final shipped quality.

LLM Training Is an Assembly Line

For years, progress in language models was explained by stacking more parameters, data, and compute. But much of what users notice isn't from training on more base text. It comes from the entire pipeline that runs after pretraining. How a model talks, follows instructions, reasons, and uses tools doesn't grow naturally from feeding it more internet text.
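
One way to picture this pipeline (my own sketch, not the author's diagram) is as a chain of stations, each consuming the previous stage's checkpoint and handing a new one downstream; the stage names below are simply the steps listed earlier in this piece.

    from typing import Callable

    Checkpoint = dict  # stand-in for a set of model weights plus metadata

    def run_pipeline(ckpt: Checkpoint,
                     stages: list[Callable[[Checkpoint], Checkpoint]]) -> Checkpoint:
        # Each station consumes the previous checkpoint and emits a new one,
        # like stations on an assembly line.
        for stage in stages:
            ckpt = stage(ckpt)
        return ckpt

    # Illustrative stations, named after the steps mentioned above.
    def post_train(c): return {**c, "post_trained": True}
    def design_rewards(c): return {**c, "reward_tuned": True}
    def train_agent(c): return {**c, "agentic": True}
    def distill(c): return {**c, "distilled": True}

    final = run_pipeline({"pretrained": True},
                         [post_train, design_rewards, train_agent, distill])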

InstructGPT gave a clear early example: a 1.3B-parameter model that had been alignment-tuned with preference optimization beat 175B GPT-3 in human preference evals. Two orders of magnitude fewer parameters, and users liked the smaller model better. The back half of training rewrites user perception.
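
To make "preference optimization" concrete, here is a minimal sketch of the pairwise comparison objective that InstructGPT-style reward models are trained with (a Bradley-Terry loss). This is an illustrative reconstruction under my own naming, not the paper's code: the reward model learns to score the human-preferred response above the rejected one.

    import torch
    import torch.nn.functional as F

    def pairwise_preference_loss(chosen_reward: torch.Tensor,
                                 rejected_reward: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry objective: maximize the log-probability that the
        # preferred response outscores the rejected one. Both inputs are
        # (batch,) scalar rewards produced by a reward-model head.
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # Illustrative usage: random scalars stand in for reward-model outputs.
    chosen = torch.randn(8, requires_grad=True)
    rejected = torch.randn(8, requires_grad=True)
    loss = pairwise_preference_loss(chosen, rejected)
    loss.backward()  # in real training, gradients flow into the reward model

A reward model trained this way is what the subsequent RL step optimizes against, which is how a far smaller model can end up preferred over a much larger base model.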

Training is an a...
