MicroGPT explained interactively

Trying my best to visualize it. I'm a n00b at machine learning though

Andrej Karpathy wrote a 200-line Python script that trains and runs a GPT from scratch, with no libraries or dependencies, just pure Python. The script contains the algorithm that powers LLMs like ChatGPT.

Let's walk through it piece by piece and watch each part work. Andrej did a walkthrough on his blog, but here I take a more visual approach, tailored for beginners.

The dataset

The model trains on 32,000 human names, one per line: emma, olivia, ava, isabella, sophia... Each name is a document. The model's job is to learn the statistical patterns in these names and generate plausible new ones that sound like they could be real.
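
To make "each name is a document" concrete, here is a minimal sketch of turning a one-name-per-line text into a list of documents. The `raw_text` string is a hypothetical stand-in for the real file of about 32,000 names, and the variable names are mine, not necessarily the ones used in the script.

```python
# Hypothetical stand-in for the names file: one name per line.
# The real dataset has about 32,000 lines; these five come from the text above.
raw_text = "emma\nolivia\nava\nisabella\nsophia\n"

# Each non-empty line becomes one document.
docs = [line.strip() for line in raw_text.splitlines() if line.strip()]

# A few basic statistics the model will implicitly have to pick up on.
alphabet = sorted(set("".join(docs)))           # unique characters seen
average_length = sum(len(d) for d in docs) / len(docs)

print(len(docs))        # 5
print(docs[0])          # emma
print(average_length)   # 5.4
```

With the full dataset the list would simply be much longer; nothing else about the structure changes.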

By the end of training, the model produces names like "kamon", "karai", "anna", and "anton". It has learned which characters tend to follow which, which sounds are common at the start vs. the end, and how long a typical name runs. The same idea carries over to ChatGPT: from its perspective, your conversation is just a document. When you type a prompt, the model's response is a statistical completion of that document.

Numbers, not letters

Neural networks work with numbers, not characters. So we need a way to convert text into a sequence of integers and back. The simplest possible tokenizer assigns one integer to each unique character in the dataset. The 26 lowercase letters get ids 0 through 25, and we add one special token called BOS (Beginning of Sequence) with id 26 that marks where a name starts and ends.
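
Here is a minimal sketch of such a character-level tokenizer. It assumes the alphabet is exactly the 26 lowercase letters, so the ids match the description above (a = 0 through z = 25, BOS = 26). The names `stoi`, `itos`, `encode`, and `decode` are my own choices for illustration and may differ from what the script uses.

```python
import string

# Assumption: the dataset's unique characters are exactly a-z.
chars = list(string.ascii_lowercase)

# One integer per character: a=0, b=1, ..., z=25.
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

# The special BOS token takes the next free id, 26.
BOS = len(chars)

def encode(name):
    """Turn a name into token ids, wrapped in BOS on both sides."""
    return [BOS] + [stoi[ch] for ch in name] + [BOS]

def decode(ids):
    """Turn token ids back into text, dropping the BOS markers."""
    return "".join(itos[i] for i in ids if i != BOS)

tokens = encode("emma")
print(tokens)          # [26, 4, 12, 12, 0, 26]
print(decode(tokens))  # emma
```

Wrapping every name in BOS on both sides means the same token tells the model "a name starts here" and "this name is over", which is what lets it decide when to stop generating.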
