Understanding LLMs from Scratch Using Middle School Math
A self-contained, complete explanation of the inner workings of an LLM
In this article, we talk about how Large Language Models (LLMs) work, from scratch, assuming only that you know how to add and multiply two numbers. The article is meant to be fully self-contained. We start by building a simple generative AI with pen and paper, and then walk through everything we need for a firm understanding of modern LLMs and the Transformer architecture. The article strips out all the fancy language and jargon in ML and represents everything simply as what it is: numbers. We will still call out what things are called, to tether your thoughts when you read jargon-heavy content.
Going from addition/multiplication to the most advanced AI models today, without assuming other knowledge or referring to other sources, means we cover a LOT of ground. This is NOT a toy LLM explanation: a determined person could theoretically recreate a modern LLM from the information here. I have cut out every word and line that was unnecessary, and as such this article isn't really meant to be skimmed.
What will we cover?
- A simple neural network
- How are these models trained?
- How does all this generate language?
- What makes LLMs work so well?
- Embeddings
- Sub-word tokenizers
- Self-attention
- Softmax
- Residual connections
- Layer Normalization
- Dropout
- Multi-head attention
- Positional embeddings
- The GPT architecture
- The transformer architecture
Let’s dive in.
The first thing to note is that neural network...