Understanding LLMs from Scratch Using Middle School Math
A self-contained, complete explanation of the inner workings of an LLM
In this article, we talk about how Large Language Models (LLMs) work, from scratch, assuming only that you know how to add and multiply two numbers. The article is meant to be fully self-contained. We start by building a simple generative AI with pen and paper, and then walk through everything we need for a firm understanding of modern LLMs and the Transformer architecture. The article strips out all the fancy language and jargon in ML and represents everything simply as what it is: numbers. We will still call out what things are called, to tether your thoughts when you read jargon-heavy content.
Going from addition/multiplication to the most advanced AI models today, without assuming other knowledge or referring to other sources, means we cover a LOT of ground. This is NOT a toy LLM explanation: a determined person could theoretically recreate a modern LLM from the information here. I have cut out every word and line that was unnecessary, and as such this article isn't really meant to be skimmed.
What will we cover?
- A simple neural network
- How are these models trained?
- How does all this generate language?
- What makes LLMs work so well?
- Embeddings
- Sub-word tokenizers
- Self-attention
- Softmax
- Residual connections
- Layer Normalization
- Dropout
- Multi-head attention
- Positional embeddings
- The GPT architecture
- The transformer architecture
Let’s dive in.
The first thing to note is that neural network...