How LLM Inference Works, Clearly Explained

Every generate() call to an LLM runs two distinct computational phases on the same GPU:

Prefill (processing the prompt) is compute-bound, while decode (generating tokens one at a time) is memory-bound: it is limited by memory bandwidth rather than arithmetic throughput.

Most inference optimizations target one phase or the other, and diagnosing which phase is the bottleneck is the first step in making a deployment faster.
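
The compute-bound versus memory-bound split can be checked with back-of-the-envelope arithmetic. The sketch below estimates the arithmetic intensity (FLOPs per byte moved) of a single square weight matrix applied to a batch of tokens. The peak throughput and bandwidth figures are illustrative assumptions for a hypothetical accelerator, not measurements, and the model ignores attention and the KV cache.

```python
# Rough roofline-style estimate for one [hidden_dim, hidden_dim] matmul.
# All hardware numbers below are assumed, illustrative values.

def arithmetic_intensity(num_tokens, hidden_dim=4096, bytes_per_value=2):
    """FLOPs per byte for multiplying num_tokens vectors by one weight matrix."""
    flops = 2 * num_tokens * hidden_dim * hidden_dim           # multiply + add
    weight_bytes = hidden_dim * hidden_dim * bytes_per_value   # weights read once
    activation_bytes = 2 * num_tokens * hidden_dim * bytes_per_value  # read input, write output
    return flops / (weight_bytes + activation_bytes)

PEAK_FLOPS = 300e12      # assumed: 300 TFLOP/s of 16-bit compute
PEAK_BANDWIDTH = 2e12    # assumed: 2 TB/s of memory bandwidth
ridge_point = PEAK_FLOPS / PEAK_BANDWIDTH  # intensity above which compute is the limit

def bound(num_tokens):
    """Classify the matmul as compute-bound or memory-bound under the assumptions above."""
    if arithmetic_intensity(num_tokens) > ridge_point:
        return "compute-bound"
    return "memory-bound"

print(bound(2048))  # prefill-like: many prompt tokens share one read of the weights
print(bound(1))     # decode-like: one token per read of the weights
```

With one token per step, every weight is loaded to do about one FLOP per byte, so the memory system sets the pace. With thousands of prompt tokens, the same weights are reused across the whole batch and the arithmetic units become the limit.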

In this article, I'll walk through the full pipeline, from tokenized input to streamed output, and look at where the time goes in each phase.

Tokenization and embedding

Tokenizers built on algorithms like Byte Pair Encoding (BPE) convert raw text into integer IDs drawn from a fixed vocabulary. Vocabulary sizes vary by model; GPT-2, for example, uses roughly 50,000 tokens.

python

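prompt = "How does inference work?"
# `tokenizer` is assumed to be an already-loaded BPE tokenizer exposing encode()
ids = tokenizer.encode(prompt)
# ids -> [2437, 1374, 32278, 670, 30]  (exact IDs depend on the tokenizer's vocabulary)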
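
To make the BPE step less opaque, here is a minimal sketch of how learned merge rules turn characters into token IDs. The merge table and vocabulary are tiny, made-up examples; production tokenizers work on bytes, handle whitespace and special tokens, and carry tens of thousands of merges.

```python
# Toy BPE encoder: repeatedly apply the highest-priority merge that is present.
# merge_ranks and vocab are hypothetical, hand-written for illustration.

def bpe_encode(word, merge_ranks, vocab):
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Lower rank means the merge was learned earlier, so it is applied first.
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no adjacent pair can be merged any further
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return [vocab[s] for s in symbols]

merge_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
vocab = {"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "lo": 5, "low": 6, "er": 7}

toy_ids = bpe_encode("lower", merge_ranks, vocab)
print(toy_ids)  # IDs for the pieces "low" and "er"
```

Frequent character sequences collapse into single tokens, which is why common words usually cost one ID while rare words are split into several pieces.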

Each ID maps to a row in the embedding table, a learned matrix of shape [vocab_size, hidden_dim]. For a model with a hidden dimension of 4,096, each token becomes a 4,096-dimensional vector.

python

# embedding_table has shape [vocab_size, hidden_dim]
vectors = embedding_table[ids]   # shape: [num_tokens, 4096]
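
For a version of the lookup that runs on its own, the sketch below builds a small random table in plain Python and indexes it. The sizes, the `make_embedding_table` and `embed` helpers, and the random initialization are illustrative assumptions; a real model loads trained weights. It also computes how much memory a full-size table would take.

```python
import random

def make_embedding_table(vocab_size, hidden_dim, seed=0):
    """Hypothetical stand-in for a learned table: one random row per token ID."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(hidden_dim)]
            for _ in range(vocab_size)]

def embed(ids, table):
    """Embedding lookup is plain row indexing, with no arithmetic involved."""
    return [table[token_id] for token_id in ids]

toy_table = make_embedding_table(vocab_size=100, hidden_dim=8)
toy_vectors = embed([17, 3, 42], toy_table)   # shape: [3, 8]

def table_bytes(vocab_size, hidden_dim, bytes_per_value=2):
    """Memory footprint of the table, assuming 16-bit values by default."""
    return vocab_size * hidden_dim * bytes_per_value

print(len(toy_vectors), len(toy_vectors[0]))
print(table_bytes(50_000, 4096) / 1e6, "MB")
```

Because the lookup only copies rows, its cost is small next to the transformer layers that follow, even though the table itself occupies hundreds of megabytes at the sizes mentioned above.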


Position information also has to enter the model, since attention by itself is blind to token order.

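Most modern architectures use Rotary Position Embeddings (RoPE), which encode position by rotating vectors through position-dependent angles rather than adding a separate positional vector. In practice, the rotation is applied to the query and key vectors inside each attention layer.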
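
A minimal sketch of the rotation, in plain Python, may help. It pairs up adjacent dimensions and rotates each pair by an angle that grows with position, using the commonly cited base of 10,000. The interleaved pairing here is one convention (some implementations split the vector into halves instead), and the `rope` helper name is my own.

```python
import math

def rope(vector, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    dim = len(vector)
    assert dim % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # later pairs rotate more slowly
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The score depends only on the distance between positions (2 in both cases).
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
print(score_near, score_far)
```

The two scores agree up to floating-point error, which is the property that makes RoPE a relative position encoding: shifting both tokens by the same offset leaves their attention score unchanged.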

Transformer layers

The embedded sequence passes through a stack of transformer layers (typically 32 to 80+, depending on model size).

嵌...