KV Caching in LLMs, Clearly Explained

If you've used ChatGPT or Claude, you've probably noticed that the first token takes noticeably longer to appear. Then the rest stream out almost instantly.

Behind the scenes, this is a deliberate engineering decision called KV caching, whose purpose is to make LLM inference faster.

Before we get into the technical details, here's a side-by-side comparison of LLM inference with and without KV caching:

[Video: side-by-side comparison of LLM inference with and without KV caching, 0:47]

Now let's understand how it works, from first principles.

Part 1: How LLMs generate tokens

The transformer processes all input tokens and produces a hidden state for each one. Those hidden states get projected into vocabulary space, producing logits (one score per word in the vocabulary).

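As a minimal sketch of that projection step (toy dimensions and random weights are my assumptions, not the article's; a real model would produce the hidden states by running attention and MLP layers), projecting hidden states into vocabulary space is just a matrix multiply:

```python
import numpy as np

# Toy dimensions (illustrative assumptions).
seq_len, d_model, vocab_size = 4, 8, 50

rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((seq_len, d_model))  # one hidden state per token
W_unembed = rng.standard_normal((d_model, vocab_size))   # output ("unembedding") projection

# Project every hidden state into vocabulary space: one score per vocabulary word.
logits = hidden_states @ W_unembed
print(logits.shape)  # one row of vocab_size scores per token
```

Each row of `logits` is a full distribution over the vocabulary, even though (as the next step shows) only the last row is used for generation.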
But only the logits from the last token matter. You sample from them, get the next token, append it to the input, and repeat.

This is the key insight: to generate the next token, you only need the hidden state of the most recent token. Every other hidden state is an intermediate byproduct.

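To make the generation loop concrete, here is a toy sketch of cache-free decoding. The `forward` stand-in, the random embedding tables, and the greedy argmax step are all illustrative assumptions, not the article's actual model; the point is that every step re-runs the model over the whole growing sequence, even though only the last token's logits are used:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8
embed = rng.standard_normal((vocab_size, d_model))
W_unembed = rng.standard_normal((d_model, vocab_size))

def forward(token_ids):
    """Stand-in for a transformer: one hidden state per input token.
    A real model would run attention + MLP layers over ALL tokens here."""
    return embed[token_ids]

tokens = [3, 17, 42]  # the prompt
for _ in range(5):
    hidden = forward(tokens)             # recomputed for the whole sequence, every step
    logits = hidden[-1] @ W_unembed      # only the last token's logits matter
    next_token = int(np.argmax(logits))  # greedy decoding for simplicity
    tokens.append(next_token)            # append and repeat
print(tokens)
```

Notice that `forward` does work proportional to the full sequence length at every step; KV caching exists to avoid exactly this redundancy.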
Part 2: What Attention actually computes

Inside each transformer layer, every token gets three vectors: a query (Q), a key (K), and a value (V). Attention multiplies queries against keys for scores, then uses those scores to weight the values.

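A minimal NumPy sketch of this computation, with toy dimensions of my choosing and a causal mask added (an assumption: the text has not introduced masking yet, but decoder-only LLMs attend only to earlier tokens):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.standard_normal((seq_len, d_k))  # one query per token
K = rng.standard_normal((seq_len, d_k))  # one key per token
V = rng.standard_normal((seq_len, d_k))  # one value per token

# Multiply queries against keys to get attention scores.
scores = Q @ K.T / np.sqrt(d_k)          # shape (seq_len, seq_len)

# Causal mask: token i may only attend to tokens 0..i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax over keys, then use the weights to mix the values.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
output = weights @ V                     # shape (seq_len, d_k)
print(output.shape)
```

Row `i` of `weights` says how much token `i` attends to each earlier token; row `i` of `output` is the resulting weighted blend of values.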
Now focus on just the last token.

The last row of QK^T uses:

  • The query vector of the last token
  • All key vectors in the sequence (one per token so far)
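This asymmetry is what KV caching exploits: a new decode step needs only the new token's query, plus the keys and values of every token so far, and those keys and values never change, so they can be cached instead of recomputed. Here is a toy single-head sketch (random weights and made-up dimensions are my assumptions, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
Wq = rng.standard_normal((d_model, d_model))
Wk = rng.standard_normal((d_model, d_model))
Wv = rng.standard_normal((d_model, d_model))

k_cache, v_cache = [], []  # grows by one entry per decoded token

def attend_with_cache(x_new):
    """One decode step: project ONLY the new token, reuse cached K/V."""
    q = x_new @ Wq
    k_cache.append(x_new @ Wk)  # cache this token's key...
    v_cache.append(x_new @ Wv)  # ...and value, for all future steps
    K = np.stack(k_cache)
    V = np.stack(v_cache)
    scores = q @ K.T / np.sqrt(d_model)  # new query vs ALL cached keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # softmax over cached keys
    return w @ V                         # weighted sum of cached values

for _ in range(5):
    out = attend_with_cache(rng.standard_normal(d_model))
print(len(k_cache), out.shape)
```

Per step, the work drops from O(n) projections to a single projection plus one row of attention, which is why cached decoding streams tokens so much faster than the uncached first step.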