How LLM Inference Works, Clearly Explained.
Every generate() call to an LLM runs two distinct computational phases on the same GPU:
- prefill (processing the prompt) is compute-bound,
- while decode (generating tokens one at a time) is memory-bound.
Most inference optimizations target one phase or the other, and diagnosing which phase is the bottleneck is the first step in making a deployment faster.
In this article, I'll walk through the full pipeline, from tokenized input to streamed output, and look at where the time goes in each phase.
Tokenization and embedding
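Tokenizers built on algorithms like Byte Pair Encoding (BPE) convert raw text into integer IDs from a fixed vocabulary. Vocabulary sizes range from tens of thousands to a few hundred thousand tokens (GPT-2 uses roughly 50,000).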
python
prompt = "How does inference work?"
ids = tokenizer.encode(prompt)
# ids -> e.g. [2437, 1374, 32278, 670, 30]; exact IDs depend on the tokenizer
Each ID maps to a row in the embedding table, a learned matrix of shape [vocab_size, hidden_dim]. For a model with a hidden dimension of 4,096, each token becomes a 4,096-dimensional vector.
python
# embedding_table has shape [vocab_size, hidden_dim]
vectors = embedding_table[ids] # shape: [num_tokens, 4096]
Position information also has to be injected, because attention by itself is order-agnostic.
Most modern architectures use Rotary Position Embeddings (RoPE), which encode position by rotating the query and key vectors inside each attention layer rather than adding a separate positional vector to the embeddings.
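To make the rotation idea concrete, here is a minimal NumPy sketch of RoPE applied to a query and a key. The function name `rope`, the "rotate-half" feature pairing, and the base of 10000 are assumptions chosen for illustration; real implementations differ in layout details. The check at the end shows the property that makes RoPE useful: the attention score depends only on the relative offset between two positions.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate pairs of features of x by position-dependent angles.

    x: [num_tokens, head_dim] with an even head_dim; positions: [num_tokens].
    """
    half = x.shape[-1] // 2
    # One rotation frequency per feature pair
    freqs = base ** (-np.arange(half) / half)
    angles = positions[:, None] * freqs[None, :]   # [num_tokens, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]              # feature i is paired with i + half
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

# Same relative offset (3) at two different absolute positions
score_a = rope(q, np.array([5])) @ rope(k, np.array([2])).T
score_b = rope(q, np.array([105])) @ rope(k, np.array([102])).T
print(np.allclose(score_a, score_b))  # True
```

Because each pair is only rotated, vector norms are unchanged, so position is encoded without altering the magnitude of the query or key.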
Transformer layers
The embedded sequence passes through a stack of transformer layers (typically 32 to 80+, depending on model size).
Each layer applies two operations in sequence:
- Self-attention computes three projections per token (query Q, key K, value V) via learned weight matrices.
Each token's query is scored against the keys of the tokens it can attend to (in a decoder-only model, itself and every earlier token), and those scores (after scaling and softmax) determine how much of each token's value gets mixed in.
python
# Project each token's hidden vector x into queries, keys, and values
Q, K, V = x @ Wq, x @ Wk, x @ Wv
# scores[i, j]: how strongly token i attends to token j (causal mask omitted for brevity)
scores = (Q @ K.T) / sqrt(d_k)
weights = softmax(scores)  # one row per token, sums to 1
attn_output = weights @ V
- Feed-forward network (FFN) processes each token's vector independently through a two-layer MLP. Attention moves information between positions. The FFN transforms it.
After the final layer, the model projects the last token's hidden state back to vocabulary size ([hidden_dim, vocab_size]), applies softmax, and samples from the resulting distribution to produce the first output token.
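The final projection and sampling step can be sketched in a few lines of NumPy. This is an illustrative sketch, not any particular model's sampler: `sample_next_token`, `W_vocab`, and the `temperature` parameter are names assumed here, and the tiny dimensions are placeholders.

```python
import numpy as np

def sample_next_token(last_hidden, W_vocab, temperature=1.0, rng=None):
    """Project a hidden state to vocabulary logits and sample one token ID."""
    if rng is None:
        rng = np.random.default_rng()
    logits = last_hidden @ W_vocab              # shape: [vocab_size]
    if temperature == 0.0:
        return int(np.argmax(logits))           # greedy decoding
    scaled = logits / temperature
    scaled = scaled - scaled.max()              # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
hidden_dim, vocab_size = 16, 100                # toy sizes for illustration
last_hidden = rng.normal(size=hidden_dim)
W_vocab = rng.normal(size=(hidden_dim, vocab_size))

token_id = sample_next_token(last_hidden, W_vocab, temperature=0.8, rng=rng)
greedy_id = sample_next_token(last_hidden, W_vocab, temperature=0.0)
print(token_id, greedy_id)
```

Lower temperatures concentrate probability on the top logits; a temperature of 0 is treated here as plain argmax.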
Prefill: the compute-bound phase
Processing the input prompt is the first phase. All tokens are processed in parallel: Q, K, and V are computed for every token simultaneously, and attention runs as a large matrix-matrix multiplication.
This is compute-bound work. The GPU's arithmetic throughput is the bottleneck, and utilization is high. The metric that captures this phase is Time to First Token (TTFT), the latency before the first output token appears.
During prefill, the model also populates the KV cache: the K and V tensors for every layer get stored in GPU memory for reuse.
python
# Prefill: process the whole prompt in one shot
hidden = embed(prompt_tokens) + positions
for layer in model.layers:
    Q, K, V = project(hidden)              # for ALL tokens at once
    hidden = attention(Q, K, V) + hidden
    hidden = feedforward(hidden) + hidden
    cache_kv(layer, K, V)                  # save for later
first_token = sample(project_to_vocab(hidden[-1]))
Decode: the memory-bound phase
Once the first token is generated, the model switches to generating one token at a time. For each new token, it only computes Q, K, and V for that single token. The K and V from all previous tokens are already in the cache.
python
# Decode: one token per iteration
token = first_token
yield token                                        # stream the token produced by prefill
steps = 0
while token != STOP and steps < MAX_STEPS:
    # prompt_len: number of prompt tokens, so positions continue after the prompt
    x = embed(token) + position(prompt_len + steps)
    for layer in model.layers:
        q, k, v = project(x)
        K_all, V_all = caches[layer].append(k, v)  # cached history + new
        x = layer.forward(q, K_all, V_all, x)      # attention + FFN, residuals
    token = sample(project_to_vocab(x))
    steps += 1
    yield token
The arithmetic per step is tiny (one query vector against the cached key matrix instead of a full matrix-matrix multiply). But the GPU still loads every weight matrix and the entire cached K/V from memory for that small computation. The bottleneck flips from compute to memory bandwidth.
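A back-of-the-envelope way to see the flip is to compare arithmetic intensity (FLOPs per byte moved) for a single weight matmul in each phase. The sketch below assumes FP16 values and a hypothetical accelerator with 300 TFLOP/s of compute and 2 TB/s of memory bandwidth; those hardware numbers are placeholders, not a specific GPU's spec sheet.

```python
def arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte moved for one [num_tokens, d_in] @ [d_in, d_out] matmul."""
    flops = 2 * num_tokens * d_in * d_out
    bytes_moved = bytes_per_value * (
        d_in * d_out              # weight matrix
        + num_tokens * d_in       # input activations
        + num_tokens * d_out      # output activations
    )
    return flops / bytes_moved

# Hypothetical accelerator: FLOPs per byte needed to keep the math units busy
ridge_point = 300e12 / 2e12

prefill = arithmetic_intensity(num_tokens=2048, d_in=4096, d_out=4096)
decode = arithmetic_intensity(num_tokens=1, d_in=4096, d_out=4096)

print(f"ridge point: {ridge_point:.0f} FLOPs/byte")
print(f"prefill:     {prefill:.0f} FLOPs/byte, compute-bound: {prefill > ridge_point}")
print(f"decode:      {decode:.2f} FLOPs/byte, compute-bound: {decode > ridge_point}")
```

With a 2,048-token prompt, each weight byte is reused across thousands of tokens; with a single decode token, each weight byte is read for roughly one FLOP, far below what the hardware needs to stay busy.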
The metric for this phase is Inter-Token Latency (ITL): the time between consecutive output tokens. Low ITL is what makes a model feel responsive.
The KV cache
Without caching, generating a 1,000-token response would require re-projecting K and V for the entire growing sequence at every step, so the total projection work grows quadratically with the number of generated tokens instead of linearly.
The KV cache stores each layer's K and V tensors once and appends new entries incrementally.
[Video: LLM inference speed with vs. without KV caching]
The speedup is roughly 5x or more for long generations.
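The following toy sketch shows why caching is safe and where the savings come from. It uses a single attention head with random weights and random vectors standing in for token hidden states (all assumptions for illustration), and compares recomputing K and V for the whole prefix at each step against appending to a cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(q, K, V):
    """Attention output for one query vector over all rows of K and V."""
    scores = (K @ q) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

tokens = rng.normal(size=(6, d))   # stand-ins for 6 token hidden vectors

# Without a cache: re-project K and V for the whole prefix at every step
no_cache_out, rows_projected_no_cache = [], 0
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    K, V = prefix @ Wk, prefix @ Wv
    rows_projected_no_cache += t
    no_cache_out.append(attend(prefix[-1] @ Wq, K, V))

# With a cache: project only the new token and append it
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
cache_out, rows_projected_cache = [], 0
for x in tokens:
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    rows_projected_cache += 1
    cache_out.append(attend(x @ Wq, K_cache, V_cache))

print(np.allclose(no_cache_out, cache_out))            # True
print(rows_projected_no_cache, rows_projected_cache)   # 21 6
```

The outputs match, while the number of K/V rows projected drops from 1 + 2 + ... + n to n. In a real multi-layer model the same argument holds layer by layer, because causal attention never changes the hidden states of earlier tokens.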
The cost is that the cache grows linearly with sequence length and is stored separately for every layer. For a 13B-parameter model at 16-bit precision, the cache consumes roughly 1 MB per token, so a 4K-token context can take close to 4 GB of VRAM for the cache alone.
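These figures can be sanity-checked with a short calculation. The sketch below assumes a 13B-class shape of 40 layers and 40 attention heads of dimension 128 with a 16-bit cache; those dimensions are an assumption for illustration, and the function name is hypothetical. The result lands slightly under the rounded numbers quoted above.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2 for K and V, stored at every layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed 13B-class shape: 40 layers, 40 heads of dimension 128, 16-bit cache
per_token = kv_cache_bytes_per_token(num_layers=40, num_kv_heads=40, head_dim=128)
context = 4096

print(f"{per_token / 2**20:.2f} MiB per token")
print(f"{per_token * context / 2**30:.2f} GiB for a {context}-token context")

# Same shape if K/V were shared across groups of 5 query heads (8 KV heads)
gqa = kv_cache_bytes_per_token(num_layers=40, num_kv_heads=8, head_dim=128)
print(f"8 KV heads instead of 40: {gqa / per_token:.0%} of the original cache")
```

The same function shows why reducing the number of K/V heads or the bytes per value shrinks the cache proportionally.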
This is why long contexts get expensive. The cache competes directly with batch size for GPU memory, i.e., more cache per request means fewer concurrent requests per GPU.
Standard mitigations include quantizing the cache to INT8 or INT4, sliding window attention (dropping tokens outside a fixed window), grouped-query attention (GQA, sharing K/V across attention heads to reduce the number of cached tensors), and PagedAttention (the memory management trick behind vLLM that pages the cache like an OS pages virtual memory, eliminating fragmentation).
Redesigning attention around the cache
Quantization and paging treat the KV cache as a fixed cost to manage. DeepSeek's V4 series (released April 2026) takes a different approach: redesign attention so the cache is structurally smaller from the start.
V4 uses a hybrid of two compressed attention mechanisms.
Compressed Sparse Attention (CSA) compresses KV entries by 4x using softmax-gated pooling, then applies sparse attention over the compressed tokens.
Heavily Compressed Attention (HCA) is more aggressive. It consolidates KV entries across 128 tokens into a single compressed entry and applies dense attention over those representations.
At a 1M-token context, V4-Pro requires 27% of the single-token inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2.
In absolute terms, that's 9.62 GiB of KV cache per sequence at 1M context in bf16, compared to an estimated 83.9 GiB for a V3.2-style architecture. With fp4/fp8 quantization on top, the cache shrinks by another 2x.
The KV cache has become the constraint that the field is optimizing the model architecture around.
Quantization
Training uses FP32 or BF16 for gradient stability. Inference doesn't need that precision. The memory savings from reducing bit width are linear:
- 7B parameters at FP32: 28 GB
- 7B parameters at FP16/BF16: 14 GB
- 7B parameters at INT8: 7 GB
- 7B parameters at INT4: 3.5 GB
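The numbers in this list follow directly from parameters times bits per parameter. A minimal sketch of that arithmetic (weights only; it ignores quantization scales, activations, and the KV cache, and the helper name is made up for illustration):

```python
BITS_PER_PARAM = {"FP32": 32, "FP16/BF16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

for fmt, bits in BITS_PER_PARAM.items():
    print(f"7B parameters at {fmt}: {weight_memory_gb(7e9, bits):g} GB")
```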
INT4 is why 7B models run on laptop GPUs with 4-6 GB of VRAM. Methods like GPTQ and AWQ use per-channel or per-group scaling factors to minimize quality degradation from the lossy compression.
Done well, INT4 lands within 1-2 percentage points of the full-precision model on standard benchmarks.
Because decode is memory-bound, going from FP16 to INT8 halves the weight bytes read per token and can cut decode latency by up to roughly half with negligible quality loss, making quantization the single highest-leverage optimization for most deployments.
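To show what a per-channel scaling factor does, here is a minimal sketch of symmetric round-to-nearest INT8 quantization in NumPy. This is deliberately the simplest possible scheme and is not GPTQ or AWQ, which add calibration data and error compensation on top; the function names are assumptions for illustration.

```python
import numpy as np

def quantize_per_channel_int8(W):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)      # guard against all-zero rows
    W_q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)
    return W_q, scales

def dequantize(W_q, scales):
    return W_q.astype(np.float32) * scales

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 64)).astype(np.float32)
W[0] *= 50.0                     # one channel with a much larger dynamic range

W_q, scales = quantize_per_channel_int8(W)
W_hat = dequantize(W_q, scales)

max_err = np.abs(W - W_hat).max(axis=1)
print(W_q.nbytes, "bytes quantized vs", W.nbytes, "bytes in FP32")
print("per-row max error:", max_err)
print("per-row scale:    ", scales[:, 0])
```

Because each row has its own scale, the wide-range row does not force a coarse step size onto the other rows; each row's rounding error stays within about half of its own scale.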
Serving infrastructure
Modern inference servers wrap the prefill-decode loop with several optimizations:
- Continuous batching interleaves tokens from multiple requests on the same GPU step, keeping utilization high even during memory-bound decode phases.
- Speculative decoding uses a small draft model to propose multiple tokens, then the large model verifies them in a single forward pass. When the draft model's acceptance rate is high, this effectively converts multiple sequential decode steps into one parallel verification.
- PagedAttention (vLLM) manages KV cache memory in fixed-size blocks, eliminating fragmentation and enabling more concurrent requests per GPU.
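As an illustration of the draft-then-verify idea from the list above, here is a toy sketch of greedy speculative decoding. The two "models" are hypothetical stand-in functions, and the verification loop only emulates what a real server does in one batched forward pass of the large model. Production systems typically use a probabilistic acceptance rule rather than this exact-match greedy variant.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding; returns the tokens accepted."""
    # 1) The draft model proposes k tokens autoregressively (cheap)
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks every proposed position. A real server does
    #    this in ONE batched forward pass; this loop only emulates it.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)    # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))    # everything matched: one bonus token
    return accepted

# Toy stand-ins for real models: the next token depends only on the last one
def target_next(ctx):
    return (ctx[-1] * 3 + 1) % 17

def draft_next(ctx):
    # Agrees with the target except when the last token is divisible by 5
    return target_next(ctx) if ctx[-1] % 5 else 0

def generate(prefix, n, step):
    out = list(prefix)
    while len(out) < len(prefix) + n:
        out.extend(step(out))
    return out[: len(prefix) + n]

baseline = generate([1], 12, lambda ctx: [target_next(ctx)])
speculative = generate([1], 12, lambda ctx: speculative_step(ctx, draft_next, target_next))
print(baseline == speculative)  # True
```

The output is identical to decoding with the target model alone; the benefit is that each round can emit up to k + 1 tokens for one target-model pass.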
Frameworks like vLLM, TensorRT-LLM, and Text Generation Inference (TGI) combine these techniques. A single GPU can serve dozens of concurrent users because decode leaves most of the arithmetic capacity idle, and continuous batching fills that idle capacity with other requests.
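A toy scheduler makes the batching effect easier to see. The sketch below counts decode steps only (it ignores prefill and memory limits), and the request lengths and function names are made up for illustration. Static batching holds a batch until its longest request finishes; continuous batching refills a slot as soon as any request ends.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])   # batch runs until its longest request ends
    return steps

def continuous_batching_steps(lengths, batch_size):
    waiting, running, steps = deque(lengths), [], 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())     # fill any free slots
        running = [r - 1 for r in running]         # one decode step for every active request
        running = [r for r in running if r > 0]    # finished requests free their slot
        steps += 1
    return steps

lengths = [2, 9, 3, 8, 2, 2, 7, 1]   # output tokens still needed per request
print(static_batching_steps(lengths, batch_size=4))      # 16
print(continuous_batching_steps(lengths, batch_size=4))  # 11
```

Each step costs roughly the same on memory-bound hardware regardless of how many slots are filled, so finishing the same work in fewer steps translates directly into higher throughput.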
The full inference path
- Tokenize: Text becomes integer IDs via BPE.
- Embed: IDs become vectors. RoPE encodes position.
- Prefill: All input tokens are processed in parallel through every layer. Compute-bound. KV cache populated. First token emitted.
- Decode loop: One token per step: project Q for the new token, attend over cached K/V, run FFN, sample. Append new K/V to cache. Memory-bound.
- Detokenize: Token IDs mapped back to text and streamed.
Some practical implications
- Long prompts are expensive in TTFT (prefill), long outputs are expensive in total decode time (ITL multiplied by the number of output tokens), and the two stress different hardware resources.
- Context length isn't free because it bloats the KV cache and directly reduces batch capacity.
- Compute utilization during decode can drop to 30% even on a fully loaded server, because the bottleneck is memory bandwidth, not arithmetic.
- The fix isn't more compute; it's faster memory, a smaller cache, or better batching.
When someone tells you their model is slow, the first diagnostic is whether it's slow to start (prefill-bound, optimize TTFT) or slow to stream (decode-bound, optimize ITL).
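One way to run that diagnostic is to timestamp tokens as they arrive from a streaming endpoint. The sketch below is a minimal, client-agnostic helper: `measure_stream` accepts any iterator of tokens, and `fake_stream` is a hypothetical stand-in for a real streaming client, with sleep times chosen arbitrarily.

```python
import time

def measure_stream(token_iter, clock=time.perf_counter):
    """Return (ttft_seconds, mean_itl_seconds, num_tokens) for a token stream."""
    start = clock()
    arrivals = []
    for _ in token_iter:
        arrivals.append(clock())          # timestamp each token as it arrives
    if not arrivals:
        return None, None, 0
    ttft = arrivals[0] - start
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else None
    return ttft, mean_itl, len(arrivals)

def fake_stream(prefill_s=0.05, itl_s=0.01, n=5):
    # Hypothetical stand-in for a streaming client: slow first token, fast rest
    time.sleep(prefill_s)
    for i in range(n):
        if i:
            time.sleep(itl_s)
        yield i

ttft, itl, n = measure_stream(fake_stream())
print(f"TTFT {ttft * 1000:.0f} ms, mean ITL {itl * 1000:.1f} ms over {n} tokens")
```

A high TTFT with a low ITL points at prefill; a low TTFT with a high ITL points at decode.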
👉 Over to you: are you running into TTFT or ITL bottlenecks in your deployments, and what's worked for you?