Optimization and Practice of KV-Cache-Centric Efficient Long-Context Methods
1. Speaker: Huiqiang Jiang (姜慧强)
2. Research SDE at Microsoft Research Asia (Shanghai)
▪ System-Algorithm Co-design
▪ Efficient methods to accelerate inference/training
3. 01 Applications and Inference Challenges of Long-Context LLMs
02 Mainstream Inference Optimization Methods and Techniques
03 A KV-Cache-Centric LLM Inference Architecture
04 KV-Cache-Centric Efficient Long-Context Methods
05 Summary and Outlook
4.
5. 01 Applications and Inference Challenges of Long-Context LLMs
6. • Massive Pages of Docs • Extended Meeting Time • Lengthy Codebases
• Complex Reasoning • Endless Agentic History • Lifelong Personalization
7. • Almost all latest models can process contexts exceeding 100K tokens.
10M tokens
≈ PyTorch repository code
≈ Lord of the Rings trilogy (1ps)
≈ 500 reasoning iterations *
https://lifearchitect.ai/models/#context-windows
* DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
8. ❑ Long Prefilling Latency: ~30 minutes to process 1M tokens on a single A100 for an 8B LLM.
❑ Large GPU Memory Consumption: ~62 GB of GPU memory is required for the KV cache of 512K tokens in fp16 (a back-of-the-envelope check follows below).
Long Prefilling Latency => MInference
Large GPU Memory Consumption => RetrievalAttention
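To make the ~62 GB figure concrete, below is a minimal back-of-the-envelope sketch in Python. The model configuration (32 layers, 8 KV heads under GQA, head_dim 128) is an assumed LLaMA-3-8B-style setup for illustration, not a number taken from the slides.

# Rough KV cache size estimate; the config below is an assumption
# (LLaMA-3-8B-style: 32 layers, 8 KV heads under GQA, head_dim 128).
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                 # fp16
n_tokens = 512_000                 # 512K-token context

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # keys + values
total_gib = bytes_per_token * n_tokens / 1024**3
print(f"{bytes_per_token // 1024} KiB per token, {total_gib:.1f} GiB for 512K tokens")
# -> 128 KiB per token, ~62.5 GiB, consistent with the ~62 GB above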
9. Figure. A KV-cache-centric view of the long-context inference pipeline (Prompts → Prefill → token generation in Decode), spanning KV cache compute, compression/storage, and retrieval:
• LLMLingua series: prompt compression;
• MInference 1.0 / MMInference: dynamic sparse prefilling;
• Prefix caching;
• RetrievalAttention: sparse attention over the stored keys & values, with alignment between ANNS and attention;
• KV cache compression and storage;
• SCBench: exploring the bounds of KV caching.
10. 02 Mainstream Inference Optimization Methods and Techniques
11.
12. Prefix caching is widely used both in LLM frameworks and in LLM APIs, under names such as RadixAttention, Automatic Prefix Caching, Prompt Caching, and Context Caching. (A minimal sketch of block-level prefix caching follows below.)
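To illustrate the mechanism, here is a minimal, hypothetical sketch of block-level prefix caching: the KV cache of each token block is keyed by a hash of the entire prefix up to that block, so requests that share a prompt prefix can reuse the cached blocks. This is a simplification, not the actual RadixAttention or vLLM implementation.

import hashlib

BLOCK = 16          # tokens per cached block (illustrative)
kv_store = {}       # block hash -> cached K/V tensors for that block (omitted here)

def block_keys(token_ids):
    """Hash each block together with its full prefix, so a block is only
    reusable when everything before it matches as well."""
    keys, prefix = [], []
    for i in range(0, len(token_ids) // BLOCK * BLOCK, BLOCK):
        prefix += token_ids[i:i + BLOCK]
        keys.append(hashlib.sha1(repr(prefix).encode()).hexdigest())
    return keys

def cached_prefix_len(token_ids):
    """Number of leading tokens whose KV cache can be reused."""
    n = 0
    for key in block_keys(token_ids):
        if key not in kv_store:
            break
        n += BLOCK
    return n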
13. 03 A KV-Cache-Centric LLM Inference Architecture
14.
15. ❑ Long-context methods are designed and utilized around the KV
cache, but existing benchmarks focus only on single-request
scenarios, ignoring its full lifecycle in real-world use.
Figure. (a) Long context is shared in real-world scenarios: repo-level code debugging, long-document QA, multi-turn dialogue, self-play reasoning. (b) Prefix caching is widely used in LLM frameworks: RadixAttention, Automatic Prefix Caching, Context Caching. (c) Prefix caching is widely used in LLM APIs: Prompt Caching.
16. ❑ Two typical shared-context modes;
❑ Four categories of long-context capability;
17. ❑ 12 subtasks;
18. ❑ 13 long-context methods;
19. ❑ last window query region + A-shape;
20. ❑ Sub-O(n) Memory is Almost Infeasible in Multi-Turn Decoding.
❑ Long-generation scenarios exhibit distribution shift issues.
21.
22. 04 KV-Cache-Centric Efficient Long-Context Methods
23. (a) Attention is sparse. (b) The sparsity of attention is dynamic in prefilling. (c) The sparsity is also dynamic in decoding.
• Figure. Dynamic sparsity in attention: (a) the top-k (k=4096) columns cover a significant portion of attention scores in a 128K context; (b) fewer scores are recovered when reusing top-k indices across examples, highlighting the dynamic nature of the sparsity. Visualizations use LLaMA-3-8B on a single A100.
Figure. (c) Dynamic sparsity of LLaMA-3-8B decoding on KV retrieval (100K tokens): dynamically selecting the top-1000 tokens achieves 89% recovery, while static selection drops to 71%.
24. ❑ After pretraining, attention exhibits various sparse patterns, including A-shape,
Vertical-Slash, and block-sparse patterns.
❑ These sparse patterns are fixed for each head across different inputs.
❑ The specific sparse elements (e.g., column index, slash index) dynamically
change depending on the context.
Figure. Components of the sparse patterns: local windows, sink tokens, important tokens, RoPE / n-gram slashes, retrieval tokens. (A mask sketch follows below.)
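As a concrete illustration of the Vertical-Slash pattern, the sketch below materializes a dense boolean mask from a set of column (vertical) indices and diagonal (slash) offsets. The chosen indices and the dense-mask form are purely illustrative; real kernels such as MInference's operate on sparse indices directly.

import numpy as np

def vertical_slash_mask(seq_len, v_idx, s_idx, local_window=64):
    """Dense boolean illustration of a Vertical-Slash head:
    every query attends the columns in v_idx (vertical lines),
    the keys at offsets in s_idx (slash lines), and a local window."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q
    vertical = np.isin(k, v_idx)          # sink / important columns
    slash = np.isin(q - k, s_idx)         # diagonals, e.g. from RoPE periodicity
    local = (q - k) < local_window        # local window
    return causal & (vertical | slash | local)

mask = vertical_slash_mask(1024, v_idx=[0, 1, 2, 3], s_idx=[0, 128, 256])
print(f"computed fraction of the attention map: {mask.mean():.3f}")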
25. • MInference utilizes the inherent dynamic sparsity found in LLMs alongside an optimized GPU kernel design to reduce TTFT (time to first token).
01 Pattern Search (Step 1, offline): kernel-aware sparse pattern search;
02 Estimation (Step 2, online): online estimation of the sparsity indices;
03 Acceleration (Step 2, online): dynamic sparse attention computation with PIT and FlashAttention.
(A sketch of the online estimation step follows below.)
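Below is a hedged sketch of what the online estimation step could look like for a Vertical-Slash head: attention is approximated using only the last few queries, and the strongest columns and diagonals are kept as the sparse indices. Function and parameter names (last_q, n_vertical, n_slash) are illustrative, not MInference's exact API.

import torch

def estimate_vs_indices(q, k, last_q=64, n_vertical=1000, n_slash=200):
    """Estimate Vertical-Slash indices from the last `last_q` queries only.
    q, k: (seq_len, head_dim) for a single head."""
    seq_len, d = k.shape
    qpos = torch.arange(seq_len - last_q, seq_len)[:, None]
    kpos = torch.arange(seq_len)[None, :]
    logits = q[-last_q:] @ k.T / d ** 0.5
    logits = logits.masked_fill(kpos > qpos, float("-inf"))   # causal mask
    scores = torch.softmax(logits, dim=-1)                    # (last_q, seq_len)

    vertical = scores.sum(0).topk(min(n_vertical, seq_len)).indices   # strongest columns

    # Accumulate scores along each diagonal offset (query_pos - key_pos)
    offsets = (qpos - kpos).clamp(min=0).flatten()
    diag_score = torch.zeros(seq_len).index_add_(0, offsets, scores.flatten())
    slash = diag_score.topk(min(n_slash, seq_len)).indices            # strongest slashes
    return vertical, slash

# The sparse attention kernel would then compute only the selected
# columns, diagonals, and the local window, instead of the full map.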
26. Latency at 1M tokens on a single A100: FlashAttention ~30 minutes vs. MInference ~3 minutes (10x speedup).
GPUs needed for sub-20s latency at 1M tokens: FlashAttention 60+ A100s vs. MInference 8 A100s (8x reduction).
27. • Evaluation: 1. NIAH; 2. RULER (avg. tokens: 4K-128K); 3. latency benchmark.
28. ❑ Local tokens in the temporal and spatial dimensions are evenly distributed within the attention map.
❑ Although the stride and starting position vary with the context, the horizontal and vertical lines are evenly spaced and often symmetrical. (A grid-mask sketch follows below.)
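A minimal sketch of the resulting Grid pattern is shown below: evenly spaced vertical and horizontal lines whose stride and phase depend on the input (the values used here are hypothetical).

import numpy as np

def grid_mask(seq_len, stride, phase, local_window=64):
    """Illustrative Grid head: keep queries/keys on an evenly spaced grid
    (stride and phase vary with the context), plus a local window."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    on_grid = ((k - phase) % stride == 0) | ((q - phase) % stride == 0)
    return (k <= q) & (on_grid | (q - k < local_window))

print(f"computed fraction: {grid_mask(2048, stride=256, phase=17).mean():.3f}")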
29. ❑ 1) Intra-modality consistency; 2) Modality-separated continuity.
30. MMInference
31. MMInference
32. MMInference: Grid Head in Multi-Modality
33. MMInference: Q-Boundary pattern
34. MMInference: 2D-Boundary pattern
35.
36.
37.
38.
39. ❑ The VS pattern shifts to a Grid pattern when the input transitions
from text to visual.
40. SeerAttention: learned gating; FlexPrefill: top-p; SampleAttention: top-p + column.
41.
42. DiTFastAttn
Sparse VideoGen
STA
AdaSpa
SpargeAttn
43. • ANNS: Approximate Nearest Neighbor Search
• Specifically, using the inner product as the similarity metric, this is also known as Maximum Inner Product Search (MIPS).
Attn(q, K, V) = Softmax(qKᵀ / √d) V
The score qKᵀ is an inner product between the query and the keys, and the softmax acts as a smooth one-hot argmax, so retrieving the keys with the largest inner products approximates full attention (a sketch follows below).
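The sketch below shows this connection concretely: attention is approximated by retrieving only the keys with the largest inner products against the query and renormalizing the softmax over that subset. Exact top-k search stands in for an ANNS index here, and the shapes and parameters are illustrative.

import torch

def mips_sparse_attention(q, K, V, k_top=1000):
    """Approximate Attn(q, K, V) using only the top-k_top keys by inner
    product (a stand-in for an ANNS/MIPS index). q: (d,), K, V: (n, d)."""
    d = K.shape[1]
    scores = K @ q / d ** 0.5                        # qK^T / sqrt(d)
    top = scores.topk(min(k_top, K.shape[0])).indices
    w = torch.softmax(scores[top], dim=-1)           # softmax over retrieved keys only
    return w @ V[top]

q, K, V = torch.randn(128), torch.randn(100_000, 128), torch.randn(100_000, 128)
approx = mips_sparse_attention(q, K, V)
exact = torch.softmax(K @ q / 128 ** 0.5, dim=-1) @ V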
44. • Evaluation with RULER and ∞-Bench
45. ❑ Queries and keys have different distributions in attention. Off-the-shelf ANNS indexes perform poorly on Q → K searches, while they work well for K → K searches.
• About 30-50% of the key vectors must be scanned to maintain acceptable performance.
(a) ANNS index performance. (b) Different distributions.
46. GPU requirement reduction via GPU-CPU co-execution: lower-end GPU + CPU ≈ high-end GPU.
✓ Reduce memory access and data transfer with an OOD-aware ANNS index;
✓ Enable 128K inference on an RTX 4090 at 5 tokens/second.
Figure. Hot KV tokens stay on the GPU for partial attention, while most of the KV cache is offloaded to the CPU and indexed by an ANNS index; for each query vector, the nearest KV tokens are dynamically retrieved for a second partial attention, and the two partial results are combined into the final attention output (a merge sketch follows below).
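As a sketch of how the GPU-side and CPU-side partial attentions can be combined exactly, the snippet below keeps each partial result in unnormalized form (running max, softmax denominator, weighted value sum) and merges them log-sum-exp style, similar in spirit to FlashAttention's online softmax. This illustrates the combine step only, not RetrievalAttention's actual implementation.

import torch

def partial_attn(q, K, V):
    """Unnormalized attention over a KV subset: returns (max logit,
    softmax denominator, weighted value sum) so partials can be merged."""
    s = K @ q / K.shape[1] ** 0.5
    m = s.max()
    e = torch.exp(s - m)
    return m, e.sum(), e @ V

def combine(p1, p2):
    """Merge two partial attention results (e.g. GPU 'hot' tokens and
    CPU-retrieved tokens) into the exact full-attention output."""
    m1, l1, o1 = p1
    m2, l2, o2 = p2
    m = torch.maximum(m1, m2)
    l = l1 * torch.exp(m1 - m) + l2 * torch.exp(m2 - m)
    o = o1 * torch.exp(m1 - m) + o2 * torch.exp(m2 - m)
    return o / l

q = torch.randn(128)
K, V = torch.randn(4096, 128), torch.randn(4096, 128)
hot, cold = slice(0, 512), slice(512, 4096)
merged = combine(partial_attn(q, K[hot], V[hot]), partial_attn(q, K[cold], V[cold]))
full = torch.softmax(K @ q / 128 ** 0.5, dim=-1) @ V
assert torch.allclose(merged, full, atol=1e-4)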
47. • A single NVIDIA RTX 4090 (24 GB) can handle 128K tokens for an 8B-parameter LLM, generating a token every 0.188 seconds.
Figure. 40 GB A100 vs. 24 GB RTX 4090.
48.
49.
50. 05 Summary and Outlook
51. SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Project Page | Code
Long-context methods are designed and utilized around the KV cache, but existing benchmarks focus only on single-request scenarios, ignoring its full lifecycle in real-world use.
We propose SCBench, a KV-cache-centric benchmark for analyzing long-context methods, covering KV cache generation, compression, retrieval, and loading. It includes four capability tasks and two shared-context modes, from which we derive the following insights:
➢ Sub-O(n) memory is almost infeasible in multi-turn decoding;
➢ Task performance shows varying decline trends;
➢ All long-context methods experience performance degradation as the compression rate decreases;
➢ Long-generation scenarios exhibit distribution shift issues.
Figure. (b) Two shared-context modes. (c) Overview of SCBench.
52. MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Long context enables powerful vision and multi-modal applications, but prefill cost remains a major bottleneck.
We propose MMInference, a modality-aware, permutation-based dynamic sparse attention method for multi-modal inputs. It introduces Grid-Shape and Q-/2D-Boundary sparse attention, achieving up to 8.3x speedup without sacrificing performance. MMInference addresses two key challenges:
➢ Vision-specific inductive bias → handled via Grid-Shape attention;
➢ Modality boundaries in mixed inputs → addressed by Q-/2D-Boundary attention.
Figure. (a) Grid pattern. (b) Permuted Grid pattern. (c) Q-Boundary pattern. (d) 2D-Boundary pattern.
53. RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
We build the KV cache as a vector storage system in a CPU-GPU co-execution setup (Fig. 1) to accelerate long-context LLM inference without loss of model accuracy (Fig. 3). The core of RetroInfer is as follows:
➢ An Attention-aWare VEctor index, the wave index (Fig. 2), which adopts tripartite attention approximation and accuracy-bound attention estimation to fit the dynamic sparsity of attention;
➢ A wave buffer, which coordinates KV cache placement and overlaps computation and data transfer across the GPU and CPU to sustain high throughput.
Powered by the wave index and wave buffer, the decoding throughput of RetroInfer outperforms baselines by 4.5x-10.5x across different context lengths, while it is the only solution that matches full-attention accuracy.
Fig. 1. Architecture of RetroInfer. Fig. 2. Attention-aware design of the wave index. Fig. 3. Model accuracy and decoding throughput comparison.
54. Achievements
• Achieves a 10x TTFT speedup for 1M-token inference while maintaining full-attention performance;
• MInference accepted by NeurIPS'24 as a Spotlight; SCBench and MMInference accepted by ICLR'25 and ICML'25;
• RetrievalAttention received the Best Paper Award at ENLSP-IV @ NeurIPS'24;
• Adopted by vLLM and SGLang, two widely used production-level LLM inference frameworks;
• Used by the official services of open-source long-context models, such as Qwen-Turbo-1M and Seed.
https://github.com/sgl-project/sglang/pull/5327
https://github.com/vllm-project/flash-attention/pull/33
Qwen2.5-1M Technical Report
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
55. Future Directions
• Efficient Long-Context Extension Methods;
• Efficient Long-Generation Methods;
• Efficient RL Training;
• Native Efficient Architectures;
56.
57. Explore the limits of AI applications