Optimization and Practice of KV-Cache-Centric Efficient Long-Context Methods
1. Speaker: Huiqiang Jiang (姜慧强)
2. Research SDE at Microsoft Research Asia (Shanghai)
▪ System-Algorithm Co-design
▪ Efficient methods to accelerate inference/training
3. 01 Applications and Inference Challenges of Long-Context LLMs
02 Mainstream Inference Optimization Methods and Techniques
03 A KV-Cache-Centric LLM Inference Architecture
04 KV-Cache-Centric Efficient Long-Context Methods
05 Summary and Outlook
4.
5. 01 Applications and Inference Challenges of Long-Context LLMs
6. • Massive Pages of Docs • Extended Meeting Time • Lengthy Codebases
• Complex Reasoning • Endless Agentic History • Lifelong Personalization
7. • Almost all latest models can process contexts exceeding 100K tokens.
10M tokens
≈ PyTorch repository code
≈ Lord of the Rings trilogy (1ps)
≈ 500 reasoning iterations *
https://lifearchitect.ai/models/#context-windows
* DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
8. ❑ Long Prefilling Latency: ~30 minutes to process 1M tokens on a single A100 for an 8B LLM.
❑ Large GPU Memory Consumption: ~62 GB of GPU memory is required for the KV cache of 512K tokens in fp16 (a back-of-the-envelope check follows below).
Long Prefilling Latency => MInference
Large GPU Memory Consumption => RetrievalAttention
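To make the ~62 GB figure concrete, below is a minimal back-of-the-envelope sketch in Python. The model configuration (32 layers, 8 KV heads under GQA, head_dim 128) is an assumed LLaMA-3-8B-style setup for illustration, not a number taken from the slides.

# Rough KV cache size estimate; the config below is an assumption
# (LLaMA-3-8B-style: 32 layers, 8 KV heads under GQA, head_dim 128).
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                 # fp16
n_tokens = 512_000                 # 512K-token context

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # keys + values
total_gib = bytes_per_token * n_tokens / 1024**3
print(f"{bytes_per_token // 1024} KiB per token, {total_gib:.1f} GiB for 512K tokens")
# -> 128 KiB per token, ~62.5 GiB, consistent with the ~62 GB above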
9. Figure. A KV-cache-centric view of the long-context inference pipeline (Prompts → Prefill → token generation in Decode), spanning KV cache compute, compression/storage, and retrieval:
• LLMLingua series: prompt compression;
• MInference 1.0 / MMInference: dynamic sparse prefilling;
• Prefix caching;
• RetrievalAttention: sparse attention over the stored keys & values, with alignment between ANNS and attention;
• KV cache compression and storage;
• SCBench: exploring the bounds of KV caching.
10. 02 Mainstream Inference Optimization Methods and Techniques
11.
12. Prefix caching is widely used both in LLM frameworks and in LLM APIs, under names such as RadixAttention, Automatic Prefix Caching, Prompt Caching, and Context Caching. (A minimal sketch of block-level prefix caching follows below.)
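To illustrate the mechanism, here is a minimal, hypothetical sketch of block-level prefix caching: the KV cache of each token block is keyed by a hash of the entire prefix up to that block, so requests that share a prompt prefix can reuse the cached blocks. This is a simplification, not the actual RadixAttention or vLLM implementation.

import hashlib

BLOCK = 16          # tokens per cached block (illustrative)
kv_store = {}       # block hash -> cached K/V tensors for that block (omitted here)

def block_keys(token_ids):
    """Hash each block together with its full prefix, so a block is only
    reusable when everything before it matches as well."""
    keys, prefix = [], []
    for i in range(0, len(token_ids) // BLOCK * BLOCK, BLOCK):
        prefix += token_ids[i:i + BLOCK]
        keys.append(hashlib.sha1(repr(prefix).encode()).hexdigest())
    return keys

def cached_prefix_len(token_ids):
    """Number of leading tokens whose KV cache can be reused."""
    n = 0
    for key in block_keys(token_ids):
        if key not in kv_store:
            break
        n += BLOCK
    return n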
13. 03 A KV-Cache-Centric LLM Inference Architecture
14.
15. ❑ Long-context methods are designed and utilized around the KV
cache, but existing benchmarks focus only on single-request
scenarios, ignoring its full lifecycle in real-world use.
Figure. (a) Long context is shared in real-world scenarios: repo-level code debugging, long-document QA, multi-turn dialogue, self-play reasoning. (b) Prefix caching is widely used in LLM frameworks: RadixAttention, Automatic Prefix Caching, Context Caching. (c) Prefix caching is widely used in LLM APIs: Prompt Caching.
16. ❑ Two typical shared-context modes;
❑ Four categories of long-context capability;
17. ❑ 12 subtasks;
18. ❑ 13 long-context methods;
19. ❑ last window query region + A-shape;
20. ❑ Sub-O(n) Memory is Almost Infeasible in Multi-Turn Decoding.
❑ Long-generation scenarios exhibit distribution shift issues.
21.
22. 04 KV-Cache-Centric Efficient Long-Context Methods
23. (a) Attention is sparse. (b) The sparsity of attention is dynamic in prefilling. (c) The sparsity is also dynamic in decoding.
• Figure. Dynamic sparsity in attention: (a) the top-k (k=4096) columns cover a significant portion of attention scores in a 128K context; (b) fewer scores are recovered when reusing top-k indices across examples, highlighting the dynamic nature of the sparsity. Visualizations use LLaMA-3-8B on a single A100.
Figure. (c) Dynamic sparsity of LLaMA-3-8B decoding on KV retrieval (100K tokens): dynamically selecting the top-1000 tokens achieves 89% recovery, while static selection drops to 71%.
24. ❑ After pretraining, attention exhibits various sparse patterns, including A-shape,
Vertical-Slash, and block-sparse patterns.
❑ These sparse patterns are fixed for each head across different inputs.
❑ The specific sparse elements (e.g., column index, slash index) dynamically
change depending on the context.
Figure. Components of the sparse patterns: local windows, sink tokens, important tokens, RoPE / n-gram slashes, retrieval tokens. (A mask sketch follows below.)
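As a concrete illustration of the Vertical-Slash pattern, the sketch below materializes a dense boolean mask from a set of column (vertical) indices and diagonal (slash) offsets. The chosen indices and the dense-mask form are purely illustrative; real kernels such as MInference's operate on sparse indices directly.

import numpy as np

def vertical_slash_mask(seq_len, v_idx, s_idx, local_window=64):
    """Dense boolean illustration of a Vertical-Slash head:
    every query attends the columns in v_idx (vertical lines),
    the keys at offsets in s_idx (slash lines), and a local window."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q
    vertical = np.isin(k, v_idx)          # sink / important columns
    slash = np.isin(q - k, s_idx)         # diagonals, e.g. from RoPE periodicity
    local = (q - k) < local_window        # local window
    return causal & (vertical | slash | local)

mask = vertical_slash_mask(1024, v_idx=[0, 1, 2, 3], s_idx=[0, 128, 256])
print(f"computed fraction of the attention map: {mask.mean():.3f}")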
25. • MInference utilizes the inherent dynamic sparsity found in LLMs alongside an optimized GPU kernel design to reduce TTFT (time to first token).
01 Pattern Search (Step 1, offline): kernel-aware sparse pattern search;
02 Estimation (Step 2, online): online estimation of the sparsity indices;
03 Acceleration (Step 2, online): dynamic sparse attention computation with PIT and FlashAttention.
(A sketch of the online estimation step follows below.)
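Below is a hedged sketch of what the online estimation step could look like for a Vertical-Slash head: attention is approximated using only the last few queries, and the strongest columns and diagonals are kept as the sparse indices. Function and parameter names (last_q, n_vertical, n_slash) are illustrative, not MInference's exact API.

import torch

def estimate_vs_indices(q, k, last_q=64, n_vertical=1000, n_slash=200):
    """Estimate Vertical-Slash indices from the last `last_q` queries only.
    q, k: (seq_len, head_dim) for a single head."""
    seq_len, d = k.shape
    qpos = torch.arange(seq_len - last_q, seq_len)[:, None]
    kpos = torch.arange(seq_len)[None, :]
    logits = q[-last_q:] @ k.T / d ** 0.5
    logits = logits.masked_fill(kpos > qpos, float("-inf"))   # causal mask
    scores = torch.softmax(logits, dim=-1)                    # (last_q, seq_len)

    vertical = scores.sum(0).topk(min(n_vertical, seq_len)).indices   # strongest columns

    # Accumulate scores along each diagonal offset (query_pos - key_pos)
    offsets = (qpos - kpos).clamp(min=0).flatten()
    diag_score = torch.zeros(seq_len).index_add_(0, offsets, scores.flatten())
    slash = diag_score.topk(min(n_slash, seq_len)).indices            # strongest slashes
    return vertical, slash

# The sparse attention kernel would then compute only the selected
# columns, diagonals, and the local window, instead of the full map.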
26. Latency at 1M tokens on a single A100: FlashAttention ~30 minutes vs. MInference ~3 minutes (10x speedup).
GPUs needed for sub-20s latency at 1M tokens: FlashAttention 60+ A100s vs. MInference 8 A100s (8x reduction).
27. • Evaluation: 1. NIAH; 2. RULER (avg. tokens: 4K-128K); 3. latency benchmark.
28. ❑ Local tokens in the temporal and spatial dimensions are evenly distributed within the attention map.
❑ Although the stride and starting position vary with the context, the horizontal and vertical lines are evenly spaced and often symmetrical. (A grid-mask sketch follows below.)
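A minimal sketch of the resulting Grid pattern is shown below: evenly spaced vertical and horizontal lines whose stride and phase depend on the input (the values used here are hypothetical).

import numpy as np

def grid_mask(seq_len, stride, phase, local_window=64):
    """Illustrative Grid head: keep queries/keys on an evenly spaced grid
    (stride and phase vary with the context), plus a local window."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    on_grid = ((k - phase) % stride == 0) | ((q - phase) % stride == 0)
    return (k <= q) & (on_grid | (q - k < local_window))

print(f"computed fraction: {grid_mask(2048, stride=256, phase=17).mean():.3f}")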
29. ❑ 1) Intra-modality consistency; 2) Modality-separated continuity.
30. MMInference
31. MMInference
32. MMInference: Grid Head in Multi-Modality
33. MMInference: Q-Boundary pattern
34. MMInference: 2D-Boundary pattern
35.
36.
37.
38.
39. ❑ The VS pattern shifts to a Grid pattern when the input transitions
from text to visual.
40. SeerAttention: learned gating; FlexPrefill: top-p; SampleAttention: top-p + column.
41.
42. DiTFastAttn
Sparse VideoGen
STA
AdaSpa
SpargeAttn
43. • ANNS: Approximate Nearest Neighbor Search
• Specifically, using the inner product as the similarity metric, this is also known as Maximum Inner Product Search (MIPS).
Attn(q, K, V) = Softmax(qKᵀ / √d) V
The score qKᵀ is an inner product between the query and the keys, and the softmax acts as a smooth one-hot argmax, so retrieving the keys with the largest inner products approximates full attention (a sketch follows below).
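The sketch below shows this connection concretely: attention is approximated by retrieving only the keys with the largest inner products against the query and renormalizing the softmax over that subset. Exact top-k search stands in for an ANNS index here, and the shapes and parameters are illustrative.

import torch

def mips_sparse_attention(q, K, V, k_top=1000):
    """Approximate Attn(q, K, V) using only the top-k_top keys by inner
    product (a stand-in for an ANNS/MIPS index). q: (d,), K, V: (n, d)."""
    d = K.shape[1]
    scores = K @ q / d ** 0.5                        # qK^T / sqrt(d)
    top = scores.topk(min(k_top, K.shape[0])).indices
    w = torch.softmax(scores[top], dim=-1)           # softmax over retrieved keys only
    return w @ V[top]

q, K, V = torch.randn(128), torch.randn(100_000, 128), torch.randn(100_000, 128)
approx = mips_sparse_attention(q, K, V)
exact = torch.softmax(K @ q / 128 ** 0.5, dim=-1) @ V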
44. • Evaluation with RULER and ∞-Bench
45. ❑ Queries and keys have different distributions in attention. Off-the-shelf ANNS indexes perform poorly on Q → K searches, while they work well for K → K searches.
• About 30-50% of the key vectors must be scanned to maintain acceptable performance.
(a) ANNS index performance. (b) Different distributions.
46. GPU requirement reduction via GPU-CPU co-execution: lower-end GPU + CPU ≈ high-end GPU.
✓ Reduce memory access and data transfer with an OOD-aware ANNS index;
✓ Enable 128K inference on an RTX 4090 at 5 tokens/second.
Figure. Hot KV tokens stay on the GPU for partial attention, while most of the KV cache is offloaded to the CPU and indexed by an ANNS index; for each query vector, the nearest KV tokens are dynamically retrieved for a second partial attention, and the two partial results are combined into the final attention output (a merge sketch follows below).
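As a sketch of how the GPU-side and CPU-side partial attentions can be combined exactly, the snippet below keeps each partial result in unnormalized form (running max, softmax denominator, weighted value sum) and merges them log-sum-exp style, similar in spirit to FlashAttention's online softmax. This illustrates the combine step only, not RetrievalAttention's actual implementation.

import torch

def partial_attn(q, K, V):
    """Unnormalized attention over a KV subset: returns (max logit,
    softmax denominator, weighted value sum) so partials can be merged."""
    s = K @ q / K.shape[1] ** 0.5
    m = s.max()
    e = torch.exp(s - m)
    return m, e.sum(), e @ V

def combine(p1, p2):
    """Merge two partial attention results (e.g. GPU 'hot' tokens and
    CPU-retrieved tokens) into the exact full-attention output."""
    m1, l1, o1 = p1
    m2, l2, o2 = p2
    m = torch.maximum(m1, m2)
    l = l1 * torch.exp(m1 - m) + l2 * torch.exp(m2 - m)
    o = o1 * torch.exp(m1 - m) + o2 * torch.exp(m2 - m)
    return o / l

q = torch.randn(128)
K, V = torch.randn(4096, 128), torch.randn(4096, 128)
hot, cold = slice(0, 512), slice(512, 4096)
merged = combine(partial_attn(q, K[hot], V[hot]), partial_attn(q, K[cold], V[cold]))
full = torch.softmax(K @ q / 128 ** 0.5, dim=-1) @ V
assert torch.allclose(merged, full, atol=1e-4)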
47. • A single NVIDIA RTX 4090 (24 GB) can handle 128K tokens for an 8B-parameter LLM, generating a token every 0.188 seconds.
Figure. 40 GB A100 vs. 24 GB RTX 4090.
48.
49.
50. 05 Summary and Outlook
51. SCBench: A KV Cache-Centric Analysis of Long-Context Methods
Project Page | Code
Long-context methods are designed and utilized around the KV cache, but existing benchmarks focus only on single-request scenarios, ignoring its full lifecycle in real-world use.
We propose SCBench, a KV-cache-centric benchmark for analyzing long-context methods, covering KV cache generation, compression, retrieval, and loading. It includes four capability tasks and two shared-context modes, from which we derive the following insights:
➢ Sub-O(n) memory is almost infeasible in multi-turn decoding;
➢ Task performance shows varying decline trends;
➢ All long-context methods experience performance degradation as the compression rate decreases;
➢ Long-generation scenarios exhibit distribution shift issues.
Figure. (b) Two shared-context modes. (c) Overview of SCBench.
52. MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Long context enables powerful vision and multi-modal applications, but prefill cost remains a major bottleneck.
We propose MMInference, a modality-aware, permutation-based dynamic sparse attention method for multi-modal inputs. It introduces Grid-Shape and Q-/2D-Boundary sparse attention, achieving up to 8.3x speedup without sacrificing performance. MMInference addresses two key challenges:
➢ Vision-specific inductive bias → handled via Grid-Shape attention;
➢ Modality boundaries in mixed inputs → addressed by Q-/2D-Boundary attention.
Figure. (a) Grid pattern. (b) Permuted Grid pattern. (c) Q-Boundary pattern. (d) 2D-Boundary pattern.
53. RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
We build the KV cache as a vector storage system in a CPU-GPU co-execution setup (Fig. 1) to accelerate long-context LLM inference without loss of model accuracy (Fig. 3). The core of RetroInfer is as follows:
➢ An Attention-aWare VEctor index, the wave index (Fig. 2), which adopts tripartite attention approximation and accuracy-bound attention estimation to fit the dynamic sparsity of attention;
➢ A wave buffer, which coordinates KV cache placement and overlaps computation and data transfer across the GPU and CPU to sustain high throughput.
Powered by the wave index and wave buffer, the decoding throughput of RetroInfer outperforms baselines by 4.5x-10.5x across different context lengths, while it is the only solution that matches full-attention accuracy.
Fig. 1. Architecture of RetroInfer. Fig. 2. Attention-aware design of the wave index. Fig. 3. Model accuracy and decoding throughput comparison.
54. Achievements
• Achieves a 10x TTFT speedup for 1M-token inference while maintaining full-attention performance;
• MInference accepted by NeurIPS'24 as a Spotlight; SCBench and MMInference accepted by ICLR'25 and ICML'25;
• RetrievalAttention received the Best Paper Award at ENLSP-IV @ NeurIPS'24;
• Adopted by vLLM and SGLang, two widely used production-level LLM inference frameworks;
• Used by the official services of open-source long-context models, such as Qwen-Turbo-1M and Seed.
https://github.com/sgl-project/sglang/pull/5327
https://github.com/vllm-project/flash-attention/pull/33
Qwen2.5-1M Technical Report
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
55. Future Directions
• Efficient Long-Context Extension Methods;
• Efficient Long-Generation Methods;
• Efficient RL Training;
• Native Efficient Architectures;
56.
57. Explore the limits of AI applications