SGLang: An Efficient Open-Source Framework for Large-Scale LLM Serving
1. Speaker: Liangsheng Yin
2. 01 SGLang Milestones and Features Overview
02 Speculative Decoding and Constrained Decoding in SGLang
03 Efficient Design and Implementation of PD Disaggregation
04 Large-scale EP Support for DeepSeek Blog Reproduction
05 Hierarchical Caching Design in SGLang
06 The Ecosystem of SGLang
4.
• SGLang is a fast serving engine for LLMs and VLMs.
• Among fully open-source LLM inference engines, SGLang currently achieves state-of-the-art (SOTA) performance, and it is the first open-source implementation to nearly match the throughput reported in the official DeepSeek blog at large scale.
• Meanwhile, its elegant, lightweight, and customizable design has attracted wide adoption from academics, big tech companies, and startups (xAI, NVIDIA, AMD, Baseten, Microsoft, LinkedIn, etc.).
• In on-policy RLHF, inference engines are crucial for efficient policy model execution, and SGLang excels as a
high-performance solution.
5. 01
SGLang Milestones
and Features Overview
6. SGLang Milestones and Features
• 2023/12-2024/02: Initial Motivation, Structured LM Programming, Prefix Caching, and Constrained Decoding
• 2024/07: Leading Performance among inference engines on Llama3
• 2024/09: v0.3 Release, 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision
• 2024/12: v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware DP Router, X-Grammar Integration, The First
to Serve DeepSeek V3.
• 2025/01: SGLang provides day-one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. (10+ companies!)
• 2025/05: First open-source implementation of DeepSeek V3/R1 expert parallelism with prefill-decode disaggregation. Achieves 52.3K in-tok/s and 22.3K out-tok/s on 96 GPUs, 5x faster than vanilla TP.
• SGLang has seen extensive adoption and serves as the dominant inference engine for AMD and the default inference
engine for xAI.
7. RadixAttention handles complex reuse patterns
RadixAttention enables efficient prefix matching, insertion, and eviction. It handles trees with hundreds of thousands of tokens. A minimal sketch of the underlying idea follows.
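The sketch below shows the core data structure only: cached prefixes live in a radix tree keyed by token IDs, so a new request reuses the longest matching cached prefix. Names and structure are illustrative, not SGLang's actual implementation (which also tracks KV pages, reference counts, and LRU eviction).

```python
class RadixNode:
    def __init__(self, key=()):
        self.key = key        # token IDs on the edge leading into this node
        self.children = {}    # first token ID of a child edge -> RadixNode

def _common_len(a, b):
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def insert(root, tokens):
    """Insert a token sequence, splitting edges at divergence points."""
    node, i = root, 0
    while i < len(tokens):
        child = node.children.get(tokens[i])
        if child is None:
            node.children[tokens[i]] = RadixNode(tuple(tokens[i:]))
            return
        k = _common_len(child.key, tokens[i:])
        if k < len(child.key):               # split the edge at the fork
            mid = RadixNode(child.key[:k])
            child.key = child.key[k:]
            mid.children[child.key[0]] = child
            node.children[tokens[i]] = mid
            child = mid
        node, i = child, i + k

def match_prefix(root, tokens):
    """Return the length of the longest cached prefix of `tokens`."""
    node, i = root, 0
    while i < len(tokens):
        child = node.children.get(tokens[i])
        if child is None:
            break
        k = _common_len(child.key, tokens[i:])
        i += k
        if k < len(child.key):
            break
        node = child
    return i

root = RadixNode()
insert(root, (1, 2, 3, 4, 5))            # e.g., a cached system prompt
print(match_prefix(root, (1, 2, 3, 9)))  # -> 3 tokens reusable
```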
8. CPU Overhead with Normal Scheduler
9. Overlap Scheduler to Eliminate CPU Overhead
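The two slides above contrast the normal scheduler, where CPU batch preparation and GPU execution alternate serially, with the overlap scheduler. A toy sketch of the overlap idea, with illustrative stand-ins for the real scheduling and forward-pass work: while the "GPU" executes batch t, the CPU prepares batch t+1, hiding scheduling and bookkeeping behind GPU time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_prepare(step):            # stand-in for batch scheduling + assembly
    time.sleep(0.002)
    return f"batch-{step}"

def gpu_forward(batch):           # stand-in for the model forward pass
    time.sleep(0.010)

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(cpu_prepare, 0)
    for step in range(8):
        batch = pending.result()                      # this step's batch is ready
        pending = pool.submit(cpu_prepare, step + 1)  # prep next in background
        gpu_forward(batch)                            # overlaps with the prep above
```

SGLang's real scheduler achieves the same effect in one process by exploiting asynchronous CUDA kernel launches rather than threads.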
10. 02
Speculative Decoding and
Constrained Decoding in SGLang
11. Speculative Decoding with EAGLE
• EAGLE is a SOTA speculative decoding algorithm.
• SGLang supports both EAGLE-2 and EAGLE-3 speculative decoding. We were the first to support EAGLE-3, in collaboration with the EAGLE team (see the sketch below).
• Achieved a 1.6x decoding speedup with EAGLE-2 and a 2.4x decoding speedup with EAGLE-3 on Llama 3.1 8B with a single request.
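A hedged sketch of enabling EAGLE speculative decoding through SGLang's offline Engine API. The argument names follow sglang's server arguments at the time of writing but may differ across versions, and the draft-model path is illustrative.

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    speculative_algorithm="EAGLE",          # "EAGLE3" selects EAGLE-3
    speculative_draft_model_path="yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
    speculative_num_steps=5,                # autoregressive draft steps per round
    speculative_eagle_topk=4,               # branching factor of the draft tree
    speculative_num_draft_tokens=8,         # tokens verified per target forward
)
print(llm.generate("The capital of France is",
                   {"temperature": 0, "max_new_tokens": 32}))
```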
12. Speculative Decoding with MTP
• SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) on top of EAGLE speculative decoding.
• With this optimization, decoding speed improves by 1.8x at batch size 1 and 1.5x at batch size 32 on the H200 TP8 setting.
13. Zero-Overhead Integration with XGrammar
• SGLang + XGrammar can be up to 10x faster than other open-source solutions for JSON
decoding tasks.
• https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar
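A small sketch of constrained JSON decoding: a JSON schema passed in the sampling params confines generation to schema-valid output, compiled by XGrammar. The flag and key names follow SGLang's docs but may vary by version.

```python
import json
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct",
                 grammar_backend="xgrammar")

schema = json.dumps({
    "type": "object",
    "properties": {"name": {"type": "string"},
                   "population": {"type": "integer"}},
    "required": ["name", "population"],
})

print(llm.generate(
    "Give information about the capital of France in JSON.",
    {"temperature": 0, "max_new_tokens": 64, "json_schema": schema},
))
```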
14. 03
Efficient Design and
Implementation of PD Disaggregation
15. Issues with Non-Disaggregation
• Prefill Interruption: Prefill batches often preempt ongoing decode tasks, delaying
token generation.
• DP Attention Imbalance: DP workers may handle prefill and decode simultaneously,
causing load imbalance and increased latency.
• Incompatible with DeepEP: Prefill and decode use different dispatch modes. Without
disaggregation, DeepEP cannot support both within the same communication group
under DP attention.
16. PD Disaggregation Architecture Design
• Unified load balancer (LB) for both prefill and decode paths; a sketch of the routing idea follows.
• The LB is decoupled from computation logic: requests are sent to the LB and then routed to a selected PD pair.
• KV transfer supports non-blocking and RDMA-based transfer.
• SGLang offers flexible API integration, e.g., NIXL and Mooncake.
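An illustrative sketch (not SGLang's router code) of the decoupled load balancer: it only selects a (prefill, decode) pair and forwards the request; the KV cache then flows directly between the two instances (e.g., over RDMA), never through the LB. Endpoints and field names are hypothetical.

```python
import itertools
import random

PREFILL = ["prefill-0:30000", "prefill-1:30000"]
DECODE = ["decode-0:30001", "decode-1:30001"]
next_prefill = itertools.cycle(PREFILL)

def route(request):
    """Pick a PD pair; the LB never touches model computation."""
    p = next(next_prefill)              # round-robin over prefill instances
    d = random.choice(DECODE)           # a real LB would pick the least loaded
    request["bootstrap_pair"] = (p, d)  # decode pre-allocates; prefill sends KV
    return p, d

print(route({"prompt": "hello"}))
```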
17. PD Disaggregation Timeline
• Prefill instance: sender init → handshake with the decode instance → prefill forward → notify receiver → KV transfer.
• Decode instance: receiver init → KV cache pre-allocation → decode forward once the KV transfer completes.
18. 04
Large-scale EP Support for
DeepSeek Blog Reproduction
19. Parallelism Strategies with Dense FFN
• Enhanced Scalability: Avoids TP fragmentation on
large hidden dims (e.g., 18432), ensuring better
alignment and utilization.
• Optimized Memory Efficiency: Prefill & decode
phases both benefit from low TP degrees under DP
attention, reducing per-device memory.
• Minimized Communication Overhead: Replaces
two all-reduces (in TP) with one reduce-scatter + one
all-gather.
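A sketch of the collective swap in the last bullet, using torch.distributed. An all-reduce is numerically equivalent to a reduce-scatter followed by an all-gather; keeping the intervening work on the scattered shard is what lets the two per-layer TP all-reduces collapse into one RS + one AG. Shapes and group setup are illustrative and assume init_process_group was already called.

```python
import torch
import torch.distributed as dist

def ffn_output_sync(partial, group=None):
    world = dist.get_world_size(group)
    # Instead of: dist.all_reduce(partial, group=group)
    shard = torch.empty(partial.shape[0] // world, *partial.shape[1:],
                        dtype=partial.dtype, device=partial.device)
    dist.reduce_scatter_tensor(shard, partial, group=group)  # sum + shard rows
    # ... per-shard work (e.g., DP-attention input for the next layer) goes here ...
    full = torch.empty_like(partial)
    dist.all_gather_into_tensor(full, shard, group=group)    # reassemble rows
    return full
```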
20. Parallelism Strategies with Sparse FFN (MoE)
• Scalable Model Capacity: Expert weights are
partitioned across devices using Expert Parallelism,
removing memory bottlenecks.
• Optimized Communication: Follows a Dispatch →
Expert → Combine pattern, powered by DeepEP and
Two-Batch-Overlap to minimize latency and overhead.
• Addressing Load Imbalance: EP introduces variability
in routing; EPLB and DeepEP optimize for workload
distribution.
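A toy, single-device sketch of the Dispatch → Expert → Combine pattern named above (DeepEP implements it across devices with fused all-to-all kernels; everything here is illustrative).

```python
import torch

def moe_layer(x, router, experts, top_k=2):
    scores = torch.softmax(router(x), dim=-1)      # [tokens, num_experts]
    weights, idx = scores.topk(top_k, dim=-1)      # Dispatch: pick top-k experts
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):           # Expert: process routed tokens
        token_ids, slot = (idx == e).nonzero(as_tuple=True)
        if token_ids.numel():
            y = expert(x[token_ids])
            # Combine: weight by router score and scatter-add back.
            out.index_add_(0, token_ids, y * weights[token_ids, slot, None])
    return out

router = torch.nn.Linear(16, 4)
experts = torch.nn.ModuleList(torch.nn.Linear(16, 16) for _ in range(4))
print(moe_layer(torch.randn(8, 16), router, experts).shape)  # torch.Size([8, 16])
```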
21. Compatibility Issue with DeepEP
DeepEP offers two dispatch modes, plus an auto mode that selects between them:
• Normal: prefill-friendly, but no CUDA Graph support.
• Low-Latency: decode-friendly, supports CUDA Graph.
• Auto: handles both, but is incompatible with DP attention under unified (non-disaggregated) scheduling.
• PD disaggregation resolves the incompatibility between DeepEP dispatch and DP attention.
22. Improper Launch Order of TBO
• TBO: communication and computation are expected to execute simultaneously.
• Dispatch is synchronizing: it blocks the CPU until the GPU receives metadata (required for allocating correctly sized tensors).
• An improper launch order, e.g., dispatch before the MLP, blocks further kernel launches and leaves the computation stream idle.
23. Proper Launch Order of TBO
• Proper launch order: submitting computation tasks to the GPU before launching CPU-
blocking communication.
• Computation → Communication: enabling the GPU to remain active during communication.
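A sketch of the launch-order rule: enqueue micro-batch B's computation on a CUDA stream before issuing micro-batch A's CPU-blocking dispatch, so the GPU has queued work while the host waits for dispatch metadata. The `dispatch_blocking` callable is a stand-in for DeepEP's dispatch, not its actual API.

```python
import torch

compute_stream = torch.cuda.Stream()

def overlapped_step(mlp, x_a, x_b, dispatch_blocking):
    with torch.cuda.stream(compute_stream):
        y_b = mlp(x_b)               # async launch: GPU now has work queued
    meta_a = dispatch_blocking(x_a)  # CPU blocks here, but the GPU is not idle
    return y_b, meta_a
```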
24. Clean Implementation: Two-Batch-Overlap
• Abstracted execution via operation list + yield points, enabling cooperative scheduling.
• Eliminates code duplication and reduces the need for variable name suffixes.
• Efficiently manages partial completion at layer boundaries.
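A minimal sketch of the "operation list + yield points" abstraction: each micro-batch is a generator that yields at its layer boundaries, and a driver alternates between the two, so one micro-batch's communication overlaps the other's computation. This mirrors the abstraction, not SGLang's actual code.

```python
def micro_batch(name, num_layers):
    for layer in range(num_layers):
        print(f"{name}: attention + MLP compute, layer {layer}")
        yield                    # yield point: hand control to the other batch
        print(f"{name}: dispatch/combine communication, layer {layer}")
        yield

def run_two_batch_overlap(num_layers=2):
    live = [micro_batch("micro-batch A", num_layers),
            micro_batch("micro-batch B", num_layers)]
    while live:                  # round-robin until both batches complete
        for gen in list(live):
            try:
                next(gen)
            except StopIteration:
                live.remove(gen)

run_two_batch_overlap()
```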
25. Throughput Performance
Throughputs of prefill (P) and decode (D) phases are evaluated independently, assuming unlimited resources for the
non-tested phase to isolate and maximize the load on the tested nodes, mirroring the setup used by DeepSeek.
26. Expert Parallelism Load Balancer
Real-World Serving Challenges
• Imbalance worsens at scale, with expert usage skewing causing idle GPU time.
Strategies to Improve Balance
• Larger Batch Sizes: Reduces randomness in expert routing; enabled via cluster scaling or Multi-
Token Prediction (MTP).
• Periodic Rebalancing: Adapts to input shifts over time; requires low-cost expert weight reloads.
SGLang Implementation
• Exchange expert weights with torch P2P operations.
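A hedged sketch of the weight exchange in the last bullet, using torch point-to-point ops: given a new expert-to-GPU mapping, each rank posts isend/irecv pairs for the expert weights that move. It assumes every rank holds same-shaped buffers per expert and that init_process_group was called; the `moves` plan is illustrative.

```python
import torch.distributed as dist

def exchange_experts(moves, weights):
    """moves: [(expert_id, src_rank, dst_rank)]; weights: expert_id -> tensor."""
    rank, ops = dist.get_rank(), []
    for expert_id, src, dst in moves:
        if rank == src:
            ops.append(dist.P2POp(dist.isend, weights[expert_id], dst))
        elif rank == dst:
            ops.append(dist.P2POp(dist.irecv, weights[expert_id], src))
    for work in dist.batch_isend_irecv(ops) if ops else []:
        work.wait()  # block until all posted sends/receives complete
```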
27. Effects of Scale and EPLB on Balancedness
• Balancedness: the ratio of mean to maximum per-GPU computation time for an MoE layer (computed below).
• Balancedness decreases as the system scales to more nodes.
• Enabling EPLB significantly improves balance.
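The balancedness metric from the first bullet, for one MoE layer, with illustrative timing numbers:

```python
def balancedness(per_gpu_times_ms):
    # mean per-GPU compute time divided by the max (1.0 = perfectly balanced)
    return sum(per_gpu_times_ms) / len(per_gpu_times_ms) / max(per_gpu_times_ms)

print(balancedness([9.8, 10.1, 10.0, 14.3]))  # skewed expert load -> ~0.77
```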
28. 05
Hierarchical Caching Design
in SGLang
29. Hierarchical Caching Design in SGLang
• Versatile control plane achieves resource-aware scheduling and effective latency hiding.
• Efficient I/O data plane using specialized CUDA kernels.
• Ongoing storage-layer integration (generic APIs compatible with popular storage backends like Mooncake).
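An illustrative three-tier lookup sketch: device memory first, then host memory, then a storage backend. Class and method names are assumptions for illustration, not SGLang's HiCache interfaces; the real system moves KV pages with dedicated CUDA kernels and overlaps copies with computation.

```python
class HierarchicalKVCache:
    def __init__(self, storage):
        self.gpu, self.cpu, self.storage = {}, {}, storage

    def get(self, prefix_key):
        if prefix_key in self.gpu:            # hot: already on device
            return self.gpu[prefix_key]
        if prefix_key in self.cpu:            # warm: promote host -> device
            self.gpu[prefix_key] = self.cpu[prefix_key]
            return self.gpu[prefix_key]
        blob = self.storage.read(prefix_key)  # cold: fetch from storage tier
        if blob is not None:
            self.cpu[prefix_key] = self.gpu[prefix_key] = blob
        return blob
```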
30. 06
The Ecosystem of SGLang
31. About SGLang Team
• The SGLang team is incubated by LMSYS Org.
• Major Maintainers: Lianmin Zheng, Ying Sheng, Liangsheng Yin, Yineng Zhang, Ke Bao, Byron Hsu,
Chenyang Zhao, Zhiqiang Xie, Jingyi Chen, Xiaoyu Zhang, Baizhou Zhang, Yi Zhang, Jiexin Liang,
Chang Su, Hai Xiao.
• Contributors: 400+
32. Community Adoptions