SGLang: An Efficient Open-Source Framework for Large-Scale LLM Serving
1. Speaker: Liangsheng Yin
2. 01 SGLang Milestones and Features Overview
02 Speculative Decoding and Constrained Decoding in SGLang
03 Efficient Design and Implementation of PD Disaggregation
04 Large-scale EP Support for DeepSeek Blog Reproduction
05 Hierarchical Caching Design in SGLang
06 The Ecosystem of SGLang
4.
• SGLang is a fast serving engine for LLMs and VLMs.
• Among fully open-source LLM inference engines, SGLang currently achieves state-of-the-art (SOTA) performance, and it is the first open-source implementation to nearly match the throughput reported in the official DeepSeek blog at large scale.
• Meanwhile, its elegant, lightweight, and customizable design has attracted wide adoption from academics, big tech companies, and startups (xAI, NVIDIA, AMD, Baseten, Microsoft, LinkedIn, etc.).
• In on-policy RLHF, inference engines are crucial for efficient policy model execution, and SGLang excels as a
high-performance solution.
5. 01
SGLang Milestones
and Features Overview
6. SGLang Milestones and Features
• 2023/12-2024/02: Initial Motivation, Structured LM Programming, Prefix Caching, and Constrained Decoding
• 2024/07: Leading Performance among inference engines on Llama3
• 2024/09: v0.3 Release, 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision
• 2024/12: v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware DP Router, X-Grammar Integration, The First
to Serve DeepSeek V3.
• 2025/01: SGLang provides day-one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. (10+ companies!)
• 2025/05: First open-source implementation of DeepSeek V3/R1 expert parallelism with prefill-decode disaggregation. Achieves 52.3K in-tok/s and 22.3K out-tok/s on 96 GPUs, 5x faster than vanilla TP.
• SGLang has seen extensive adoption and serves as the dominant inference engine for AMD and the default inference
engine for xAI.
7. RadixAttention handles complex reuse patterns
RadixAttention enables efficient prefix matching, insertion, and eviction. It handles trees with hundreds of thousands of tokens. A minimal sketch of the underlying idea follows.
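The sketch below shows the core data structure only: cached prefixes live in a radix tree keyed by token IDs, so a new request reuses the longest matching cached prefix. Names and structure are illustrative, not SGLang's actual implementation (which also tracks KV pages, reference counts, and LRU eviction).

```python
class RadixNode:
    def __init__(self, key=()):
        self.key = key        # token IDs on the edge leading into this node
        self.children = {}    # first token ID of a child edge -> RadixNode

def _common_len(a, b):
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def insert(root, tokens):
    """Insert a token sequence, splitting edges at divergence points."""
    node, i = root, 0
    while i < len(tokens):
        child = node.children.get(tokens[i])
        if child is None:
            node.children[tokens[i]] = RadixNode(tuple(tokens[i:]))
            return
        k = _common_len(child.key, tokens[i:])
        if k < len(child.key):               # split the edge at the fork
            mid = RadixNode(child.key[:k])
            child.key = child.key[k:]
            mid.children[child.key[0]] = child
            node.children[tokens[i]] = mid
            child = mid
        node, i = child, i + k

def match_prefix(root, tokens):
    """Return the length of the longest cached prefix of `tokens`."""
    node, i = root, 0
    while i < len(tokens):
        child = node.children.get(tokens[i])
        if child is None:
            break
        k = _common_len(child.key, tokens[i:])
        i += k
        if k < len(child.key):
            break
        node = child
    return i

root = RadixNode()
insert(root, (1, 2, 3, 4, 5))            # e.g., a cached system prompt
print(match_prefix(root, (1, 2, 3, 9)))  # -> 3 tokens reusable
```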
8. CPU Overhead with Normal Scheduler
9. Overlap Scheduler to Eliminate CPU Overhead
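The two slides above contrast the normal scheduler, where CPU batch preparation and GPU execution alternate serially, with the overlap scheduler. A toy sketch of the overlap idea, with illustrative stand-ins for the real scheduling and forward-pass work: while the "GPU" executes batch t, the CPU prepares batch t+1, hiding scheduling and bookkeeping behind GPU time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_prepare(step):            # stand-in for batch scheduling + assembly
    time.sleep(0.002)
    return f"batch-{step}"

def gpu_forward(batch):           # stand-in for the model forward pass
    time.sleep(0.010)

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(cpu_prepare, 0)
    for step in range(8):
        batch = pending.result()                      # this step's batch is ready
        pending = pool.submit(cpu_prepare, step + 1)  # prep next in background
        gpu_forward(batch)                            # overlaps with the prep above
```

SGLang's real scheduler achieves the same effect in one process by exploiting asynchronous CUDA kernel launches rather than threads.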
10. 02
Speculative Decoding and
Constrained Decoding in SGLang
11. Speculative Decoding with EAGLE
• EAGLE is a SOTA speculative decoding algorithm.
• SGLang supports both EAGLE-2 and EAGLE-3 speculative decoding. We were the first to support EAGLE-3, in collaboration with the EAGLE team (see the sketch below).
• Achieved a 1.6x decoding speedup with EAGLE-2 and a 2.4x decoding speedup with EAGLE-3 on Llama 3.1 8B with a single request.
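A hedged sketch of enabling EAGLE speculative decoding through SGLang's offline Engine API. The argument names follow sglang's server arguments at the time of writing but may differ across versions, and the draft-model path is illustrative.

```python
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    speculative_algorithm="EAGLE",          # "EAGLE3" selects EAGLE-3
    speculative_draft_model_path="yuhuili/EAGLE-LLaMA3.1-Instruct-8B",
    speculative_num_steps=5,                # autoregressive draft steps per round
    speculative_eagle_topk=4,               # branching factor of the draft tree
    speculative_num_draft_tokens=8,         # tokens verified per target forward
)
print(llm.generate("The capital of France is",
                   {"temperature": 0, "max_new_tokens": 32}))
```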
12. Speculative Decoding with MTP
• SGLang implements DeepSeek V3 Multi-Token Prediction (MTP) on top of EAGLE speculative decoding.
• With this optimization, decoding speed improves by 1.8x at batch size 1 and 1.5x at batch size 32 on the H200 TP8 setting.
13. Zero-Overhead Integration with XGrammar
• SGLang + XGrammar can be up to 10x faster than other open-source solutions for JSON
decoding tasks.
• https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar
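A small sketch of constrained JSON decoding: a JSON schema passed in the sampling params confines generation to schema-valid output, compiled by XGrammar. The flag and key names follow SGLang's docs but may vary by version.

```python
import json
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct",
                 grammar_backend="xgrammar")

schema = json.dumps({
    "type": "object",
    "properties": {"name": {"type": "string"},
                   "population": {"type": "integer"}},
    "required": ["name", "population"],
})

print(llm.generate(
    "Give information about the capital of France in JSON.",
    {"temperature": 0, "max_new_tokens": 64, "json_schema": schema},
))
```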
14. 03
Efficient Design and
Implementation of PD Disaggregation
15. Issues with Non-Disaggregation
• Prefill Interruption: Prefill batches often preempt ongoing decode tasks, delaying
token generation.
• DP Attention Imbalance: DP workers may handle prefill and decode simultaneously,
causing load imbalance and increased latency.
• Incompatible with DeepEP: Prefill and decode use different dispatch modes. Without
disaggregation, DeepEP cannot support both within the same communication group
under DP attention.
16. PD Disaggregation Architecture Design
• Unified load balancer (LB) for both prefill and decode paths; a sketch of the routing idea follows.
• The LB is decoupled from computation logic: requests are sent to the LB and then routed to a selected PD pair.
• KV transfer supports non-blocking and RDMA-based transfer.
• SGLang offers flexible API integration, e.g., NIXL and Mooncake.
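An illustrative sketch (not SGLang's router code) of the decoupled load balancer: it only selects a (prefill, decode) pair and forwards the request; the KV cache then flows directly between the two instances (e.g., over RDMA), never through the LB. Endpoints and field names are hypothetical.

```python
import itertools
import random

PREFILL = ["prefill-0:30000", "prefill-1:30000"]
DECODE = ["decode-0:30001", "decode-1:30001"]
next_prefill = itertools.cycle(PREFILL)

def route(request):
    """Pick a PD pair; the LB never touches model computation."""
    p = next(next_prefill)              # round-robin over prefill instances
    d = random.choice(DECODE)           # a real LB would pick the least loaded
    request["bootstrap_pair"] = (p, d)  # decode pre-allocates; prefill sends KV
    return p, d

print(route({"prompt": "hello"}))
```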
17. PD Disaggregation Timeline
• Prefill instance: sender init → handshake with the decode instance → prefill forward → notify receiver → KV transfer.
• Decode instance: receiver init → KV cache pre-allocation → decode forward once the KV transfer completes.
18. 04
Large-scale EP Support for
DeepSeek Blog Reproduction
19. Parallelism Strategies with Dense FFN
• Enhanced Scalability: Avoids TP fragmentation on
large hidden dims (e.g., 18432), ensuring better
alignment and utilization.
• Optimized Memory Efficiency: Prefill & decode
phases both benefit from low TP degrees under DP
attention, reducing per-device memory.
• Minimized Communication Overhead: Replaces
two all-reduces (in TP) with one reduce-scatter + one
all-gather.
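A sketch of the collective swap in the last bullet, using torch.distributed. An all-reduce is numerically equivalent to a reduce-scatter followed by an all-gather; keeping the intervening work on the scattered shard is what lets the two per-layer TP all-reduces collapse into one RS + one AG. Shapes and group setup are illustrative and assume init_process_group was already called.

```python
import torch
import torch.distributed as dist

def ffn_output_sync(partial, group=None):
    world = dist.get_world_size(group)
    # Instead of: dist.all_reduce(partial, group=group)
    shard = torch.empty(partial.shape[0] // world, *partial.shape[1:],
                        dtype=partial.dtype, device=partial.device)
    dist.reduce_scatter_tensor(shard, partial, group=group)  # sum + shard rows
    # ... per-shard work (e.g., DP-attention input for the next layer) goes here ...
    full = torch.empty_like(partial)
    dist.all_gather_into_tensor(full, shard, group=group)    # reassemble rows
    return full
```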
20. Parallelism Strategies with Sparse FFN (MoE)
• Scalable Model Capacity: Expert weights are
partitioned across devices using Expert Parallelism,
removing memory bottlenecks.
• Optimized Communication: Follows a Dispatch →
Expert → Combine pattern, powered by DeepEP and
Two-Batch-Overlap to minimize latency and overhead.
• Addressing Load Imbalance: EP introduces variability
in routing; EPLB and DeepEP optimize for workload
distribution.
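A toy, single-device sketch of the Dispatch → Expert → Combine pattern named above (DeepEP implements it across devices with fused all-to-all kernels; everything here is illustrative).

```python
import torch

def moe_layer(x, router, experts, top_k=2):
    scores = torch.softmax(router(x), dim=-1)      # [tokens, num_experts]
    weights, idx = scores.topk(top_k, dim=-1)      # Dispatch: pick top-k experts
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):           # Expert: process routed tokens
        token_ids, slot = (idx == e).nonzero(as_tuple=True)
        if token_ids.numel():
            y = expert(x[token_ids])
            # Combine: weight by router score and scatter-add back.
            out.index_add_(0, token_ids, y * weights[token_ids, slot, None])
    return out

router = torch.nn.Linear(16, 4)
experts = torch.nn.ModuleList(torch.nn.Linear(16, 16) for _ in range(4))
print(moe_layer(torch.randn(8, 16), router, experts).shape)  # torch.Size([8, 16])
```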
21. Compatibility Issue with DeepEP
DeepEP offers two dispatch modes, plus an auto mode that selects between them:
• Normal: prefill-friendly, but no CUDA Graph support.
• Low-Latency: decode-friendly, supports CUDA Graph.
• Auto: handles both, but is incompatible with DP attention under unified (non-disaggregated) scheduling.
• PD disaggregation resolves the incompatibility between DeepEP dispatch and DP attention.
22. Improper Launch Order of TBO
• TBO: communication and computation are expected to execute simultaneously.
• Dispatch is synchronizing: it blocks the CPU until the GPU receives metadata (required for allocating correctly sized tensors).
• An improper launch order, e.g., dispatch before the MLP, blocks further kernel launches and leaves the computation stream idle.
23. Proper Launch Order of TBO
• Proper launch order: submitting computation tasks to the GPU before launching CPU-
blocking communication.
• Computation → Communication: enabling the GPU to remain active during communication.
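A sketch of the launch-order rule: enqueue micro-batch B's computation on a CUDA stream before issuing micro-batch A's CPU-blocking dispatch, so the GPU has queued work while the host waits for dispatch metadata. The `dispatch_blocking` callable is a stand-in for DeepEP's dispatch, not its actual API.

```python
import torch

compute_stream = torch.cuda.Stream()

def overlapped_step(mlp, x_a, x_b, dispatch_blocking):
    with torch.cuda.stream(compute_stream):
        y_b = mlp(x_b)               # async launch: GPU now has work queued
    meta_a = dispatch_blocking(x_a)  # CPU blocks here, but the GPU is not idle
    return y_b, meta_a
```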
24. Clean Implementation: Two-Batch-Overlap
• Abstracted execution via operation list + yield points, enabling cooperative scheduling.
• Eliminates code duplication and reduces the need for variable name suffixes.
• Efficiently manages partial completion at layer boundaries.
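A minimal sketch of the "operation list + yield points" abstraction: each micro-batch is a generator that yields at its layer boundaries, and a driver alternates between the two, so one micro-batch's communication overlaps the other's computation. This mirrors the abstraction, not SGLang's actual code.

```python
def micro_batch(name, num_layers):
    for layer in range(num_layers):
        print(f"{name}: attention + MLP compute, layer {layer}")
        yield                    # yield point: hand control to the other batch
        print(f"{name}: dispatch/combine communication, layer {layer}")
        yield

def run_two_batch_overlap(num_layers=2):
    live = [micro_batch("micro-batch A", num_layers),
            micro_batch("micro-batch B", num_layers)]
    while live:                  # round-robin until both batches complete
        for gen in list(live):
            try:
                next(gen)
            except StopIteration:
                live.remove(gen)

run_two_batch_overlap()
```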
25. Throughput Performance
Throughputs of prefill (P) and decode (D) phases are evaluated independently, assuming unlimited resources for the
non-tested phase to isolate and maximize the load on the tested nodes, mirroring the setup used by DeepSeek.
26. Expert Parallelism Load Balancer
Real-World Serving Challenges
• Imbalance worsens at scale, with expert usage skewing causing idle GPU time.
Strategies to Improve Balance
• Larger Batch Sizes: Reduces randomness in expert routing; enabled via cluster scaling or Multi-
Token Prediction (MTP).
• Periodic Rebalancing: Adapts to input shifts over time; requires low-cost expert weight reloads.
SGLang Implementation
• Exchange expert weights with torch P2P operations.
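A hedged sketch of the weight exchange in the last bullet, using torch point-to-point ops: given a new expert-to-GPU mapping, each rank posts isend/irecv pairs for the expert weights that move. It assumes every rank holds same-shaped buffers per expert and that init_process_group was called; the `moves` plan is illustrative.

```python
import torch.distributed as dist

def exchange_experts(moves, weights):
    """moves: [(expert_id, src_rank, dst_rank)]; weights: expert_id -> tensor."""
    rank, ops = dist.get_rank(), []
    for expert_id, src, dst in moves:
        if rank == src:
            ops.append(dist.P2POp(dist.isend, weights[expert_id], dst))
        elif rank == dst:
            ops.append(dist.P2POp(dist.irecv, weights[expert_id], src))
    for work in dist.batch_isend_irecv(ops) if ops else []:
        work.wait()  # block until all posted sends/receives complete
```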
27. Effects of Scale and EPLB on Balancedness
• Balancedness: the ratio of mean to maximum per-GPU computation time for an MoE layer (computed below).
• Balancedness decreases as the system scales to more nodes.
• Enabling EPLB significantly improves balance.
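The balancedness metric from the first bullet, for one MoE layer, with illustrative timing numbers:

```python
def balancedness(per_gpu_times_ms):
    # mean per-GPU compute time divided by the max (1.0 = perfectly balanced)
    return sum(per_gpu_times_ms) / len(per_gpu_times_ms) / max(per_gpu_times_ms)

print(balancedness([9.8, 10.1, 10.0, 14.3]))  # skewed expert load -> ~0.77
```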
28. 05
Hierarchical Caching Design
in SGLang
29. Hierarchical Caching Design in SGLang
• Versatile control plane achieves resource-aware scheduling and effective latency hiding.
• Efficient I/O data plane using specialized CUDA kernels.
• Ongoing storage-layer integration (generic APIs compatible with popular storage backends like Mooncake).
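An illustrative three-tier lookup sketch: device memory first, then host memory, then a storage backend. Class and method names are assumptions for illustration, not SGLang's HiCache interfaces; the real system moves KV pages with dedicated CUDA kernels and overlaps copies with computation.

```python
class HierarchicalKVCache:
    def __init__(self, storage):
        self.gpu, self.cpu, self.storage = {}, {}, storage

    def get(self, prefix_key):
        if prefix_key in self.gpu:            # hot: already on device
            return self.gpu[prefix_key]
        if prefix_key in self.cpu:            # warm: promote host -> device
            self.gpu[prefix_key] = self.cpu[prefix_key]
            return self.gpu[prefix_key]
        blob = self.storage.read(prefix_key)  # cold: fetch from storage tier
        if blob is not None:
            self.cpu[prefix_key] = self.gpu[prefix_key] = blob
        return blob
```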
30. 06
The Ecosystem of SGLang
31. About SGLang Team
• The SGLang team is incubated by LMSYS Org.
• Major Maintainers: Lianmin Zheng, Ying Sheng, Liangsheng Yin, Yineng Zhang, Ke Bao, Byron Hsu,
Chenyang Zhao, Zhiqiang Xie, Jingyi Chen, Xiaoyu Zhang, Baizhou Zhang, Yi Zhang, Jiexin Liang,
Chang Su, Hai Xiao.
• Contributors: 400+
32. Community Adoptions