Cracking the Cost of Large-Model Inference: YRCache's Practice of Accelerating Inference by Trading Storage for Compute

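1. Cracking the Cost of Large-Model Inference: YRCache's Practice of Accelerating Inference by Trading Storage for Compute. Zhang Tao (张涛), CTO, YanRong Technology (焱融科技)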

3. Contents
01 KVCache: background and challenges
02 The YRCache multi-level caching solution
03 Acceleration results on inference workloads
04 Summary and outlook

4. 01 KVCache: Background and Challenges

5. The two core goals of inference optimization: [text lost in extraction] and raising total system throughput

6. How KVCache works and why it matters: avoid repeated computation and improve compute efficiency

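7. Prefix Cache | Improving compute efficiency in the Prefill stage
✦ Requests that share a prefix have exactly the same KVCache for that prefix, so there is no need to recompute it.
✦ Agent/tools workloads carry a shared system prompt from [?]K to [?]K in length (the figures are lost in the source), which does not need to be recomputed.
✦ In multi-round conversations the history has to be kept so that the model remembers the context. The more rounds there are, the more computation is repeated, the more KVCache is reused, and the more Prefix Cache pays off.
✦ Trading storage for compute saves a large amount of compute resources and in turn improves overall inference performance.
Example 1: Shared system prompt
Request A: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: Hello
Request B: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. User: How are you?
Example 2: Multi-round conversation
Prompt (round 1): Human: What's AI?
LLM Result (round 1): LLM: AI is technology that simulates human intelligence, like Siri or Google Maps.
Prompt (round 2): Human: What's AI? LLM: AI is technology that simulates human intelligence, like Siri or Google Maps. Human: Cool, thanks!
LLM Result (round 2): LLM: No problem!
[Figure legend: the shared portion of each prompt is highlighted.]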
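A minimal sketch of the prefix-matching idea on this slide, under stated assumptions: tokens are grouped into fixed-size blocks, and each block's hash also covers everything before it, so two requests share cache entries only for their common prefix. The names `block_hashes` and `PrefixCache`, the block size of 4, and the fake token ids are illustrative and do not come from the talk.

```python
import hashlib
from typing import Dict, List, Sequence

BLOCK_SIZE = 4  # tokens per block; kept tiny for illustration (assumption)


def block_hashes(tokens: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block together with the hash of the blocks before it."""
    hashes: List[str] = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Maps chained block hashes to a stored KV payload (a string stand-in here)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def insert(self, tokens: Sequence[int]) -> None:
        for i, h in enumerate(block_hashes(tokens)):
            self._store.setdefault(h, f"kv-block-{i}")

    def matched_tokens(self, tokens: Sequence[int]) -> int:
        """Number of leading tokens whose KV entries can be reused."""
        matched = 0
        for h in block_hashes(tokens):
            if h not in self._store:
                break
            matched += BLOCK_SIZE
        return matched


system_prompt = list(range(100, 112))     # 12 shared tokens = 3 full blocks
request_a = system_prompt + [1, 2, 3, 4]  # stands in for "User: Hello"
request_b = system_prompt + [5, 6, 7, 8]  # stands in for "User: How are you?"

cache = PrefixCache()
cache.insert(request_a)
print(cache.matched_tokens(request_b))  # 12
```

Request B reuses the three blocks of the shared system prompt and only has to compute its own final block.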

8. Challenges

9. Common KVCache Offloading solutions: LMCache and Mooncake Store
[Figure, LMCache: LLM Engine (with LMCache). GPU Memory offloads overflow KV to CPU DRAM, with on-demand reuse. CPU DRAM writes asynchronously (LRU evict) to a disk storage backend, and hot KV is prefetched back. The disk backend uploads asynchronously to a remote storage backend (Object Store/PFS), which is fetched on reuse.]
[Figure, Mooncake Store: a Mooncake Managed Store Master and Inference Servers 1 to N. Each server has a VRAM/DRAM Paged KVCache, a Store Client (Object Get/Put/Replicate/List/Del), a Transfer Engine (BatchTransfer Read/Write/Flush), and Managed Pool Buffers. The Mem Cache Pool and Other Cache Pool are made up of RAM Segments, NVMe-oF Segments, and RPC Segments.]

10. Challenges of adapting to inference platforms
[Figure, vLLM adaptation: Scheduler, Module Runner, GPU VRAM. KVConnectorBase is implemented by MultiConnector, LMCacheConnectorV1, NixlConnector, and SharedStorageConnector. Backends shown: LMCache (Local CPU DRAM, Local Disk, redis, Mooncake), In-Memory CPU, NIXL Peer, YRCache.]
[Figure, SGLang adaptation: Scheduler, Module Runner, RadixCache, HiRadixCache, GPU VRAM. HiCacheStorage is implemented by HiCacheFile, HiCacheHF3FS, MooncakeStore, HiCacheNixl, and YRCache.]

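11. Impact of KVCache Offloading on the PD-disaggregated architecture: pros and cons of the KVCache Offloading approach
Advantages
✦ Shorter KVCache residency in P-node VRAM: once the KVCache has been offloaded, the VRAM can be released, and the P node does not have to wait for the D node to come and fetch it.
✦ Decoupling: P and D do not need to see or talk to each other, which lowers coupling between nodes and makes independent scaling and failure recovery easier.
✦ Better fault tolerance: whether P or D fails, the KVCache is not lost.
Disadvantages
✦ Shared storage is introduced, which adds one extra IO and increases latency.
✦ Cost: building and maintaining a shared storage system is relatively expensive and complex.
[Figure: a Prefill Request goes to the prefill instance and a Decode Request goes to the decoder instance. Each instance has a Scheduler, a Model executor, and a KV connector, with KV and metadata transmitted through Shared Storage. Prefill side: (1) query the whole input, no cache found, prepare metadata for store (get_num_computed_blocks_and_prep_store); (2) external_computed_blocks=0; (3) store_kv. Decode side: (4) query input, found cache; (5) allocate_slots(num_computed_blocks + external_computed_blocks); (6) prepare metadata for loading KV (with prep_load); then load_kv.]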

12. KVCache Offloading: faster access speed, out-of-the-box use?, simpler platform adaptation

13. 02 The YRCache Multi-Level Caching Solution

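14. YRCache overall architecture
✦ YRCache overall design
1. YRCache is logically split into two layers: the connector layer that integrates with the inference platform, and the Cache Engine layer that manages the cache.
2. The Connector layer is embedded in the inference platform as a plugin. For example, the Scheduler calls the Connector's query interface directly.
3. The Cache Engine layer is implemented in C++ for high-performance concurrent data processing, and exposes both C++ and Python interfaces.
4. YRCache supports multiple storage media as cache layers. The layers are configurable and can be combined freely.
[Figure: Inference GPU Worker (vLLM/SGLang/…) with VRAM as the L0 cache. YRConnector calls the YRCache Engine as a dynamic library (C++ API). Engine features: configurable cache policies, asynchronous load, asynchronous offload, metrics, fault tolerance, load balancing, resource control, compression. L1 cache: DRAM (PCIe/NVMf). L2 cache: Disk. L3 cache: Shared Storage (YRCloudFile) with a Shared Meta Store, reached over InfiniBand/RoCE/Ethernet.]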

15. Cache data management: CPU memory
[Figure: VRAM KVCache Blocks (block size default: 16). Multiple non-contiguous blocks are moved into a contiguous memory region. An LRU Indexer maps Key to Value. A Buddy CPU Memory Allocator tracks Allocated Memory Bytes and Free Memory Bytes.]
Object Key
• model name
• world size
• world id
• layer id
• chunk hash key
Block Object
• model shape
• dtype
• memory format
• physical address
• physical size
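A minimal sketch of how the Object Key and the LRU Indexer on this slide could fit together, assuming a byte-capacity limit on the memory layer. The field names follow the slide; `MemoryLRUIndexer` and its methods are hypothetical, and plain Python `bytes` stand in for memory that the slide's Buddy CPU Memory Allocator would hand out.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObjectKey:
    """Key fields as listed on the slide."""
    model_name: str
    world_size: int
    world_id: int
    layer_id: int
    chunk_hash: str


class MemoryLRUIndexer:
    """LRU index from ObjectKey to a chunk payload, bounded by total bytes."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[ObjectKey, bytes]" = OrderedDict()

    def put(self, key: ObjectKey, value: bytes) -> None:
        if key in self._entries:
            self.used_bytes -= len(self._entries.pop(key))
        self._entries[key] = value
        self.used_bytes += len(value)
        # Evict least recently used entries until the layer fits its capacity.
        while self.used_bytes > self.capacity_bytes and self._entries:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)

    def get(self, key: ObjectKey) -> Optional[bytes]:
        value = self._entries.get(key)
        if value is not None:
            self._entries.move_to_end(key)  # mark as most recently used
        return value


idx = MemoryLRUIndexer(capacity_bytes=8)
k1 = ObjectKey("demo-model", 2, 0, 0, "h1")
k2 = ObjectKey("demo-model", 2, 0, 0, "h2")
k3 = ObjectKey("demo-model", 2, 0, 0, "h3")
idx.put(k1, b"aaaa")
idx.put(k2, b"bbbb")
idx.get(k1)            # touch k1 so that k2 becomes the eviction candidate
idx.put(k3, b"cccc")   # exceeds 8 bytes, so k2 is evicted
print(idx.get(k2), idx.used_bytes)  # None 8
```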

16. Cache data management: local disk
[Figure: a hash key identifies a KVCache chunk (256 tokens). An LRU Indexer maps Key to Value. Each KVCache File stores chunks along with each chunk's offset and length. Persistent state consists of the YRCache file pool and the YRCache index snapshot file list.]
YRCache Metadata, cache key
• model name
• world size
• world id
• layer id
• chunk hash key
YRCache Metadata, location value
• file id (refers to one file)
• offset in file
• data length
• aligned length
• model shape
• dtype
• memory format
• physical address
• physical size
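A minimal sketch of the location value on this slide (file id, offset in file, data length, aligned length), assuming chunks are appended to a file and padded to an alignment boundary. The 4096-byte alignment, the `ChunkFile` class, and the file name are assumptions for illustration; the talk does not state them.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Dict

ALIGNMENT = 4096  # assumed I/O alignment in bytes; not stated in the slides


def align_up(n: int, alignment: int = ALIGNMENT) -> int:
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


@dataclass
class Location:
    file_id: int
    offset: int
    data_length: int
    aligned_length: int


class ChunkFile:
    """Append-only file of KVCache chunks, each padded to ALIGNMENT."""

    def __init__(self, path: str, file_id: int) -> None:
        self.path = path
        self.file_id = file_id
        self.index: Dict[str, Location] = {}  # chunk hash -> location
        self._next_offset = 0
        open(path, "wb").close()  # create or truncate the file

    def write_chunk(self, chunk_hash: str, data: bytes) -> Location:
        loc = Location(self.file_id, self._next_offset, len(data), align_up(len(data)))
        with open(self.path, "r+b") as f:
            f.seek(loc.offset)
            f.write(data.ljust(loc.aligned_length, b"\0"))  # zero-pad to alignment
        self.index[chunk_hash] = loc
        self._next_offset += loc.aligned_length
        return loc

    def read_chunk(self, chunk_hash: str) -> bytes:
        loc = self.index[chunk_hash]
        with open(self.path, "rb") as f:
            f.seek(loc.offset)
            return f.read(loc.data_length)  # drop the padding


path = os.path.join(tempfile.mkdtemp(), "kvcache-0.bin")
cf = ChunkFile(path, file_id=0)
first = cf.write_chunk("hash-a", b"x" * 5000)
second = cf.write_chunk("hash-b", b"y" * 100)
print(first.aligned_length, second.offset)  # 8192 8192
```

Keeping both `data_length` and `aligned_length` lets the reader return exactly the stored bytes while every chunk still starts on an aligned offset.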

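17. Cache data management: shared file storage
[Figure: a hash key identifies a KVCache chunk (256 tokens). Each KVCache File stores chunks along with each chunk's offset in the file and length. Files live in the YRCache file pool, and the YRCache Metadata is kept in redis.]
YRCache Metadata, cache key
• model name
• world size
• world id
• layer id
• chunk hash key
YRCache Metadata, location value
• file id (refers to one file)
• offset in file
• data length
• aligned length
• model shape
• dtype
• memory format
• physical address
• physical size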

18. KVCache read/write flows: the write flow
✦ KVCache cache write flow
1. Allocate Blocks in VRAM, compute the KVCache, and write it into the Blocks.
2. Allocate CPU memory and copy the KVCache data into CPU memory.
3. Write the data into the corresponding Chunk Object.
4. Notify the upper layer that the write is complete.
5. Write to the other storage layers asynchronously, as needed.
[Figure: Request Tokens go to the VRAM KVCache Block Manager, with 16 tokens of KVCache per Block. The Memory LRU Indexer maps chunk hash to object pointer. The Memory Object Store holds 256 tokens per Object in a DRAM Memory Pool. Below it sits the YRCache Lower Cache Layer (SSD/PFS). The numbers 1 to 5 in the figure mark the steps above.]
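A minimal sketch of steps 2 to 5 of this write flow. The 16-token blocks and 256-token objects come from the slide; everything else is an assumption for illustration, including the class names, the use of a background thread and queue for the asynchronous write, and hashing the whole token prefix to form the chunk key.

```python
import hashlib
import queue
import threading
from typing import Dict, List

TOKENS_PER_BLOCK = 16    # from the slide
TOKENS_PER_CHUNK = 256   # from the slide
BLOCKS_PER_CHUNK = TOKENS_PER_CHUNK // TOKENS_PER_BLOCK


class LowerLayer:
    """Stand-in for the SSD/PFS layer (hypothetical)."""

    def __init__(self) -> None:
        self.data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self.data[key] = value


class MemoryObjectStore:
    def __init__(self, lower: LowerLayer) -> None:
        self.objects: Dict[str, bytes] = {}
        self._lower = lower
        self._pending: "queue.Queue[str]" = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def _flush_loop(self) -> None:
        # Step 5: write chunks to the lower layer in the background.
        while True:
            key = self._pending.get()
            self._lower.set(key, self.objects[key])
            self._pending.task_done()

    def write(self, token_ids: List[int], blocks: List[bytes]) -> List[str]:
        keys: List[str] = []
        for c in range(len(blocks) // BLOCKS_PER_CHUNK):
            # Assumed key scheme: hash of all tokens up to the end of this chunk.
            prefix = token_ids[:(c + 1) * TOKENS_PER_CHUNK]
            key = hashlib.sha256(repr(prefix).encode()).hexdigest()
            # Steps 2-3: gather 16 scattered blocks into one contiguous object.
            chunk_blocks = blocks[c * BLOCKS_PER_CHUNK:(c + 1) * BLOCKS_PER_CHUNK]
            self.objects[key] = b"".join(chunk_blocks)
            self._pending.put(key)
            keys.append(key)
        return keys  # step 4: return to the caller without waiting for the flush

    def wait_for_flush(self) -> None:
        self._pending.join()


lower = LowerLayer()
store = MemoryObjectStore(lower)
token_ids = list(range(512))
blocks = [bytes([i]) * 4 for i in range(32)]  # 32 fake 4-byte blocks
keys = store.write(token_ids, blocks)
store.wait_for_flush()
print(len(keys), len(store.objects[keys[0]]))  # 2 64
```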

19. KVCache read/write flows: asynchronous flush of the cache
✦ KVCache asynchronous flush flow
1. Write the in-memory data block to a local file.
2. Record the index information in the LRU Indexer.
3. Periodically write the index information to disk.
4. Pass the data through to the shared file system storage and write it to a shared storage file.
5. After the write completes, update the global index of the shared storage.
[Figure: VRAM KVCache Block Manager with Blocks. The Memory Object Store (256 tokens per Object) and Memory LRU Indexer (chunk hash to object pointer) sit on a Memory Pool. The Local Disk Object Store has a Memory LRU Indexer (chunk hash to file + offset + length) that is flushed to an Indexer CheckPoint File, plus a File List Per Disk over the Local Disk List. The Shared Storage Object Store has a Shared Indexer (chunk hash to file path + offset + length) and a File List Per Disk on Shared Storage. The numbers 1 to 5 mark the steps above.]

20. KVCache read/write flows: the read flow on a cache hit
✦ KVCache read flow on a cache hit
1. Allocate VRAM Blocks to hold the KVCache and query the Prefix Cache at the same time. Tokens that the Prefix Cache does not hit are looked up in YRCache.
2. YRCache first queries the memory cache. On a hit, go straight to step 7.
3. On a memory-cache miss, allocate a piece of CPU memory and pass the request to the local-disk cache. On a local-disk hit, read the data into CPU memory and go to step 6.
4. On a local-disk miss, pass the request to the shared-storage cache. On a hit, read the data into CPU memory.
5. Data hit in shared storage is returned to the local-disk layer, which decides from its configuration whether to write it back to local disk, and passes the data on to the layer above.
6. The local-disk layer returns the hit data to the memory cache, which decides from its configuration whether to write it back into the memory cache pool.
7. Copy the hit data from memory into GPU VRAM.
[Figure: Request Tokens go to the VRAM KVCache Block Manager, with 16 tokens of KVCache per Block. The Memory Object Store (256 tokens per Object) and Memory LRU Indexer (chunk hash to object pointer) sit on a Memory Pool. The Local Disk Object Store has a Memory LRU Indexer (chunk hash to file + offset + length), an Indexer CheckPoint File that is flushed periodically, and a File List Per Disk over the Local Disk List. The Shared Indexer (chunk hash to file path + offset + length) and a File List Per Disk sit on Shared Storage (YRCloudFile). The numbers 1 to 7 mark the steps above.]
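A minimal sketch of the tier-by-tier lookup in steps 2 to 6, assuming each tier is a simple key-value map with a per-tier write-back switch. `Tier`, `read_chunk`, and the `write_back` flag are hypothetical names, and the final copy into VRAM (step 7) is left out.

```python
from typing import Dict, List, Optional, Tuple


class Tier:
    def __init__(self, name: str, write_back: bool = True) -> None:
        self.name = name
        self.write_back = write_back  # stands in for "according to configuration"
        self.data: Dict[str, bytes] = {}


def read_chunk(key: str, tiers: List[Tier]) -> Tuple[Optional[bytes], Optional[str]]:
    """Search the tiers from fastest to slowest and return (data, tier that hit)."""
    for depth, tier in enumerate(tiers):
        if key in tier.data:
            value = tier.data[key]
            # Steps 5-6: hand the data back up, writing it back where configured.
            for upper in tiers[:depth]:
                if upper.write_back:
                    upper.data[key] = value
            return value, tier.name
    return None, None  # full miss: the engine has to recompute this chunk


memory = Tier("memory")
disk = Tier("local-disk", write_back=False)
shared = Tier("shared-storage")
shared.data["chunk-1"] = b"kv"

first = read_chunk("chunk-1", [memory, disk, shared])
second = read_chunk("chunk-1", [memory, disk, shared])
print(first, second)  # (b'kv', 'shared-storage') (b'kv', 'memory')
```

The second read is served from memory because the memory tier wrote the data back on the first read, while the local-disk tier, configured without write-back here, stays empty.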

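21. Cache replacement policy
✦ Data lifecycle
• Memory layer: the amount of data is bounded by the configured memory cache capacity. Data beyond the threshold is evicted by LRU. Data evicted from the memory layer has generally already been persisted to SSD or the shared file system.
• Local-disk layer: in local-disk mode, each disk's data is managed independently. Each disk bounds the amount of data written to it by its configured capacity, and data beyond the threshold is evicted by LRU.
• Shared-storage layer: historical data is cleaned up periodically. Data that has not been accessed for a configured period is removed on a schedule.
✦ Flexible cache configuration: YRCache lets every action at every layer be configured with its own data-flow policy.
• Lookup/Get
kStopAtCurrentLayer: look only in the current cache layer and return a miss if the data is not there.
kTryFromLowerLayer: if the current layer misses, look in the lower cache layers.
• Set
kStopAtCurrentLayer: cache the data only in the current layer and do not pass it down.
kSetAlsoLowerLayer: cache the data in the current layer and also pass it down to the lower layers.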
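A minimal sketch of how the four policy values above could steer a chain of cache layers. The policy names are taken from the slide; the `CacheLayer` class, its constructor, and the way layers are linked are assumptions for illustration and say nothing about YRCache's actual configuration interface.

```python
from enum import Enum
from typing import Dict, Optional


class GetPolicy(Enum):
    kStopAtCurrentLayer = "stop"
    kTryFromLowerLayer = "try_lower"


class SetPolicy(Enum):
    kStopAtCurrentLayer = "stop"
    kSetAlsoLowerLayer = "also_lower"


class CacheLayer:
    def __init__(self, name: str, get_policy: GetPolicy, set_policy: SetPolicy,
                 lower: Optional["CacheLayer"] = None) -> None:
        self.name = name
        self.get_policy = get_policy
        self.set_policy = set_policy
        self.lower = lower
        self.data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        if key in self.data:
            return self.data[key]
        # Only fall through when this layer is configured to try lower layers.
        if self.get_policy is GetPolicy.kTryFromLowerLayer and self.lower is not None:
            return self.lower.get(key)
        return None

    def set(self, key: str, value: bytes) -> None:
        self.data[key] = value
        # Only pass the data down when this layer is configured to do so.
        if self.set_policy is SetPolicy.kSetAlsoLowerLayer and self.lower is not None:
            self.lower.set(key, value)


shared = CacheLayer("shared", GetPolicy.kStopAtCurrentLayer, SetPolicy.kStopAtCurrentLayer)
disk = CacheLayer("disk", GetPolicy.kTryFromLowerLayer, SetPolicy.kSetAlsoLowerLayer, shared)
memory = CacheLayer("memory", GetPolicy.kTryFromLowerLayer, SetPolicy.kStopAtCurrentLayer, disk)

memory.set("a", b"1")  # stays in memory only
disk.set("b", b"2")    # lands on disk and in shared storage
print("a" in disk.data, "b" in shared.data, memory.get("b"))  # False True b'2'
```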

22. Overall architecture of YRCloudFile, the high-performance file system

23. YanRong 追光 all-in-one appliance F9000X: supports E3.S/U.2 PCIe 5.0 TLC and QLC NVMe SSDs; supports GDS; advanced features; Multi Channel performance optimization

24. YRCache
✦ Integrating with multiple inference platforms: one mechanism that connects to several inference platforms
Advantages
1. Provides unified multi-level cache management.
2. The YRCache code is easy to maintain.
Disadvantages
1. The adaptation patches have to be maintained separately.
2. Keeping up with version compatibility is a heavy workload.
[Figure, vLLM: Scheduler, Module Runner, GPU VRAM. KVConnectorBase_V1 is implemented by LMCacheConnectorV1, MultiConnector, SharedStorageConnector, NixlConnector, and YRCacheConnector.]
KVConnectorBase_V1, worker-side
· bind_connector_metadata
· clear_connector_metadata
· register_kv_caches
· start_load_kv
· wait_for_layer_load
· save_kv_layer
· wait_for_save
· get_finished
KVConnectorBase_V1, scheduler-side
· get_num_new_matched_tokens
· update_state_after_alloc
· build_connector_meta
· request_finished
[Figure, SGLang: Scheduler, Module Runner, RadixCache, HiRadixCache, GPU VRAM. HiCacheStorage is implemented by HiCacheFile, HiCacheHF3FS, MooncakeStore, HiCacheNixl, and YRCacheConnector. Other backends shown: Local CPU DRAM, LMCache.]
[Figure, YRCache tiers: CPU DRAM on GPU Servers 1 to 4, Local NVMe SSDs on GPU Servers 1 to 4, and YRCloudFile as shared storage.]
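A toy illustration of the scheduler-side and worker-side split listed on this slide. It borrows two method names from the list, but the signatures are simplified guesses and are not vLLM's actual `KVConnectorBase_V1` interface. `ToyKVConnector`, the chunk size of 4, and the in-process set that stands in for the external cache are all hypothetical.

```python
from typing import List, Set, Tuple


class ToyKVConnector:
    """Toy connector: names follow the slide, signatures are simplified guesses."""

    def __init__(self, chunk_tokens: int = 256) -> None:
        self.chunk_tokens = chunk_tokens
        self._external: Set[Tuple[int, ...]] = set()  # prefixes held by the external cache

    # scheduler-side
    def get_num_new_matched_tokens(self, token_ids: List[int],
                                   num_computed_tokens: int) -> int:
        """Tokens the external cache can supply beyond the local prefix cache."""
        matched = 0
        for end in range(self.chunk_tokens, len(token_ids) + 1, self.chunk_tokens):
            if tuple(token_ids[:end]) not in self._external:
                break
            matched = end
        return max(0, matched - num_computed_tokens)

    # worker-side
    def save_kv_layer(self, token_ids: List[int]) -> None:
        """Record every full chunk-aligned prefix as stored (payload omitted)."""
        for end in range(self.chunk_tokens, len(token_ids) + 1, self.chunk_tokens):
            self._external.add(tuple(token_ids[:end]))


conn = ToyKVConnector(chunk_tokens=4)
conn.save_kv_layer(list(range(10)))              # stores the 4- and 8-token prefixes
new_request = list(range(8)) + [99, 98, 97, 96]  # shares the first 8 tokens
print(conn.get_num_new_matched_tokens(new_request, num_computed_tokens=4))  # 4
```

The scheduler-side call only answers how much can be loaded; moving the data is left to the worker-side methods, which matches the division shown on the slide.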

25. 03 Acceleration Results on Inference Workloads

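26. Intelligent customer service / multi-round conversation scenario
Test setup (multi-round Q&A): with the same 2x NVIDIA 4090 GPU configuration and the Qwen3-32B model, multi-round-qa.py simulates multiple users holding multi-round conversations with the model at the same time, in order to measure the throughput and latency of the serving engine. As concurrency increases, native vLLM (PrefixCache) and vLLM + YRCache are each tested.
Test conclusions
• In the multi-round conversation scenario, once YRCache hits the cache, TTFT stays within 2 seconds. Compared with native vLLM, the time taken is as low as 1/20. YRCache's overall hit rate is 56%.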

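27. RAG knowledge-base Q&A scenario
Test setup (knowledge base): with the same 2x NVIDIA 4090 GPU configuration and the Qwen3-32B model, evalscope simulates knowledge-base requests with 4096 input tokens and 1024 output tokens for concurrency testing. As concurrency increases, native vLLM (PrefixCache) and vLLM + YRCache are each tested.
Test conclusions
• Across all concurrency levels, in the round where the cache is hit, YRCache reduces overall time by roughly 55% to 80%.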

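28. YRCache vs. LMCache performance comparison: vLLM performs worst, with essentially no KVCache hits. YRCache's memory mode, local-disk mode, and remote shared-storage mode all show no performance difference from LMCache.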

29. YRCache vs. LMCache performance comparison: the cache-hit round [Chart; surviving series labels: YRCache-Multi-SSD, YRCache-ShareStorage]

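30. YRCache vs. LMCache performance comparison: test setup and conclusions
Test setup: with the same 2x NVIDIA 4090 GPU configuration and the Qwen3-32B model, evalscope runs concurrency tests at a fixed concurrency level against native vLLM, vLLM + LMCache, and vLLM + YRCache. The first round simulates the case where KVCache cannot be hit; the second round simulates the case where KVCache is hit.
Test conclusions
• In memory mode, LMCache and YRCache differ little.
• In local-disk mode, YRCache's TTFT latency is 35% of LMCache's.
• In shared-storage mode, YRCache's TTFT latency is 31% of LMCache's.
[Chart: "DeepSeek14B-单机打卡-4090-并发测试" (DeepSeek14B, single machine, 4090, concurrency test). Series: vLLM (PrefixCache), LMCache-Mem, YRCache-Mem, LMCache-SSD, YRCache-SSD, YRCache-Multi-SSD, YRCache-ShareStorage. Legend: cache hit, cache miss. Numbers as extracted, with axis ticks mixed in and the mapping to series lost: 0.8 0.74 0.581 0.57 0.6 0.608 0.619 0.612 0.597 0.4 0.601 0.318 0.177 0.2 0.113 0.16 0.112 0.172 0]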

31. 04 Summary and Outlook

32. Summary: hot data is promoted to the upper cache layers

33. Future plans

35. THANKS. Large Language Model Is Redefining The Software