Training Stability Practices with DLRover

1. Speaker: Ma Jieyue (马介悦)
2. 01 Introduction 02 DLRover 03 Flash Checkpoint 04 XPUTimer 05 Open Source 06 Q&A
4. 01 Introduction
5. Challenges for AI Infra from End to End
   • Training Data: NAS, OSS
   • Modeling …: model size growing by scaling law; Pretrain, Finetuning, Online learning
   • AI APPs
8. 02 DLRover
10. • Elastic training • Networking & Precheck • Training fault tolerance • Resource management
11. DLRover core capabilities
17. 03 Flash Checkpoint
18. Core features; solution background
19. Asynchronous persistence; resumable checkpoint saving
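The asynchronous-persistence idea named on this slide can be illustrated with a minimal sketch. It is not DLRover's Flash Checkpoint API: the class `AsyncCheckpointer`, its methods, and the `step_<n>.pkl` file layout are hypothetical names chosen for illustration, and `pickle` stands in for a real tensor serializer. The sketch assumes the training loop only blocks for an in-memory copy, while a background thread writes to disk and renames the file atomically.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: snapshot in memory, persist in the background."""

    def __init__(self, directory):
        self.directory = directory
        self._worker = None

    def save(self, step, state):
        # Blocking part: copy the state so training may keep mutating it.
        snapshot = copy.deepcopy(state)
        # Keep at most one write in flight; wait for the previous one.
        self.wait()
        self._worker = threading.Thread(
            target=self._persist, args=(step, snapshot)
        )
        self._worker.start()

    def _persist(self, step, snapshot):
        final_path = os.path.join(self.directory, f"step_{step}.pkl")
        # Write to a temp file, then rename, so readers never see a partial file.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, final_path)

    def wait(self):
        # Block until the in-flight write (if any) has finished.
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def latest(self):
        # Return (step, state) for the newest complete checkpoint, or None.
        steps = [
            int(name[5:-4])
            for name in os.listdir(self.directory)
            if name.startswith("step_") and name.endswith(".pkl")
        ]
        if not steps:
            return None
        step = max(steps)
        with open(os.path.join(self.directory, f"step_{step}.pkl"), "rb") as f:
            return step, pickle.load(f)
```

In a training loop, `save(step, state)` would be called every few steps and `latest()` once at restart. The deep copy is the only cost paid on the training thread, which is the point of persisting asynchronously.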
20. 04 XPUTimer
21. Hard problems in large-scale training; XPUTimer core capabilities
22. Algorithm Bugs and Infra Bugs, by symptom
   • Error: start crash/hang (can't finish one step); runtime crash/hang; OS error, GPU error, network error
   • Slowdown compared to historical or prior training jobs: new algorithm, unnecessary synchronization, unoptimized kernel
   • Slowdown compared to historical or prior training steps: memory management, GPU downgrade, network jitter
23. • Training process (Megatron, FSDP, etc.): error, hang, slowdown
   • XPUTimer tracing daemon: timing, event & stack
   • Diagnostic engine: hang-error diagnosis; slowdown diagnosis (macro metrics, micro metrics)
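As a rough illustration of diagnosing slowdown relative to prior training steps, the sketch below flags a step whose duration exceeds a multiple of the recent median. This is not XPUTimer's diagnostic engine: the class `StepSlowdownDetector`, the window size, the 1.5x ratio, and the minimum history length are all assumptions made for the example.

```python
from collections import deque
from statistics import median


class StepSlowdownDetector:
    """Hypothetical detector: compare each step to the recent median."""

    def __init__(self, window=20, ratio=1.5, min_history=5):
        self.history = deque(maxlen=window)  # recent normal step times
        self.ratio = ratio                   # assumed slowdown threshold
        self.min_history = min_history       # steps needed before judging

    def observe(self, step_seconds):
        """Return True if this step is slow relative to prior steps."""
        slow = (
            len(self.history) >= self.min_history
            and step_seconds > self.ratio * median(self.history)
        )
        # Only normal steps update the baseline, so a sustained slowdown
        # does not gradually become the new "normal".
        if not slow:
            self.history.append(step_seconds)
        return slow
```

A median baseline is used here because a single outlier step barely moves it; a real system would also need the micro metrics from the slide to say why the step was slow.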
24. • Runtime (Python GC, Dataloader, Synchronization): API intercept & timing
   • Kernel (cuBLAS, FlashAttention etc., Custom OP, NCCL): kernel intercept & event injection
   • Threads: training thread, tracing thread, metric (Prometheus) thread
   • Python runtime: intercept API; CUDA runtime: intercept kernel; recorded event queue, event pool
   • Timing manager; diagnostic engine: stack, timeline
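The "API intercept & timing" idea for the Python runtime can be sketched in pure Python. This is only an analogy, not how XPUTimer is implemented: the helper `intercept`, the `events` container (standing in for the recorded event queue), and the event names are hypothetical. The only library features relied on are `gc.callbacks` and `time.perf_counter` from the standard library.

```python
import functools
import gc
import time
from collections import deque

# Stand-in for the "recorded event queue": (name, duration_seconds) tuples.
# deque.append takes no Python-level lock, so it is safe inside a GC callback.
events = deque()


def intercept(owner, name):
    """Replace owner.name with a wrapper that records how long each call takes."""
    original = getattr(owner, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return original(*args, **kwargs)
        finally:
            events.append((name, time.perf_counter() - start))

    setattr(owner, name, wrapper)
    return original  # returned so the caller can restore it later


_gc_started = {}


def _gc_callback(phase, info):
    # The interpreter calls this with phase "start" and "stop" around
    # every garbage collection; info["generation"] is the generation.
    gen = info["generation"]
    if phase == "start":
        _gc_started[gen] = time.perf_counter()
    elif gen in _gc_started:
        elapsed = time.perf_counter() - _gc_started.pop(gen)
        events.append((f"gc.gen{gen}", elapsed))


gc.callbacks.append(_gc_callback)
```

For example, `intercept(SomeLoader, "next_batch")` on a hypothetical loader class would make every call to `next_batch` appear in `events`, alongside any GC pauses, without changing the training script itself.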
26. [Figure: per-step timeline for Rank 0 and Rank 1, each with a CPU thread (Dataloader, GC, Sync), a GPU comp stream, and a GPU comm stream; annotated metrics: FLOPS, latency, bandwidth, empty, inter-step]
29. Transparent to users; high data efficiency; low overhead, high precision; lightweight and friendly
30. 05 Open Source
31. https://github.com/intelligent-machine-learning/dlrover
32. https://lfaidata.foundation/projects/dl-rover/