Training Stability Practices with DLRover

1. Speaker: 马介悦 (Ma Jieyue)

2. 01 Introduction · 02 DLRover · 03 Flash Checkpoint · 04 XPUTimer · 05 Open Source · 06 Q&A

3.

4. 01 Introduction

5. Challenges for AI Infra from End to End: Training Data (NAS, OSS) → Modeling (…; model size growing by scaling law; Pretrain, Finetuning, Online learning) → AI Apps

6.

7.

8. 02 DLRover

9.

10. • Elastic training • Network setup & precheck • Training fault tolerance • Resource management

11. DLRover core capabilities

12.

13.

14.

15.

16.

17. 03 Flash Checkpoint

18. Core features · Solution · Background

19. Asynchronous persistence · Resumable checkpoint saving
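The asynchronous persistence idea named on this slide can be illustrated with a minimal sketch. It is not DLRover's Flash Checkpoint implementation: the class `AsyncCheckpointer`, its methods, and the `step_<n>.pkl` file layout are hypothetical, and a plain Python dict stands in for model state. The training loop only pays for an in-memory copy; the slow write to storage happens on a background thread.

```python
import copy
import os
import pickle
import threading


class AsyncCheckpointer:
    """Hypothetical helper: snapshot state in memory, persist it in the background."""

    def __init__(self, directory):
        self.directory = directory
        self._worker = None

    def save(self, step, state):
        # Wait for the previous write so at most one persist is in flight.
        self.wait()
        # Blocking part: an in-memory copy, so training may keep mutating `state`.
        snapshot = copy.deepcopy(state)
        self._worker = threading.Thread(target=self._persist, args=(step, snapshot))
        self._worker.start()

    def _persist(self, step, snapshot):
        final_path = os.path.join(self.directory, f"step_{step}.pkl")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
        # Atomic rename: readers never see a partially written checkpoint file.
        os.replace(tmp_path, final_path)

    def wait(self):
        # Block until the in-flight write (if any) has reached storage.
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def load_latest(self):
        # Resume from the newest fully persisted checkpoint, ignoring *.tmp files.
        steps = [
            int(name[len("step_"):-len(".pkl")])
            for name in os.listdir(self.directory)
            if name.startswith("step_") and name.endswith(".pkl")
        ]
        if not steps:
            return None, None
        latest = max(steps)
        with open(os.path.join(self.directory, f"step_{latest}.pkl"), "rb") as f:
            return latest, pickle.load(f)
```

In a training loop, `save(step, state)` would be called every N steps and `load_latest()` once on restart. Writing to a temporary file and renaming it is one simple way to make a crash during persistence harmless, since the previous checkpoint stays intact.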

20. 04 XPUTimer

21. Hard problems in large-scale training · XPUTimer core capabilities

22. Symptoms (Error, Slowdown) by cause (Algorithm Bugs, Infra Bugs):
    Error, Algorithm Bugs: start crash/hang, can't finish one step
    Error, Infra Bugs (OS error, GPU error, network error): runtime crash/hang
    Slowdown, Algorithm Bugs (new algorithm, unnecessary synchronization, unoptimized kernel): slowdown compared to historical or prior training jobs
    Slowdown, Infra Bugs (memory management, GPU downgrade, network jitter): slowdown compared to historical or prior training steps
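"Slowdown compared to historical or prior training steps" suggests comparing each step's duration against a baseline built from recent steps. The following is a rough sketch of that idea only, not the diagnostic logic of XPUTimer or DLRover: the class name `SlowdownDetector`, the rolling-median baseline, and the default window and threshold values are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import median


class SlowdownDetector:
    """Hypothetical detector: flag a step that is much slower than recent steps."""

    def __init__(self, window=20, threshold=1.5, min_history=5):
        self.threshold = threshold      # assumed ratio over baseline that counts as slow
        self.min_history = min_history  # steps needed before any judgment is made
        self._history = deque(maxlen=window)

    def observe(self, step_seconds):
        history = list(self._history)
        slow = (
            len(history) >= self.min_history
            and step_seconds > self.threshold * median(history)
        )
        # Keep slow steps out of the baseline so one spike does not raise it.
        if not slow:
            self._history.append(step_seconds)
        return slow
```

Usage would be one call to `observe(duration)` per training step. A median is used here rather than a mean because a single outlier step should not shift the baseline; a real system would likely also compare against earlier jobs, which this sketch does not attempt.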

23. [Diagram] Training process (Megatron, FSDP, etc.) with error / hang / slowdown → events & stacks → XPUTimer (timing, tracing) → daemon → diagnostic engine (hang/error diagnosis; slowdown diagnosis with macro metrics and micro metrics)

24. [Diagram] Traced targets: Python runtime (GC, Dataloader, Synchronization) via API intercept & timing; kernels (cuBLAS, FlashAttention etc., Custom OP, NCCL) via kernel intercept & event injection. Components: training thread, tracing thread, metric (Prometheus) thread; Python runtime (intercept API) and CUDA runtime (intercept kernel) feeding a timing manager with a recorded event queue and event pool; diagnostic engine (stack, timeline)
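"API intercept & timing" feeding a recorded event queue can be pictured at the Python level with a small sketch. This is only an analogy and not XPUTimer's API or mechanism: the names `EVENTS`, `intercept`, `load_batch`, and `drain` are hypothetical, and the sketch times host-side Python calls only, not CUDA kernels.

```python
import functools
import queue
import time

# Hypothetical stand-in for a recorded-event queue drained by a tracing thread.
EVENTS = queue.Queue()


def intercept(name):
    """Wrap a callable so every call records (name, start, duration)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Recorded even if the call raises, so failures stay visible.
                EVENTS.put((name, start, time.perf_counter() - start))
        return wrapper
    return decorator


@intercept("dataloader.next")
def load_batch(n):
    # Placeholder workload standing in for a real dataloader call.
    return list(range(n))


def drain():
    """Collect all recorded events without blocking."""
    events = []
    while True:
        try:
            events.append(EVENTS.get_nowait())
        except queue.Empty:
            return events
```

The wrapped function behaves exactly as before; the only side effect is one queue entry per call, which a separate thread could turn into metrics or a timeline.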

25.

26. [Timeline diagram] Per-rank timelines (Rank 0, Rank 1) over a step, each with a CPU thread (Dataloader, Sync, GC), a GPU comp stream, and a GPU comm stream; annotated metrics: FLOPS, comp latency, comm bandwidth, empty (idle) intervals, inter-step gaps

27.

28.

29. Transparent to users · High data efficiency · Low overhead, high precision · Lightweight and user-friendly

30. 05 Open Source

31. https://github.com/intelligent-machine-learning/dlrover

32. https://lfaidata.foundation/projects/dl-rover/

33.

34.