Training Stability Practices Based on DLRover
1. Speaker: Ma Jieyue (马介悦)
2. 01 Introduction
02 DLRover
03 Flash Checkpoint
04 XPUTimer
05 Open Source
06 Q&A
4. 01
5. Challenges for AI Infra from End to End
[Figure: end-to-end AI pipeline. Labels: Training Data (NAS, OSS), Modeling, …, Model size growing by scaling law, Pretrain, Finetuning, Online learning, AI APPs]
8. 02
10. • Elastic training
• Node networking & Precheck
• Training fault tolerance
• Resource management
11. DLRover Core Capabilities
17. 03
18. Core Features
Background
19. Asynchronous persistence; resumable checkpoint saving
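The idea of asynchronous persistence can be illustrated with a minimal sketch: the training thread only pays for an in-memory snapshot, while a background thread writes the snapshot to storage. This is not DLRover's Flash Checkpoint API; the class name `AsyncCheckpointer`, the file naming scheme, and the use of `pickle` are all assumptions made for illustration, and a real system would copy GPU tensors to host memory rather than deep-copy a Python dict.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Hypothetical helper: snapshot in the caller, persist in the background."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        self._worker = None

    def save(self, step, state):
        # Block only if the previous write is still in flight.
        self.wait()
        # Snapshot in memory so training can keep mutating `state`.
        snapshot = copy.deepcopy(state)
        self._worker = threading.Thread(
            target=self._persist, args=(step, snapshot), daemon=True
        )
        self._worker.start()

    def _persist(self, step, snapshot):
        # Write to a temp file, then rename, so a crash mid-write
        # never leaves a truncated checkpoint under the final name.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        with os.fdopen(fd, "wb") as f:
            pickle.dump({"step": step, "state": snapshot}, f)
            f.flush()
            os.fsync(f.fileno())
        final_path = os.path.join(self.directory, f"ckpt_{step:08d}.pkl")
        os.replace(tmp_path, final_path)

    def wait(self):
        # Join the background writer, if any.
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def load_latest(self):
        # Return the newest complete checkpoint, or None if there is none.
        names = sorted(n for n in os.listdir(self.directory) if n.endswith(".pkl"))
        if not names:
            return None
        with open(os.path.join(self.directory, names[-1]), "rb") as f:
            return pickle.load(f)
```

In a training loop, `save(step, state)` would be called every N steps and `load_latest()` once at startup to resume; because only complete files carry the `.pkl` suffix, a restart after a crash always sees a consistent checkpoint.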
20. 04
21. Hard Problems in Large-Scale Training
XPUTimer Core Capabilities
22. Error
• Start crash/hang, can't finish one step: Algorithm Bugs, Infra Bugs
• Runtime crash/hang: OS Error, GPU Error, Network Error
Slowdown
• Slowdown compared to historical or prior training jobs: New algorithm, Unnecessary synchronization, Unoptimized kernel
• Slowdown compared to historical or prior training steps: Memory management, GPU downgrade, Network jitter
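The last category, slowdown relative to prior training steps, lends itself to a small worked example. The sketch below flags a step whose duration exceeds a multiple of the median of recent normal steps. It is an illustrative heuristic only, not XPUTimer's diagnostic logic; the class name `StepSlowdownDetector`, the window size, and the 1.5x threshold are assumptions.

```python
from collections import deque
from statistics import median


class StepSlowdownDetector:
    """Hypothetical detector comparing each step against recent history."""

    def __init__(self, window=20, threshold=1.5, min_history=5):
        self.history = deque(maxlen=window)  # durations of recent normal steps
        self.threshold = threshold           # slow if > threshold * median
        self.min_history = min_history       # warm-up before flagging anything

    def observe(self, step_seconds):
        """Return True if this step is slow relative to recent steps."""
        slow = False
        if len(self.history) >= self.min_history:
            slow = step_seconds > self.threshold * median(self.history)
        # Keep slow steps out of the baseline so that a sustained
        # slowdown keeps being flagged instead of becoming the new normal.
        if not slow:
            self.history.append(step_seconds)
        return slow
```

Using the median rather than the mean keeps a single outlier step from shifting the baseline; detecting slowdown relative to prior jobs would need a stored baseline from earlier runs instead of a sliding window.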
23. [Figure: XPUTimer architecture. Labels: Error, Hang, Slowdown; Training Process (Megatron, FSDP etc.); Event & Stack; XPUTimer; Timing; Tracing Daemon; Diagnostic Engine with Hang-error diagnosis and Slowdown diagnosis (macro metrics, micro metrics)]
24. [Figure: XPUTimer tracing design. Labels: Python (GC, Dataloader, Synchronization, etc) with API intercept & timing; Runtime Kernel (cuBLAS, FlashAttention, Custom OP, NCCL) with Kernel intercept & event injection; Training thread; Tracing thread; Metric; Prometheus; Python runtime (Intercept API); CUDA runtime (Intercept kernel); Timing manager; Diagnostic engine; Stack; Timeline; Recorded Event Queue; Event Pool]
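The "API intercept & timing" path on this slide can be approximated at the Python level with a wrapper that times a call and pushes an event onto a queue for a separate tracing thread to drain. This is a sketch under stated assumptions, not XPUTimer's implementation: the names `EVENTS`, `intercept`, and `load_batch` are hypothetical, and kernel-side interception with event injection on the CUDA runtime is not covered here.

```python
import functools
import queue
import time

# Hypothetical event queue that a tracing thread or daemon would drain.
EVENTS = queue.Queue()


def intercept(name):
    """Wrap a callable so each call records (name, start, duration)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Record the event even if the wrapped call raised.
                EVENTS.put((name, start, time.perf_counter() - start))
        return wrapper
    return decorator


@intercept("dataloader.next")
def load_batch(n):
    # Stand-in for a real host-side call such as fetching a batch.
    return list(range(n))
```

Keeping the wrapper down to two clock reads and a queue push is what keeps the overhead on the training thread small; aggregation and export to a metrics backend would happen on the consumer side of the queue.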
26. [Figure: per-step timeline for Rank 0 and Rank 1. Rows per rank: CPU Thread (Dataloader, Sync, GC), GPU comp stream (Comp), GPU comm stream (Comm, empty). Annotated metrics: step, FLOPS, latency, bandwidth, inter-step]
29. Transparent to users
High data efficiency
Low overhead, high precision
Lightweight and user-friendly
30. 05
31. https://github.com/intelligent-machine-learning/dlrover
32. https://lfaidata.foundation/projects/dl-rover/