Ten Architectural Lessons from the Evolution of an AI Platform

1. Ten Architectural Lessons from the Evolution of an AI Platform. Google Cloud / 王顺, AI/ML Specialist
2.
3. 王顺, Google Cloud AI/ML Specialist. Joined Google in July 2018. Helps customers with AI/ML training and inference. [...]'s AI technologies to work
4.
5. 1. What changes and what does not: training and inference are the two core tasks of AI
6.
7. AI Accelerators
8. 2. Two become one: AutoML and custom training unified in a single SDK
9.
10.
11. 3. Agile development (CI/CT/CD/CM): continuous integration / training / deployment / monitoring
12. Workflow diagram. Stages: ML Development, Training Operationalization, Continuous Training, Model Deployment, Prediction Serving, Continuous Monitoring, Data & Model Management. Artifacts passed between stages: Code & Config, Training Pipeline, Registered Model, Serving Package, Serving Logs
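The loop in the workflow diagram (train, register, serve, monitor, retrain) can be sketched in a few lines. This is only a toy illustration under stated assumptions: the "model" is just the mean of its training data, the drift rule is a fixed tolerance on the mean of the serving logs, and `registry`, `train`, and `needs_retraining` are hypothetical names, not part of any Google Cloud API.

```python
from statistics import mean

def train(data):
    """Toy 'training pipeline': the model is just the mean of the training data."""
    return {"mean": mean(data)}

def needs_retraining(model, serving_log, tolerance=0.5):
    """Toy 'continuous monitoring': flag drift when served inputs move away from the training mean."""
    return abs(mean(serving_log) - model["mean"]) > tolerance

# Hypothetical model registry: each entry is one registered model version.
registry = []
registry.append(train([1.0, 2.0, 3.0]))  # initial training run, registered as version 1

# Hypothetical serving logs collected after deployment.
serving_log = [4.0, 5.0, 6.0]

# Continuous training: retrain and register a new version when monitoring detects drift.
if needs_retraining(registry[-1], serving_log):
    registry.append(train(serving_log))
```

In a real setup each function would be a pipeline step with its own artifacts; the point here is only that monitoring output feeds back into training.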
13.
14. 4. User-driven: managed ScaNN meets the needs of enterprise customers
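ScaNN is a library for approximate vector similarity search. To make the underlying problem concrete, here is the exact brute-force baseline that such a library is meant to speed up: score every database vector against the query by inner product and keep the top k. This is not ScaNN's algorithm or API; the function name and the sample vectors are made up for illustration.

```python
def top_k_inner_product(query, database, k):
    """Exact maximum inner product search: return indices of the k best-scoring vectors."""
    scores = [
        (sum(q * x for q, x in zip(query, vec)), idx)
        for idx, vec in enumerate(database)
    ]
    # Highest score first; ties broken by lower index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores[:k]]

# Hypothetical 2-D embeddings and query.
database = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query = [1.0, 0.2]
result = top_k_inner_product(query, database, k=2)
```

The brute-force scan costs one inner product per database vector per query, which is what makes an approximate, managed index attractive at enterprise scale.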
15.
16.
17. 5. Open to all: PyTorch and TensorFlow (TF) get equal priority as frameworks
18.
19. PyTorch on Google Cloud. 2018: PyTorch officially supported in DLVM (Deep Learning VM). 2020: PyTorch/XLA supported on Cloud TPU. 2021: Vertex AI officially offers container options with PyTorch preinstalled
20.
21. 6. Best in class: NAS (neural architecture search) finds SOTA network architectures
22. https://paperswithcode.com/sota/image-classification-on-imagenet
23.
24. Image recognition
25.
26.
27. PyGlove, open sourced: https://github.com/google/pyglove
28. 7. Standing out: Reduction Server improves distributed training efficiency
29. Ring All-Reduce: Worker (Ring All-Reduce)
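For readers who have not seen ring all-reduce spelled out, the sketch below simulates it in plain Python: a reduce-scatter phase followed by an all-gather phase, each taking n - 1 steps around the ring. It is a single-process illustration, not a networking implementation; it assumes every worker holds a list of equal length divisible by the number of workers, and the `sent` counter is added only to show how much each worker transmits.

```python
def ring_all_reduce(worker_data):
    """Simulate a sum all-reduce over a ring.

    Returns (results, sent): results[i] is worker i's final buffer,
    sent[i] is the number of elements worker i transmitted.
    """
    n = len(worker_data)
    chunk = len(worker_data[0]) // n
    data = [list(d) for d in worker_data]
    sent = [0] * n

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: at step s, worker i sends chunk (i - s) mod n
    # to its right neighbour, which adds it into its own copy.
    for step in range(n - 1):
        # Snapshot outgoing chunks first: all sends in a step are simultaneous.
        outgoing = [((i - step) % n, data[i][span((i - step) % n)]) for i in range(n)]
        for i, (c, payload) in enumerate(outgoing):
            dst = (i + 1) % n
            for j, value in enumerate(payload):
                data[dst][c * chunk + j] += value
            sent[i] += chunk

    # Now worker i holds the fully reduced chunk (i + 1) mod n.
    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        outgoing = [((i + 1 - step) % n, data[i][span((i + 1 - step) % n)]) for i in range(n)]
        for i, (c, payload) in enumerate(outgoing):
            data[(i + 1) % n][span(c)] = payload
            sent[i] += chunk

    return data, sent

gradients = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
results, sent = ring_all_reduce(gradients)
```

Each worker transmits 2(n - 1) chunks, that is 2(n - 1)/n times its buffer size, which is the figure the Reduction Server slides compare against.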
30. GPUs on GCP
31. Parameter Server ● Heterogeneous ● GPU workers + CPU server ● Push gradient ● Pull parameter
32. Revisiting the Parameter Server architecture. Each worker transfers over the network only as much data as its input. Reduction Servers: high-bandwidth, low-cost, CPU-only VMs dedicated to reduction. Higher perf/TCO: trading extra CPU cost for higher performance
33. Adding 20 reduction server nodes increases training throughput by 75%; reduced cost per step ... additional nodes. Benchmark: TensorFlow Model Garden, BERT-large MNLI finetune. Workers: a2-highgpu-8 (NVIDIA A100) x 8. Reducers: n1-highcpu-16 x 20. https://cloud.google.com/blog/topics/developers-practitioners/optimize-training-performance-reduction-server-vertex-ai
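The bandwidth argument behind the Reduction Server slides can be put into a back-of-the-envelope model. The sketch assumes the standard ring all-reduce cost of 2(n - 1)/n times the gradient size sent per worker, and reads the slide's "same amount of input data" as one gradient-sized push per worker to the reducers. The 1 GB gradient size is a made-up number; only the worker count of 8 comes from the benchmark configuration on the slide.

```python
def ring_bytes_sent_per_worker(grad_bytes, n_workers):
    """Bytes each worker sends in ring all-reduce (reduce-scatter plus all-gather)."""
    return 2 * (n_workers - 1) / n_workers * grad_bytes

def reduction_server_bytes_sent_per_worker(grad_bytes):
    """Bytes each worker sends when it pushes its gradient once to the reducers."""
    return grad_bytes

grad_bytes = 1_000_000_000  # hypothetical 1 GB of gradients per step
n_workers = 8               # worker count from the benchmark slide

ring = ring_bytes_sent_per_worker(grad_bytes, n_workers)
reducer = reduction_server_bytes_sent_per_worker(grad_bytes)
ratio = ring / reducer
```

Under this model the ratio approaches 2 as the number of workers grows, which is one way to read the claim that cheap CPU-only reducers can pay for themselves.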
34. 8. Versatile: customers such as Twitter and Spotify span many industries
35. source link: https://www.youtube.com/watch?v=N9ufw8uP_8s
36.
37. source: https://engineering.atspotify.com/2022/03/introducing-natural-language-search-for-podcast-episodes/
38. 9. The Nine Swords of Dugu (独孤九剑): covering the full AI/ML lifecycle
39. AI Accelerators
40.
41. Flexible tools for collaboration across all levels of technical expertise
42. 10. Building on the past, opening the future: JAX and Pathways define the next generation of frameworks and platforms
43. What is JAX
    JAX is an extensible system for composable function transformations of Python+NumPy code.

    import jax.numpy as np
    from jax import jit, grad, vmap

    # params is a list of (W, b) pairs, one per layer
    def predict(params, inputs):
        for W, b in params:
            outputs = np.dot(inputs, W) + b
            inputs = np.tanh(outputs)
        return outputs

    # sum-of-squares loss over a batch of (inputs, targets)
    def loss(params, batch):
        inputs, targets = batch
        preds = predict(params, inputs)
        return np.sum((preds - targets) ** 2)

    # compiled gradient of the loss with respect to params
    gradient_fun = jit(grad(loss))
    # compiled per-example gradients: params shared, batch mapped over axis 0
    perexample_grads = jit(vmap(grad(loss), (None, 0)))
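The slide defines the transformed functions but never calls them. The following sketch runs the same `predict` and `loss` on small random inputs to show what `jit(grad(...))` and `jit(vmap(grad(...)))` return. The layer sizes, batch size, and random seed are arbitrary choices for illustration, and the module alias `jnp` is used in place of the slide's `np`.

```python
import jax
import jax.numpy as jnp
from jax import grad, jit, vmap

def predict(params, inputs):
    for W, b in params:
        outputs = jnp.dot(inputs, W) + b
        inputs = jnp.tanh(outputs)
    return outputs

def loss(params, batch):
    inputs, targets = batch
    preds = predict(params, inputs)
    return jnp.sum((preds - targets) ** 2)

# Arbitrary toy problem: 3 input features, one hidden layer of 4 units, 2 outputs.
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
params = [
    (jax.random.normal(k1, (3, 4)), jnp.zeros(4)),
    (jax.random.normal(k2, (4, 2)), jnp.zeros(2)),
]
batch = (jax.random.normal(k3, (5, 3)), jax.random.normal(k4, (5, 2)))

# Gradient of the summed loss: same structure and shapes as params.
batch_grads = jit(grad(loss))(params, batch)

# Per-example gradients: params are shared (None), the batch is mapped over axis 0,
# so every leaf gains a leading axis of size 5.
per_example = jit(vmap(grad(loss), in_axes=(None, 0)))(params, batch)
```

Because the loss is a plain sum over examples, summing the per-example gradients over their leading axis should reproduce the batch gradient up to floating-point error, which is a handy sanity check when composing transformations.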
44.
45. Model Parallelism. Figure panels: data-parallel split (the model replicated on gpu:0 to gpu:7, gradient update via All Reduce); spatial partitioning; model decomposition (gpu:0 to gpu:3). Model code needs to be model-parallel aware; difficult to implement. Pro tip #1: scale up before out (A100). Pro tip #2: use reduced precision (FP16, TF32, BF16)
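To make "reduced precision" concrete, the sketch below rounds a value to bfloat16 using only the standard library. bfloat16 keeps the sign and 8-bit exponent of float32 and only the top 7 mantissa bits, so conversion amounts to rounding away the low 16 bits. The helper name is hypothetical, and NaN inputs are not handled.

```python
import struct

def to_bfloat16(x):
    """Round x to the nearest bfloat16 value (ties to even), returned as a Python float.

    NaN inputs are not handled in this sketch.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits about to be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

pi_bf16 = to_bfloat16(3.14159)
tie_down = to_bfloat16(1.0 + 2 ** -8)    # halfway case, rounds to the even neighbour 1.0
tie_up = to_bfloat16(1.0 + 3 * 2 ** -8)  # halfway case, rounds to the even neighbour 1.015625
```

The result keeps float32's dynamic range but only about two to three significant decimal digits, which is the trade-off the pro tip is pointing at.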
46. GPipe: Pipeline Parallelism. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism: arXiv 1811.06965. https://github.com/tensorflow/lingvo/blob/master/lingvo/core/gpipe.p
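Pipeline parallelism splits the model into K sequential stages and the mini-batch into M micro-batches, so that different stages work on different micro-batches at the same time. The sketch below builds the forward-pass schedule and measures the idle "bubble" fraction. It is a simplified model with one time step per stage per micro-batch and no backward pass; the function names are made up for illustration and do not come from the Lingvo implementation.

```python
def gpipe_forward_schedule(num_stages, num_microbatches):
    """schedule[t][k] is the micro-batch stage k processes at time step t, or None if idle."""
    total_steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(total_steps):
        row = []
        for k in range(num_stages):
            m = t - k  # stage k sees micro-batch m exactly k steps after stage 0 did
            row.append(m if 0 <= m < num_microbatches else None)
        schedule.append(row)
    return schedule

def bubble_fraction(schedule):
    """Fraction of (time step, stage) slots in which a stage sits idle."""
    slots = [cell for row in schedule for cell in row]
    return slots.count(None) / len(slots)

schedule = gpipe_forward_schedule(num_stages=4, num_microbatches=8)
idle = bubble_fraction(schedule)
```

In this model the idle fraction equals (K - 1) / (M + K - 1), the bubble overhead given in the GPipe paper, so increasing the number of micro-batches relative to the number of stages shrinks the bubble.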
47. GShard/GSPMD
48.
49.
50.
51.
