实际业务场景,用户通常会使用train_and_evaluate模式,在跑训练任务的过程中同时评估模型效果。上了Booster架构后,由于训练跑的太快,导致Evaluate速度跟不上训练正常产出Checkpoint的速度。我们在GPU训练架构的基础上,支持了Evaluate on GPU的能力,业务可以申请一颗A100 GPU专门用来做Evaluate,单颗GPU做Evaluate的速度是CPU场景下单个Evaluate进程的40倍。同时,我们也支持了Predict on GPU的能力,单机八卡Predict的速度是同等成本下CPU的3倍。此外,我们在任务资源配置上也提供了比较完善的选项。在单机八卡(A100单台机器至多配置8张卡)的基础上,我们支持了单机单卡、双卡、四卡任务,并打通了单机单卡/双卡/四卡/八卡/CPU PS架构的Checkpoint,使得用户能够在这几种训练模式间自由切换、断点续训,方便用户选择合理的资源类型、资源量跑实验,同时业务也能够从已有模型的Checkpoint来WarmStart训练新的模型。
[1] https://tech.meituan.com/2021/12/09/meituan-tensorflow-in-recommender-systems.html[2] https://images.nvidia.cn/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf[3] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf[4]https://github.com/NVIDIA-Merlin/HugeCTR[5] https://en.wikipedia.org/wiki/Nvidia_Tesla[6] https://www.nvidia.com/en-us/data-center/dgx-a100[7] https://github.com/horovod/horovod[8] https://github.com/NVIDIA/nccl[9] https://www.tensorflow.org/xla[10] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In NIPS, pp. 598–605. Morgan Kaufmann, 1989.[11] Kenji Suzuki, Isao Horiba, and Noboru Sugie. A simple neural network pruning algorithm with application to filter synthesis. Neural Process. Lett., 13(1):43–53, 2001.[12] Suraj Srinivas and R. Venkatesh Babu. Data-free parameter pruning for deep neural networks. In BMVC, pp. 31.1–31.12. BMVA Press, 2015.[13] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.[14] Hao-Jun Michael Shi, Dheevatsa Mudigere, Maxim Naumov, and Jiyan Yang. Compositional embeddings using complementary partitions for memory-efficient recommendation systems. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 165-175. 2020.[15] https://mp.weixin.qq.com/s/fOA_u3TYeSwAeI6C9QW8Yw[16] Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems. arXiv preprint arXiv:1803.05170 (2018).[17] Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang. Autoint: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161-1170. 2019.[18] Guorui Zhou, Weijie Bian, Kailun Wu, Lejian Ren, Qi Pi, Yujing Zhang, Can Xiao et al. CAN: revisiting feature co-action for click-through rate prediction. arXiv preprint arXiv:2011.05625 (2020).[19] Chun-Hao Chang, Ladislav Rampasek, and Anna Goldenberg. Dropout feature ranking for deep learning models. arXiv preprint arXiv:1712.08645 (2017).[20] Xu Ma, Pengjie Wang, Hui Zhao, Shaoguo Liu, Chuhan Zhao, Wei Lin, Kuang-Chih Lee, Jian Xu, and Bo Zheng. Towards a Better Tradeoff between Effectiveness and Efficiency in Pre-Ranking: A Learnable Feature Selection based Approach. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2036-2040. 2021.[21] Bencheng Yan, Pengjie Wang, Jinquan Liu, Wei Lin, Kuang-Chih Lee, Jian Xu, and Bo Zheng. Binary Code based Hash Embedding for Web-scale Applications. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 3563-3567. 2021.[22] Xiangyu Zhao, Haochen Liu, Hui Liu, Jiliang Tang, Weiwei Guo, Jun Shi, Sida Wang, Huiji Gao, and Bo Long. Autodim: Field-aware embedding dimension searchin recommender systems. In Proceedings of the Web Conference 2021, pp. 3015-3022. 2021.[23] Bencheng Yan, Pengjie Wang, Kai Zhang, Wei Lin, Kuang-Chih Lee, Jian Xu, and Bo Zheng. Learning Effective and Efficient Embedding via an Adaptively-Masked Twins-based Layer. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pp. 3568-3572. 2021.[24] Ting Chen, Lala Li, and Yizhou Sun. Differentiable product quantization for end-to-end embedding compression. In International Conference on Machine Learning, pp. 1617-1626. PMLR, 2020.[25] Wang-Cheng Kang, Derek Zhiyuan Cheng, Ting Chen, Xinyang Yi, Dong Lin, Lichan Hong, and Ed H. Chi. Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems. In Companion Proceedings of the Web Conference 2020, pp. 562-566. 2020.[26] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836 (2016).[27] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in neural information processing systems 30 (2017).[28] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).[29] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. Megdet: A large mini-batch object detector. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6181-6189. 2018. ---------- END ---------- 也许你还想看 | TensorFlow在推荐系统中的分布式训练优化实践 |基于TensorFlow Serving的深度学习在线预估 |使用TensorFlow训练WDL模型性能问题定位与调优