Elastic Distributed Training with XGBoost on Ray

Since we productionized distributed XGBoost on Apache Spark™ at Uber in 2017, XGBoost has powered a wide spectrum of machine learning (ML) use cases at Uber. These range from optimizing marketplace dynamic pricing policies for Freight and improving estimated time of arrival (ETA) predictions to fraud detection and prevention, and content discovery and recommendation for Uber Eats.

However, as Uber has scaled, we have started to run distributed training jobs with more data and more workers, and more complex distributed training patterns have become increasingly common. As a result, we have observed a number of challenges in running distributed machine learning and deep learning at scale:

Fault Tolerance and Elastic Training: As distributed XGBoost jobs use more data and workers, the probability of machine failures also increases. Because most distributed execution engines, including Spark, are stateless, common fault-tolerance mechanisms based on frequent checkpointing still require external orchestration and trigger data reloads on workers. This incurs significant per-worker data shuffling, serialization, and loading overheads.
Distributed Hyperparameter Search and Complex Compute Patterns: Emerging training patterns are complex and require higher-level orchestration systems to schedule and coordinate parallel, multi-node distributed training jobs with dynamic resource allocation requirements. This introduces significant resource and scheduling overhead on top of data loading and model checkpoint management costs.
Need for a unified compute ba...