Uber's Journey to Ray on Kubernetes: Ray Setup

Uber has taken steps to enhance and modernize its machine learning platform. As part of this effort, in early 2024, Uber migrated its machine learning workloads to Kubernetes®. This blog is the first in a two-part series that describes our experience building this new capability: how we leveraged existing open-source components, the unique problems we faced in adopting them, and the new technology we built in-house for resource management.

Machine learning workloads are typically modeled as a sequence of steps. Most of these steps, especially in model training pipelines, tend to be heavy in data processing. This is because the amount of data fed into machine learning pipelines generally correlates well with the quality of the output model (Hestness et al. 2017; Banko and Brill 2001; Goodfellow et al. 2016). For this reason, these jobs usually run in batch processing mode. Each step is modeled as a large distributed job, and together these jobs form a graph that executes the pipeline.
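
To make the "graph of jobs" idea concrete, here is a minimal sketch in Python. It is not how Uber's platform is implemented. The step names, the PIPELINE mapping, and the run_step callback are all hypothetical, chosen only to illustrate how steps with dependencies can be ordered and executed. It relies on the standard-library graphlib module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a step, and each value is the set of
# steps that must finish before it can start. The names are illustrative only.
PIPELINE = {
    "extract_features": set(),
    "build_training_set": {"extract_features"},
    "train_model": {"build_training_set"},
    "evaluate_model": {"train_model", "build_training_set"},
}


def execution_order(pipeline):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


def run_pipeline(pipeline, run_step):
    """Run each step once, after all of its dependencies have finished.

    run_step is a caller-supplied function (an assumption of this sketch)
    that receives the step name and a dict of its upstream results.
    """
    results = {}
    for step in execution_order(pipeline):
        upstream = {dep: results[dep] for dep in pipeline[step]}
        results[step] = run_step(step, upstream)
    return results


if __name__ == "__main__":
    for step in execution_order(PIPELINE):
        print(step)
```

In a real system each step would be submitted as a distributed batch job rather than called in-process, and independent steps could run in parallel. The sketch only shows the dependency ordering.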

Until mid-2023, Uber ran its machine learning workloads primarily through a job gateway service called MADLJ (Michelangelo Deep Learning Jobs service). It ran Apache Spark™-based ETL jobs and Ray®-based machine learning training jobs. While this served us well, it had some pain points:

Difficult and manual resource management. ML engineers had to be aware of the heterogeneity of the compute resource where their jobs ran. They had to figure out the region, zone, and cluster in our compute fleet best suited for a given job. This included factors like GPU availabili...