SkyPilot at Shopify: Painless Multi-Cloud GPUs
Running ML workloads at scale means dealing with GPUs. And GPUs are annoying. They're scarce, they're fragmented across clouds, and every provider has its own way of doing things. H200s here, L4s there, different APIs, different configurations. When you just want to train a model, the last thing you need is to become an expert in three different cloud consoles.
At Shopify, ML work touches almost everything, and training the models behind it takes a lot of GPUs.
We use SkyPilot for all our training workloads. It's an open-source framework that lets you define jobs in YAML and run them on whatever cloud has capacity. You say what you need (GPUs, memory, disk) and the system figures out where to put it.
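To make the "say what you need" idea concrete, here is a minimal sketch of a job spec. The field names (`name`, `resources.accelerators`, `resources.memory`, `resources.disk_size`, `run`) follow SkyPilot's task YAML format, but the values, the job name, and the `to_yaml` helper are made up for illustration. The helper only handles this simple nested shape and is not a general YAML writer.

```python
# Hypothetical job spec; keys mirror SkyPilot's task YAML, values are invented.
task = {
    "name": "train-example",
    "resources": {
        "accelerators": "H200:8",  # 8x H200 GPUs
        "memory": "256+",          # at least 256 GB of RAM
        "disk_size": 512,          # disk size in GB
    },
    "run": "python train.py",      # placeholder training command
}


def to_yaml(node, indent=0):
    """Render a nested dict of plain scalars as block-style YAML lines."""
    lines = []
    pad = " " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 2))
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


task_yaml = "\n".join(to_yaml(task))
print(task_yaml)
```

Saved to a file such as `task.yaml`, a spec like this is the kind of thing you would hand to `sky launch task.yaml`. Nothing in it names a cloud; that choice is left to the system.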
SkyPilot comes equipped with many wonderful features, but we needed to make it work for an organization like ours. That meant extending it to support multi-team management, cost tracking, fair scheduling, the whole thing.
The architecture
We run persistent Kubernetes clusters on multiple clouds and use SkyPilot purely as a launcher. It doesn't provision infrastructure for us (although it could); it schedules jobs onto clusters we already manage. Think of it as a smart scheduler that picks the target cluster based on what you're asking for.
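The "smart scheduler" idea can be sketched as a simple matching step. This is a toy illustration, not SkyPilot's actual placement logic: the `Cluster` and `JobRequest` types, the cluster names, the free-GPU counts, and the "most free GPUs wins" rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cluster:
    name: str       # hypothetical cluster identifier
    cloud: str
    gpu_type: str
    free_gpus: int


@dataclass
class JobRequest:
    gpu_type: str
    gpu_count: int


def pick_cluster(job: JobRequest, clusters: list[Cluster]) -> Optional[Cluster]:
    """Return the matching cluster with the most free GPUs, or None."""
    candidates = [
        c for c in clusters
        if c.gpu_type == job.gpu_type and c.free_gpus >= job.gpu_count
    ]
    return max(candidates, key=lambda c: c.free_gpus, default=None)


# Invented inventory: which cloud hosts which GPU type is an assumption.
clusters = [
    Cluster("nebius-h200", "nebius", "H200", free_gpus=16),
    Cluster("gcp-l4", "gcp", "L4", free_gpus=4),
]

chosen = pick_cluster(JobRequest("H200", 8), clusters)
print(chosen.name if chosen else "no capacity")
```

The point is that the user only describes the request; the mapping from request to cluster lives in one place, so adding a cluster does not change how jobs are written.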
Our data never leaves our control. Training datasets live in storage we own, replicated across clouds. When we train on Nebius, for instance, data comes from volumes within that environment. When we run on GCP, same story. Jobs run where the data already is.
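One way to picture "jobs run where the data already is" is a lookup from the cloud a job landed on to the replica inside that cloud. The mount paths and the `local_dataset_path` helper below are hypothetical; the sketch only shows the rule that a job reads from its own environment and never reaches across clouds.

```python
# Hypothetical mount points for replicas of one dataset; real paths will differ.
DATASET_REPLICAS = {
    "nebius": "/mnt/nebius-volume/datasets/example",
    "gcp": "/mnt/gcp-volume/datasets/example",
}


def local_dataset_path(cloud: str) -> str:
    """Return the replica inside the given cloud, refusing cross-cloud reads."""
    try:
        return DATASET_REPLICAS[cloud]
    except KeyError:
        raise LookupError(f"no local replica of the dataset in {cloud!r}") from None


print(local_dataset_path("gcp"))
```

Failing loudly when no local replica exists keeps a misplaced job from quietly pulling data out of another cloud.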