Big Savings on Big Data
By Anindya Saha & Han Wang
Image by DALL·E
Motivation
In previous articles, we talked about the ML Platform of Lyft, LyftLearn, which manages ML model training as well as batch predictions. With the amount of data Lyft has to process, it’s natural that the cost of operating the platform is very high.
When we talked about how we democratized distributed compute, we described a solution with some key design principles such as fast iterations, ease of use, and enforcing good practices.
In early 2022, we completed this migration. Now is a good time to evaluate the impact of the design decisions over the last two years, in both increasing developer productivity and lowering cost.
Key Metrics
In this article, we define each run as executing a data/ML task using an ephemeral Spark/Ray cluster. The time and cost of runs are measured by their ephemeral Spark/Ray usage.
Runs are the way to use the LyftLearn big data system in both development and production. There are two main use cases in the development environment: running ad-hoc tasks and iterating in order to create a production workflow.
We will compare the metrics of runs between 2021 and 2022 in development (dev) and production (prod).
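As a rough illustration of how such a comparison could be tabulated, here is a minimal Python sketch. The `Run` record, its field names, and the `summarize` helper are hypothetical and not part of LyftLearn; the sample values are made up purely to exercise the code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One data/ML task executed on an ephemeral Spark/Ray cluster (hypothetical schema)."""
    year: int
    env: str              # "dev" or "prod"
    cluster_hours: float  # time the ephemeral cluster was in use
    cost_usd: float       # cost attributed to that cluster usage


def summarize(runs):
    """Aggregate run count, cluster hours, and cost per (year, env)."""
    totals = defaultdict(lambda: {"runs": 0, "cluster_hours": 0.0, "cost_usd": 0.0})
    for run in runs:
        bucket = totals[(run.year, run.env)]
        bucket["runs"] += 1
        bucket["cluster_hours"] += run.cluster_hours
        bucket["cost_usd"] += run.cost_usd
    return dict(totals)


# Made-up records purely for illustration; these are not Lyft's numbers.
example_runs = [
    Run(2021, "dev", 0.5, 2.0),
    Run(2021, "prod", 1.0, 4.0),
    Run(2022, "dev", 0.25, 1.0),
    Run(2022, "prod", 0.5, 2.0),
    Run(2022, "prod", 0.5, 2.0),
]
summary = summarize(example_runs)
```

Keying the totals by `(year, env)` keeps the dev and prod figures separate, so each pair of years can be compared per environment.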
Productivity
In 2022, we had a huge increase in production usage.
Total number of runs (%) in production and development
The total number of runs increased 286%, and prod runs increased 424%. In later sections, we will explain why the increase is not proportional between dev and prod.
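To make the percentages concrete, the short Python sketch below converts a percentage increase into a growth multiplier. The helper names are hypothetical. Only the 286% and 424% figures come from the article, and the underlying run counts are not given.

```python
def percent_increase(before, after):
    """Percentage increase from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) * 100.0 / before


def growth_multiplier(pct_increase):
    """Convert a percentage increase into a multiplier on the original value."""
    return 1.0 + pct_increase / 100.0


# Figures quoted in the article: +286% total runs, +424% prod runs.
total_multiplier = growth_multiplier(286)
prod_multiplier = growth_multiplier(424)
```

An increase of 286% therefore means the 2022 total was roughly 3.86 times the 2021 total, and a 424% increase means prod runs were roughly 5.24 times their 2021 level.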
We also boosted users’ d...