LyftLearn如何通过Kubernetes Spark和Fugue实现分布式计算的民主化

In a previous blog post, we discussed LyftLearn’s infrastructure built on top of Kubernetes. In this post, we will focus on the compute layer of LyftLearn, and will discuss how LyftLearn solves some of the major pain points faced by Lyft’s machine learning practitioners.

在之前的一篇博文中，我们讨论了LyftLearn建立在Kubernetes之上的基础设施。在这篇文章中，我们将关注LyftLearn的计算层，并将讨论LyftLearn如何解决Lyft的机器学习从业者所面临的一些主要痛点。

Efficiently utilizing computing resources

有效地利用计算资源

Pain points

疼痛点

There is always a tradeoff between user convenience and efficient resource utilization. When LyftLearn users need more compute power, it’s a lot easier to scale vertically (more powerful machines) as opposed to horizontally (more machines). This is less work for users, but the downside is that large machines are often only needed for a portion of the workload. The rest of the time, the utilization is low and huge costs are incurred.

在用户的便利性和有效的资源利用之间总是存在着权衡。当LyftLearn用户需要更多的计算能力时，纵向扩展（更强大的机器）要比横向扩展（更多的机器）容易得多。这对用户来说是更少的工作，但缺点是大型机器往往只需要一部分工作负荷。其余的时间，利用率很低，而且会产生巨大的成本。

For example, using 90-core EC2 instances to run Jupyter notebooks that only use 60 cores would waste 30 cores and result in a utilization rate of 67%. Since our Kubernetes cluster runs on EC2 instances, our platform team is primarily concerned with maximizing its cost-efficiency.

例如，使用90核的EC2实例来运行只使用60核的Jupyter笔记本会浪费30核，导致利用率为67%。由于我们的Kubernetes集群在EC2实例上运行，我们的平台团队主要关注其成本效率的最大化。

Solution: Kubernetes Spark

解决方案。Kubernetes Spark

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on clusters. It offers superb scalability, performance, and reliability. Using Spark with Kubernetes further reduces the overhead of starting Spark jobs through image caching. Kubernetes enables individual users to use their own images. For each cluster node, the pod architecture allows for heterogeneous Spark workers that don’t interfere with each other.

Apache Spark是一个多语言引擎，用于在集群上执行数据工程、数据科学和机器学习。它提供了极佳的可扩展性、性能和可靠性。与Kubernetes一起使用Spark，通过镜像缓存进一步减少了启动Spark工作的开销。Kubernetes使个人用户可以使用自己的镜...