Big Savings On Big Data

By Anindya Saha & Han Wang

Image by DALL·E

Motivation

In previous articles, we talked about Lyft's ML platform, LyftLearn, which manages ML model training as well as batch predictions. With the amount of data Lyft has to process, the cost of operating the platform is naturally very high.

When we talked about how we democratized distributed compute, we described a solution with some key design principles such as fast iterations, ease of use, and enforcing good practices.

In early 2022, we completed this migration. Now is a good time to evaluate the impact of those design decisions over the last two years, both in increasing developer productivity and in lowering cost.

Key Metrics

In this article, we define each run as executing a data/ML task using an ephemeral Spark/Ray cluster. The time and cost of runs are measured by their ephemeral Spark/Ray usage.

Runs are the way to use the LyftLearn big data system in both development and production. There are two main use cases in the development environment: running ad-hoc tasks and iterating in order to create a production workflow.

We will compare the metrics of runs between 2021 and 2022 in development (dev) and production (prod).

Productivity

In 2022, we had a huge increase in production usage.

Total number of runs (%) in production and development

The total number of runs increased 286%, and prod runs increased 424%. In later sections, we will explain why the increase is not proportional between dev and prod.

We also boosted users’ development speed:

Comparison of average minutes required for one run in Development vs Production

The average per-iteration time on big data (the blue bars) dropped from 31 minutes to 11 minutes. That means Lyft's ML practitioners can iterate on large problems at roughly 3x the speed compared to 2021.

Notice that the prod run time increased slightly due to new, heavier jobs. This also indicates that the large increase in prod runs is organic and not the result of breaking up existing large workloads.

Cost

More usage and faster iterations on big data commonly require more compute resources and higher cost. How much more did we spend in 2022 vs 2021?

Comparing the cost incurred in Production and Development

Surprisingly, in 2022, not only were we successful in controlling the overall cost (+6%), but we also managed to push a significant portion of development effort to production.

The total dev cost dropped 32% even though dev usage increased slightly in 2022. How did we achieve that?

Comparing cost incurred per run in the last 2 years for the development and production environments

We were able to reduce the average dev per-run cost from $25 to $10.70 (-57%). That means Lyft's ML practitioners can iterate on large problems at less than half the cost compared to 2021.

Another data point worth mentioning: the total AWS cost of LyftLearn was reduced by 6% in 2022 compared to 2021.

Reducing Compute Costs

In the previous article, we mentioned that the LyftLearn platform enforces ephemeral clusters. In the LyftLearn notebook experience, users can declare cluster resources for each step in their workflow. In the image below, a user is requesting a Spark cluster with 8 machines, each with 8 CPUs and 32 GB of RAM. The cluster is ephemeral and only exists for the duration of the SparkSQL query.
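The exact notebook widget is internal to LyftLearn, but the request described above maps to ordinary Spark resource settings. Below is a minimal, purely illustrative sketch of an equivalent configuration (the dictionary shape is an assumption, not LyftLearn's actual API):

```python
# Illustrative only: LyftLearn's real notebook interface is internal.
# These standard Spark settings correspond to "8 machines, 8 CPUs, 32 GB each".
spark_conf = {
    "spark.executor.instances": "8",   # 8 worker machines
    "spark.executor.cores": "8",       # 8 CPUs per machine
    "spark.executor.memory": "32g",    # 32 GB of RAM per machine
}
```

Because the cluster is ephemeral, these resources are requested when the query starts and released as soon as it finishes.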

Defining Spark cluster configuration

Ephemeral & Static Clusters

Using ephemeral clusters has contributed a significant portion of total savings. Managed platforms like AWS Elastic MapReduce tend to require a data scientist to spin up a cluster and then develop on top of that cluster. This leads to under-utilization (due to idling) during project iteration. Ephemeral clusters ensure users are allocated costly resources only when necessary.

It's also worth mentioning that LyftLearn discourages Spark autoscaling. Autoscaling can lead to instability or underutilization, and it is less useful when the clusters are already ephemeral. Similar patterns are discussed in this article published by Sync Computing.

The benefits of being explicit about compute resources are:

  1. Users are aware of the resources they really need for their use cases.
  2. Resource contention in the K8s clusters is reduced.

Many LyftLearn users are surprised by the short spin-up time (2–5 seconds), made possible by running Spark on Kubernetes with cached images. Ephemeral clusters also directly reduce maintenance because different steps of a workflow can be executed using different images, separating packages that conflict with each other (e.g. requiring different versions of the same dependency).

Optimal Tooling Choices

Another big part of the cost savings is choosing the most effective tool for the job. This is most evident with Presto and Hive. In this article, we shared the best practices for choosing between them:

Presto is good for aggregation and small output scenarios — it shouldn’t take more than 10 minutes. If Presto is slow, try Hive.

Hive is slower but generally more scalable. Always try to save the output to files instead of dumping it into Pandas.
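As a rough illustration of the second point, writing a large query result to Parquet keeps the data distributed instead of collecting it into a single Pandas DataFrame. This sketch assumes an existing SparkSession named spark; the table, columns, and S3 path are made up for illustration:

```python
# Sketch only: table, columns, and S3 path are hypothetical.
result = spark.sql("SELECT region, COUNT(*) AS rides FROM ride_events GROUP BY region")

# Avoid result.toPandas() for large outputs: it collects everything onto one machine.
# Prefer writing the output to files and reading them back where needed.
result.write.mode("overwrite").parquet("s3://my-bucket/ride_counts_by_region/")
```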

As more big data frameworks enter the data science landscape, we need to choose the best tool for each part of the job. An important role of the LyftLearn platform is to give data practitioners the flexibility and simplicity to make that choice for each job.

For example, some data pipelines inside Lyft leverage Spark for preprocessing and Ray for the distributed machine learning portion. This is also specifically enabled by ephemeral clusters. (Watch our Data AI Summit 2022 Talk)

Accelerating Development Iterations

Another, less tracked, form of savings is the hours saved through the operational efficiencies the LyftLearn platform provides. The large reduction in dev run time and the higher ratio of prod to dev runs directly translate into data scientists spending more time on modeling and scientific computing. More importantly, more projects make it to production to generate real business value.

Our compute abstraction layer, built on top of the open-source project Fugue, plays a key role in accelerating development iterations. It optimizes big data workstreams in three ways:

Localizing development

With a backend-agnostic design, we can develop distributed computing logic (SQL or Python) on the local backend without a cluster. Only well-tested code ends up running on clusters. This explains why the increases in prod and dev runs in 2022 were not proportional: a large portion of the iterations happened locally without using clusters.
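A minimal sketch of this pattern using Fugue's transform (the function and data are made up for illustration): the same logic is iterated on locally with Pandas and only runs on Spark once it is well tested.

```python
import pandas as pd
from fugue import transform

# Hypothetical business logic, written against plain Pandas.
def add_fare_per_mile(df: pd.DataFrame) -> pd.DataFrame:
    df["fare_per_mile"] = df["fare"] / df["miles"]
    return df

df = pd.DataFrame({"fare": [10.0, 20.0], "miles": [2.0, 5.0]})

# Develop and test locally: no cluster involved, runs on Pandas.
local_result = transform(df, add_fare_per_mile, schema="*, fare_per_mile:double")

# Once tested, the same call runs on an ephemeral Spark cluster:
# spark_result = transform(df, add_fare_per_mile,
#                          schema="*, fare_per_mile:double", engine="spark")
```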

Breaking up large workflows

This is one of the most important sources of LyftLearn savings.

Developing a complex Hive (Spark) query with hundreds of lines is one of the biggest and most common challenges for Lyft's ML practitioners. With Common Table Expression (CTE) syntax, breaking a SQL query into small, independently runnable subqueries is not practical, so iterating on such a query requires re-running the whole thing every time. Worse, when a complex query never finishes, the owner can't even tell which step caused the problem. Retrying is inefficient and incurs significant cost.

FugueSQL is a superset of traditional SQL with improved syntax and features: it does not require CTEs. Instead, its assignment syntax makes a SQL query easy to break up and recombine.

Breaking up and combining complex SQL queries using FugueSQL

In the above example, let's assume the original Hive query has unknown issues. We can rewrite it in FugueSQL and break it into multiple parts to iterate on. In the first cell, YIELD FILE caches b to a file (saved by Spark) and makes the reference available to the following cells. In the second cell, we can directly use b, which is loaded back from S3. Lastly, we can print the result to verify it. In this way we can quickly debug issues. More importantly, with caching, finished cells do not need to be re-run in subsequent iterations.

When the parts work end to end, we simply copy-paste them together and remove the YIELD. Notice we also add a PERSIST to b because it is used twice in the following steps; this explicitly tells Spark to cache the result and avoid recomputation.
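The code in the image is not reproduced here, but a rough sketch of the pattern looks like the following. The table and column names are hypothetical, the %%fsql cell magic assumes the Fugue notebook extension is set up, and exact syntax may vary by Fugue version.

```
%%fsql spark
-- Iteration cell 1: compute an intermediate result and cache it to a file
b = SELECT city, COUNT(*) AS rides FROM ride_events GROUP BY city
YIELD FILE AS b
```

```
%%fsql spark
-- Iteration cell 2: b is loaded from the yielded file; cell 1 does not re-run
SELECT * FROM b WHERE rides > 1000
PRINT
```

Once each piece works, the cells are pasted into one query, the YIELD is removed, and PERSIST is added because b is reused:

```
%%fsql spark
-- Combined for production: YIELD removed; PERSIST added because b is used twice
b = SELECT city, COUNT(*) AS rides FROM ride_events GROUP BY city PERSIST
high = SELECT * FROM b WHERE rides > 1000
SAVE high OVERWRITE "s3://my-bucket/high_volume_cities.parquet"
low = SELECT * FROM b WHERE rides <= 1000
SAVE low OVERWRITE "s3://my-bucket/low_volume_cities.parquet"
```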

FugueSQL should generate results equivalent to the original SQL, but it has significant advantages:

  1. Divide-and-conquer becomes possible for SQL, significantly speeding up iteration time on complex problems.
  2. The final FugueSQL is often faster than the original SQL (if we explicitly cache the intermediate steps to avoid recompute).

We can also easily reconstruct the traditional Hive SQL once all the problems found during iteration are fixed. The slowest and most expensive part is always the development iteration, which the Fugue approach improves.

Incremental adoption

We don't require users to modernize their entire workload in one shot. Instead, we encourage them to migrate incrementally, refactoring as needed.

There are a large number of existing workloads written with small-data tooling such as Pandas and scikit-learn. In many cases, if one step is compute intensive, users can refactor their code to separate out the core computing logic and then use a single Fugue transform call to distribute it, as sketched below.
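A rough sketch of what such a refactor can look like (the function, columns, and data are made up for illustration): the compute-intensive step becomes a plain Pandas function, and a single transform call runs it per partition.

```python
import pandas as pd
from fugue import transform

# Hypothetical compute-intensive step, separated out from a larger Pandas workflow.
# It receives one city's rows at a time and returns one summary row.
def fit_per_city(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame([{"city": df["city"].iloc[0], "rows": len(df)}])

rides = pd.DataFrame({"city": ["SFO", "SFO", "SEA"], "fare": [10.0, 12.0, 9.0]})

# One transform call distributes the step; swap in engine="spark" (or "ray")
# to run the same code on an ephemeral cluster.
result = transform(rides, fit_per_city, schema="city:str,rows:long",
                   partition={"by": "city"})
print(result)
```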

Incremental adoption is therefore also a natural way for users to adopt good coding practices and produce high-quality code that is scale-agnostic and framework (Spark, Ray, Fugue, etc.) agnostic.

Conclusion

The changes from 2021 to 2022 delivered both a productivity boost and cost savings, and the metrics above do not even include the human-hours saved by the faster development speed. Lyft's top line also benefited from the ML models that reached production with the support of the LyftLearn platform.

Developing big data projects can be very expensive in both time and money, but LyftLearn succeeded in bringing down costs by enforcing best practices, simplifying the programming model, and accelerating iterations.

As always, Lyft is hiring! If you're passionate about developing state-of-the-art systems, join our team.
