LyftLearn Evolution: Rethinking ML Platform Architecture
Written by Yaroslav Yatsiuk
At Lyft, machine learning (ML) is the engine behind our most critical business functions — from dispatch and pricing optimization to fraud detection and support automation. Our ML infrastructure serves thousands of production models making hundreds of millions of real-time predictions per day, supported by thousands of daily training jobs that keep ML models fresh and accurate.
As our scale grew, we faced a classic engineering challenge: the very complexity that powered our platform was becoming a bottleneck to its future growth. We needed to answer a fundamental question: How could we evolve our platform to accelerate innovation for our users while simplifying its underlying architecture?
This post explores how we rethought LyftLearn’s architecture to solve this problem. We’ll walk through our transition from a fully Kubernetes-based system to a hybrid platform, combining the simplicity of managed compute on AWS SageMaker for offline workloads with the flexibility of Kubernetes for online model serving. Afterwards, we’ll share the key technical decisions and trade-offs that made this evolution possible.
LyftLearn Overview
LyftLearn is Lyft’s end-to-end machine learning platform, managing the complete ML lifecycle from model development to production serving. Built to support hundreds of data scientists and ML engineers, it handles the full spectrum of ML workloads at scale. The platform is composed of three integrated products:
Figure 1: LyftLearn Components
LyftLearn Compute (Offline Stack) handles model development and training workloads. ML Modelers use JupyterLab environments to prototype models, then run training jobs, batch processing, and hyperparameter optimization at scale. These workloads are elastic and on-demand — they spin up when needed, process large datasets, and terminate when complete.
LyftLearn Serving (Online Stack) powers production inference, serving millions of predictions per minute with millisecond latency. It provides online model serving with real-time ML capabilities, automated deployment and promotion workflows, and online validation to ensure model quality before production traffic.
LyftLearn Observability monitors model health and detects degradation across the platform. It tracks performance drift, identifies anomalies, scores model health, and monitors model activity to ensure production models maintain quality as data and business conditions evolve.
While all three components work together to provide a unified ML platform, the offline and online stacks have fundamentally different operational characteristics. Offline workloads need elastic, cost-efficient compute that scales to zero between jobs. Online model serving requires always-on infrastructure with strict latency guarantees and tight operational control. These differences led us to adopt different infrastructure strategies for each — and it’s the evolution of our offline stack that transformed how we deliver LyftLearn Compute today.
The Original Architecture
The original offline stack (LyftLearn Compute) ran entirely on Kubernetes — every training job, batch prediction, hyperparameter optimization run, and JupyterLab notebook environment executed as a Kubernetes workload, orchestrated through a collection of custom-built services. We documented this architecture in detail in our 2021 blog post, LyftLearn: ML Model Training Infrastructure built on Kubernetes.
The following diagram shows a high-level view of the LyftLearn Compute 1.0 architecture:
Figure 2: LyftLearn Compute 1.0 high-level architecture
To understand the operational complexity, let’s look at some of the key components and how they worked together:
LyftLearn Service served as the backend API, receiving requests from three primary sources: the LyftLearn UI for ad-hoc jobs, Airflow DAGs for scheduled training and batch prediction pipelines, and CI/CD pipelines that registered models along with their Docker images during deployments. It managed model configurations, job metadata, and coordinated with downstream services.
K8s Orchestration Service translated job requests into Kubernetes resources. When LyftLearn Service called it to create a training job, it would (see the sketch after this list):
- insert the job record in LyftLearn DB (so watchers could track it)
- construct the Kubernetes Job specification with containers, resource requests, environment variables, sidecars, references to docker images in AWS Elastic Container Registry (ECR), and other K8s resources
- submit the job to the Kubernetes cluster
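A minimal sketch of that flow, using the official Kubernetes Python client, might look like the following. The namespace, image handling, and the commented-out database helper are illustrative placeholders rather than Lyft's actual code:

```python
from kubernetes import client, config


def submit_training_job(job_id: str, image: str, cpus: str, memory: str) -> None:
    config.load_incluster_config()  # the orchestration service ran inside the cluster

    # 1. Persist the job record so the background watchers can track it (placeholder).
    # record_job(job_id, status="SUBMITTED")

    # 2. Construct the Kubernetes Job specification.
    container = client.V1Container(
        name="trainer",
        image=image,  # e.g. an image previously pushed to ECR by CI/CD
        resources=client.V1ResourceRequirements(
            requests={"cpu": cpus, "memory": memory},
        ),
        env=[client.V1EnvVar(name="JOB_ID", value=job_id)],
    )
    pod_spec = client.V1PodSpec(containers=[container], restart_policy="Never")
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=f"training-{job_id}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(spec=pod_spec),
            backoff_limit=2,
        ),
    )

    # 3. Submit the Job to the cluster.
    client.BatchV1Api().create_namespaced_job(namespace="lyftlearn", body=job)
```

In practice the real specification also carried sidecars, volumes, ConfigMaps, and RBAC resources, which is exactly the low-level surface area described below.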
Background Watchers ran continuously to manage job lifecycles and infrastructure. We maintained multiple worker scripts handling different responsibilities (a simplified example of one watcher follows this list):
- job status watcher (monitoring job state transitions and timing)
- container status watcher (tracking individual container states)
- ingress status watcher (managing notebook endpoint URLs)
- job cleanup watcher (removing completed jobs from Kubernetes)
- analytics event watcher (capturing usage events)
- additional scripts for EFS cleanup, spending tracking, and stats publishing
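For illustration, a single watcher of this kind could look roughly like the sketch below, with `update_job_status` standing in for the write-back to the LyftLearn DB:

```python
from kubernetes import client, config, watch


def update_job_status(name: str, state: str) -> None:
    """Placeholder for the write-back to the LyftLearn DB."""
    print(f"{name} -> {state}")


def watch_job_status(namespace: str = "lyftlearn") -> None:
    config.load_incluster_config()
    batch_api = client.BatchV1Api()

    # Stream Job events from the Kubernetes API and mirror state transitions into
    # the platform database. Timeouts, out-of-order events, and sidecar-induced
    # failures all had to be reconciled in code like this.
    for event in watch.Watch().stream(batch_api.list_namespaced_job, namespace=namespace):
        job = event["object"]
        if job.status.succeeded:
            update_job_status(job.metadata.name, "SUCCEEDED")
        elif job.status.failed:
            update_job_status(job.metadata.name, "FAILED")
```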
Creating any job meant assembling a complete set of Kubernetes resources: Pod specifications with init and sidecar containers for secrets and metrics, ConfigMaps for hyperparameters, Secrets for credentials, PersistentVolumeClaims for notebook storage, Services and Ingresses for network access, and role-based access control (RBAC) policies (ServiceAccounts, Roles, RoleBindings) for cluster permissions. In essence, we owned the entire operational lifecycle — from scheduling and retries to cleanup and low-level resource management.
The Kubernetes-based architecture successfully powered production ML workloads for years and delivered some real technical advantages, including:
Unified Infrastructure Stack
ML workloads ran on the same Kubernetes infrastructure as Lyft’s production services, using the same networking stack, observability tooling, security patterns, and operational processes. This meant the platform team leveraged existing infrastructure expertise and tooling rather than maintaining separate systems for ML workloads.
Fast Job Startup
Jobs could launch in as little as 30–45 seconds on existing K8s cluster infrastructure. Unlike on-demand compute provisioning, which requires waiting for instances to start and initialize, jobs were scheduled immediately onto available nodes with cached images, making the approach particularly effective for frequently running training jobs and batch processing workflows.
Flexible Resource Specifications
Engineers could request any CPU/memory combination their workload needed. Memory-intensive preprocessing jobs could request 16 CPUs with 512GB RAM, while CPU-intensive training jobs used 64 CPUs with 128GB RAM. These ratios didn’t map cleanly to fixed AWS instance types, so this flexibility allowed precise resource allocation based on workload needs.
This architecture served hundreds of engineers running thousands of daily jobs that powered business-critical ML workflows. However, as Lyft’s scale grew, so did the operational complexity of managing such a system.
We identified several key challenges that were consuming an increasing amount of our focus:
The Feature Tax
Every new capability we added to the platform, from distributed hyperparameter optimization using Katib/Vizier to distributed training with Kubeflow operators, required building, deploying, and maintaining a corresponding set of custom Kubernetes orchestration logic. While this approach gave us maximum control, it also meant that a significant portion of our development cycle was dedicated to building and managing the infrastructure for each new feature, rather than the feature itself.
Managing State in a Distributed System
To keep our platform’s database synchronized with the cluster state, we relied on background watcher scripts that continuously monitored Kubernetes events for job status changes, container updates, and ingress resource availability.
The eventually-consistent nature of Kubernetes created operational complexity. Training containers could succeed while Kubernetes marked jobs as failed due to sidecar issues. Event streams would timeout or arrive out of order. Container statuses could transition between states as different watchers processed conflicting events. We developed sophisticated synchronization checks and logic to handle these cases, but managing state consistency for thousands of daily jobs required considerable on-call attention and directly impacted our development velocity.
Kubernetes Cluster Management
A persistent challenge in managing a large-scale ML compute platform is optimizing resource utilization for heterogeneous workloads. ML jobs often have distinct phases with conflicting resource profiles: data processing tends to be memory-intensive, while model training is often CPU- or GPU-intensive. This created a complex optimization puzzle, making it challenging to maximize node utilization across the cluster.
As the platform grew, we also had to proactively manage resource contention during bursts of highly parallel workloads. Ensuring that the cluster autoscaler could provision capacity quickly enough to prevent job queuing for critical workflows required careful planning and continuous management.
The pattern was clear: as the platform scaled, so did the operational investment required to manage its low-level infrastructure. To continue innovating for our users, we needed to abstract away this underlying complexity and refocus our efforts on what mattered most: building new platform capabilities, optimizing ML workflows, and accelerating the entire ML development lifecycle.
The Journey to LyftLearn 2.0
The growing operational complexity of our Kubernetes stack was limiting our development velocity. This reality pushed us to explore how we could simplify operations while delivering more powerful capabilities to our users. We began evaluating managed solutions to abstract this infrastructure complexity, which led us to a deep evaluation of AWS SageMaker.
We evaluated SageMaker across both our online (LyftLearn Serving) and offline (LyftLearn Compute) stacks.
For LyftLearn Serving, adopting SageMaker would have required a fundamental re-architecture of our core workflows. Our model deployment, promotion, and serving solutions were deeply integrated with Lyft’s internal tooling. Observability relied on our standard monitoring infrastructure, not on AWS CloudWatch. Client services communicated via Envoy, not via SageMaker’s specific invocation and authentication patterns.
Our analysis confirmed that the existing Kubernetes-based stack was exceptionally reliable and efficient, performing well within our latency requirements. We determined the right path forward was to retain our existing, battle-tested model serving infrastructure.
For LyftLearn Compute, the evaluation pointed in a different direction. This was where our greatest operational complexity lived: managing eventually-consistent job states, optimizing cluster capacity for heterogeneous workloads, and building custom Kubernetes orchestration for new ML capabilities.
SageMaker’s managed infrastructure would address these challenges directly. It offered out-of-the-box support for a variety of job types, which would allow us to stop building and maintaining low-level orchestration logic. Its native state management would eliminate the need for our custom watcher system, and its elastic compute model would handle capacity automatically, removing the need for complex cluster planning and autoscaling management.
While SageMaker’s per-instance costs were higher, the Total Cost of Ownership (TCO) was clearly lower. By eliminating idle compute, cluster administration overhead, and the constant infrastructure firefighting, the economics of a managed service made sense.
The evaluation led to a clear strategy: adopt SageMaker for LyftLearn Compute, where we had the greatest opportunity to reduce operational complexity, and retain Kubernetes for LyftLearn Serving, where our existing solution was already highly reliable and efficient.
The diagram below provides a high-level, conceptual view of how we wanted to transform the offline stack:
Figure 3: LyftLearn Compute Evolution Plan
On the left, the original architecture: LyftLearn Service sent job requests to a K8s Orchestration Service, which constructed Kubernetes Job specifications and submitted them to the Kubernetes API. This orchestration service was complex — it managed pod configurations, resource allocations, volumes, and all the low-level details of Kubernetes jobs. Background watchers continuously polled the Kubernetes API for events — job completions, container status changes, resource updates — and wrote those updates back to the LyftLearn database. The compute layer ran on Lyft-managed Kubernetes clusters.
On the right, the new architecture: Under the hood, this is a significantly simpler solution. LyftLearn Service interacts with a lean SageMaker Manager Service that only makes AWS SDK calls — it doesn’t manage any low-level infrastructure. We replaced the fleet of problematic background watchers with a single, reliable SQS consumer that processes status updates pushed from EventBridge. The heavy lifting of orchestration and state management is delegated to AWS. The goal was simplification without losing power.
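Conceptually, the manager's job-submission path reduces to a single SDK call. The sketch below illustrates this with boto3's `create_training_job`; the role ARN, bucket, image, and instance values are placeholders rather than our production configuration:

```python
import boto3

sagemaker = boto3.client("sagemaker")


def start_training_job(job_id: str, image_uri: str, hyperparams_s3_uri: str) -> None:
    sagemaker.create_training_job(
        TrainingJobName=f"training-{job_id}",
        AlgorithmSpecification={
            "TrainingImage": image_uri,  # the same ECR image registered by CI/CD
            "TrainingInputMode": "File",
        },
        RoleArn="arn:aws:iam::123456789012:role/lyftlearn-training",  # placeholder
        InputDataConfig=[{
            "ChannelName": "config",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": hyperparams_s3_uri,  # hyperparameters staged in S3 before the job
            }},
        }],
        OutputDataConfig={"S3OutputPath": "s3://lyftlearn-artifacts/models/"},
        ResourceConfig={
            "InstanceType": "ml.m5.4xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 100,
        },
        StoppingCondition={"MaxRuntimeInSeconds": 4 * 3600},
    )
```

Everything that used to live in pod specs, sidecars, and watcher scripts is either a field of this request or handled by SageMaker itself.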
It looks simple on a diagram, but making this transition without disrupting critical ML workflows and hundreds of users was a significant engineering challenge. The following sections detail some of the most difficult technical challenges we solved to make this migration possible.
Our core principle for the migration was to replace the execution engine — Kubernetes to SageMaker — while keeping our ML workflows completely unchanged. The actual ML code — the Python scripts that train models, process data, and run inference — had to work identically on both platforms. No modifications to model training logic, no changes to data preprocessing, no updates to inference code.
Forcing hundreds of users across dozens of teams to rewrite their business-critical ML workflows was not an option. The cost of such a disruption in terms of lost productivity and engineering effort would have made the migration untenable, which meant the burden of compatibility was entirely on our platform. The requirement of zero code changes transformed the project into a complex systems engineering challenge for the ML Platform team, but it was a necessary one. The real task wasn’t just running a container on a different platform — it was ensuring environmental parity.
During the transition, we worked through numerous challenges across the stack. Here are a few of the most complex ones we solved to make this possible.
Replicating the Kubernetes Runtime Environment
Our Kubernetes environment provided automatic credential injection via webhooks, metrics collection through sidecars, and configuration management via ConfigMaps. SageMaker offered none of these primitives. We built a compatibility layer into cross-platform base Docker images to replicate this behavior (a sketch of this shim follows the list below):
- Credentials: In Kubernetes, credentials from our internal secret management solution, Confidant, were automatically injected at pod creation. SageMaker has no equivalent mechanism. We built a custom solution, as part of the container entrypoint script, that fetches credentials at job startup and exposes them exactly as Kubernetes did, ensuring user code worked identically on both platforms.
- Environment Variables: SageMaker constrains the number of environment variables passed via its API. Similar to our credential solution, we moved most environment setup to runtime, fetching additional configuration at job startup.
- Metrics: Kubernetes workloads sent StatsD metrics to sidecar containers. SageMaker has no sidecar support, so we reconfigured the runtime and networking to connect directly to our metrics aggregation gateway. The user-facing API remained unchanged.
- Hyperparameters: In Kubernetes, hyperparameters were stored in ConfigMaps and mounted as files. SageMaker’s API has much stricter size limits than K8s, making direct parameter passing impossible for our use cases. We developed a solution to upload hyperparameters to AWS S3 before each job and have SageMaker automatically download them to its standard input path. This overcame the API limitation while still using SageMaker’s native capabilities.
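The sketch below illustrates the idea behind this compatibility shim as a container entrypoint. The helper names and file paths (other than SageMaker's standard `/opt/ml` input locations) are hypothetical:

```python
import json
import os
import sys

K8S_STYLE_SECRETS_PATH = "/etc/secrets"                 # hypothetical path where K8s used to mount secrets
SAGEMAKER_CONFIG_CHANNEL = "/opt/ml/input/data/config"  # SageMaker's standard input path for a "config" channel


def fetch_confidant_secrets() -> dict:
    """Placeholder for fetching credentials from Confidant at job startup."""
    return {}


def main() -> None:
    # 1. Credentials: fetch at startup and expose them where user code already expects them.
    os.makedirs(K8S_STYLE_SECRETS_PATH, exist_ok=True)
    for name, value in fetch_confidant_secrets().items():
        with open(os.path.join(K8S_STYLE_SECRETS_PATH, name), "w") as f:
            f.write(value)

    # 2. Hyperparameters: SageMaker has already downloaded the S3-staged file into the
    #    "config" channel; expose the values as environment variables so user code sees
    #    the same interface it had on Kubernetes.
    hp_file = os.path.join(SAGEMAKER_CONFIG_CHANNEL, "hyperparameters.json")
    if os.path.exists(hp_file):
        with open(hp_file) as f:
            os.environ.update({k: str(v) for k, v in json.load(f).items()})

    # 3. Hand off to the unchanged user command (training script, batch job, etc.).
    os.execvp(sys.argv[1], sys.argv[1:])


if __name__ == "__main__":
    main()
```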
These represent only a subset of the environmental differences we systematically solved across the migration.
Building for the Hybrid Architecture
We developed new SageMaker-compatible base images to replace our old LyftLearn images. The critical design requirement was that these images must work across our entire hybrid platform: in SageMaker (for training and batch processing) and in Kubernetes (for serving). This meant the same Docker image that trained a model would also serve it, guaranteeing consistency. These base images serve as a foundation that teams extend with their own dependencies.
We built SageMaker-compatible base images with different capabilities to match our workload diversity. Here are some of the most important ones:
- LyftLearn image: For traditional ML workloads
- LyftLearn Distributed image: Adds Spark ecosystem integration for distributed processing
- LyftLearn DL image: Adds GPU support and libraries for deep learning workloads
The Spark-compatible images presented the biggest challenge. They needed to maintain full compatibility with our existing Spark infrastructure — custom wrappers, executor configurations, and JAR (Java Archive) dependencies. But they also had to run correctly in three distinct execution contexts: SageMaker Jobs, SageMaker Studio notebooks, and Model serving in K8s.
These images detect their execution environment at runtime and adapt. They automatically configure different environment variables, use different users and permissions, and set up Spark appropriately for each context, all while preserving an identical core runtime.
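A toy version of that detection logic might look like the following; the markers checked here are common signals rather than the exact ones our images use:

```python
import os


def detect_execution_context() -> str:
    """Best-effort detection of where the image is running (illustrative markers only)."""
    if os.environ.get("SM_TRAINING_ENV"):
        # Commonly set inside SageMaker training containers that use the training toolkit.
        return "sagemaker_job"
    if os.path.exists("/opt/ml/metadata/resource-metadata.json"):
        # Metadata file present in SageMaker Studio / notebook environments.
        return "sagemaker_studio"
    if os.environ.get("KUBERNETES_SERVICE_HOST"):
        # Injected into every pod by Kubernetes, so this covers model serving.
        return "kubernetes_serving"
    return "unknown"
```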
Matching Kubernetes Job Launch Times
In Kubernetes, notebooks, training, and processing jobs could start quickly because nodes were warm due to a significant percentage of cluster resources sitting idle. SageMaker provisions instances on-demand — no idle waste, but slower startup.
For JupyterLab notebooks, we adopted SOCI (Seekable Open Container Initiative) indexes. SOCI enables lazy loading: SageMaker fetches only the filesystem layers needed immediately rather than pulling entire multi-gigabyte images. This cut notebook startup times by 40–50%.
For training and batch processing jobs, SOCI wasn’t available. We optimized our Docker image sizes, which was sufficient for most of our workloads. However, this wasn’t enough for our most latency-sensitive workflows. Some models retrain every 15 minutes, making slower startup times unacceptable. For this subset of jobs, we adopted SageMaker’s warm pools, which keep instances alive between runs.
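Enabling a warm pool comes down to adding `KeepAlivePeriodInSeconds` to the `ResourceConfig` passed to `create_training_job`; the values below are illustrative:

```python
# Sketch: a warm-pool-enabled resource configuration for a latency-sensitive job.
warm_resource_config = {
    "InstanceType": "ml.m5.4xlarge",
    "InstanceCount": 1,
    "VolumeSizeInGB": 100,
    # Retain the provisioned instance for 10 minutes after the job completes, so the
    # next run with a matching configuration skips instance provisioning.
    "KeepAlivePeriodInSeconds": 600,
}
```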
These optimizations gave us Kubernetes-like startup times with fully serverless infrastructure.
Cross-Cluster Networking for Spark
Many of our ML Platform users rely heavily on the interactive Spark experience in JupyterLab notebooks. In Kubernetes, this was simple, as the driver and executors ran in the same cluster. The new architecture, however, required the Spark driver to run in a SageMaker Studio notebook while the executors remained on our EKS K8s cluster.
This hybrid model presented a major networking challenge, as shown in the diagram below. Spark client mode requires bidirectional communication:
Figure 4: Spark Networking Architecture in LyftLearn 2.0
- The driver (in SageMaker) must call the EKS API Server Endpoint to request executor pods.
- The executor pods must be able to establish inbound connections directly back to the driver’s SageMaker Instance Elastic Network Interface (ENI).
The default SageMaker Studio networking blocked these critical inbound connections, breaking Spark’s communication model. This issue was a fundamental blocker that could jeopardize the entire migration. Without a solution for interactive Spark, we could not move our users to SageMaker Studio. To resolve this, we partnered closely with the AWS team. As a result of this collaboration, they introduced networking changes to the Studio Domains in our account that enabled the required inbound traffic from our EKS cluster. Despite the cross-cluster setup, Spark performance remained the same, and the interactive experience for ML Platform users was identical to the original Kubernetes environment.
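For illustration, a client-mode Spark session configured for this cross-cluster setup could look roughly like the following, with the endpoint, image, and driver address values as placeholders:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lyftlearn-interactive")
    # The driver asks the EKS API server to launch executor pods.
    .master("k8s://https://<eks-api-server-endpoint>")
    .config("spark.kubernetes.namespace", "lyftlearn")
    .config("spark.kubernetes.container.image", "<ecr-repo>/lyftlearn-distributed:latest")
    .config("spark.executor.instances", "8")
    # Executors must connect back to the driver running in SageMaker Studio, so
    # advertise the notebook's ENI address and fixed ports for those connections.
    .config("spark.driver.host", "<sagemaker-eni-ip>")
    .config("spark.driver.port", "7078")
    .config("spark.blockManager.port", "7079")
    .getOrCreate()
)
```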
LyftLearn 2.0: The Hybrid Architecture
As a result of this architectural transformation, we arrived at the hybrid architecture we planned: SageMaker for LyftLearn Compute and Kubernetes for LyftLearn Serving.
Figure 5: Complete LyftLearn 2.0 High-Level Architecture
As the diagram illustrates, the two systems are fully decoupled, each operating as a purpose-built stack:
LyftLearn Serving runs on Kubernetes, powering a distributed architecture for real-time inference. Dozens of ML teams deploy their own model serving services — each containing their team’s models with custom prediction handlers and configurations — handling production predictions for specific use cases (pricing, fraud, dispatch, ETA, etc.). The Model Registry Service coordinates model deployments across these services. (We detailed this serving architecture in our 2023 blog post: Powering Millions of Real-Time Decisions with LyftLearn Serving.)
LyftLearn Compute runs on SageMaker, where the SageMaker Manager Service orchestrates training, batch processing, Hyperparameter Optimization (HPO), and JupyterLab notebooks through AWS SDK calls. EventBridge and SQS provide event-driven state management, replacing our background watchers.
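A condensed sketch of such a consumer is shown below, assuming an EventBridge rule forwards SageMaker training-job state-change events to the queue; the queue URL and database helper are placeholders:

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/lyftlearn-job-events"  # placeholder


def update_job_status(job_name: str, status: str) -> None:
    """Placeholder for the write-back to the LyftLearn DB."""
    print(f"{job_name} -> {status}")


def consume_forever() -> None:
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for message in resp.get("Messages", []):
            event = json.loads(message["Body"])  # EventBridge event forwarded to SQS
            if event.get("detail-type") == "SageMaker Training Job State Change":
                detail = event["detail"]
                update_job_status(detail["TrainingJobName"], detail["TrainingJobStatus"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```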
Integration happens through the Model Registry and S3. Training jobs in SageMaker generate model binaries and save them to S3. The Model Registry tracks these artifacts, and model serving services pull them for deployment. Docker images flow from CI/CD through ECR to both platforms. The LyftLearn database maintains job metadata and model configurations across both stacks.
Each LyftLearn product operates independently while maintaining seamless end-to-end ML workflows.
Putting It All Together
We rolled out changes repository by repository, running both infrastructures in parallel. Our approach was systematic: build a comprehensive compatibility layer that made SageMaker feel like Kubernetes to ML code, validate each workflow type thoroughly, then migrate teams incrementally. Each repository required minimal changes — typically updating configuration files and workflow APIs — while the actual ML code remained untouched.
For our users, the migration was nearly invisible. But behind the scenes, the operational improvements were substantial. We reduced ML training and batch processing compute costs by eliminating idle cluster resources and moving to on-demand provisioning. System reliability improved significantly, with infrastructure-related incidents becoming rare occurrences. Most importantly, this stability and the serverless nature of the new compute freed our team to focus on building platform capabilities rather than managing low-level infrastructure components.
Build versus buy is a pragmatic decision, not an ideology
We adopted SageMaker for training because managing custom batch compute infrastructure was consuming engineering capacity better spent on ML platform capabilities. We kept our serving infrastructure custom-built because it delivered the cost efficiency and control we needed. The decision wasn’t about preferring managed services or custom infrastructure — it was about choosing the right tool for each specific workload.
Abstract complexity from users
The migration succeeded because we absorbed all the complexity. Users didn’t rewrite ML code or learn SageMaker APIs — they continued their work while we handled secrets management, networking, metrics collection, and environmental parity. The platform’s job is to evolve infrastructure while preserving velocity and avoiding disruptions, not to distribute migration work across hundreds of teams.
Invest in compatibility layers
The cross-platform base images were the foundation of the migration’s success. They enabled gradual, repository-by-repository migration with easy rollbacks. Most importantly, they guaranteed that the same Docker image used to train a model in SageMaker would also serve it in Kubernetes, eliminating train-serve inconsistencies. The upfront investment in cross-platform compatibility paid dividends throughout the migration.
The best platform engineering isn’t about the technology stack you run — it’s about the complexity you hide and the velocity you unlock.
Lyft is hiring! If you’re passionate about building AI/ML platforms and applications at scale, visit Lyft Careers to see our openings.