Lyft’s Reinforcement Learning Platform

Jonas Timmermann
Lyft Engineering
Mar 12, 2024

A person holding a surfboard, riding a bike and skateboarding — all activities requiring (human) Reinforcement Learning

Summary

At Lyft we have built a platform for developing, training and serving Reinforcement Learning models for typical internet industry applications, with a focus on Contextual Bandits. Those models have been critical for decision making problems that other techniques such as supervised learning or optimization models struggled with. In this article we describe how we extended our existing machine learning ecosystem to support Reinforcement Learning models, how we develop models using Off-Policy Evaluation, and the lessons we learned along the way.

Introduction

Reinforcement Learning (RL) has shown great promise in research on challenging problems, from playing games to self-driving cars. However, resources on applying RL to typical business applications such as dynamic pricing or recommendations are fairly scarce. That’s why we want to share how we went about it and what worked for us.

Reinforcement Learning

At a high level, RL tests out different actions available to the model and observes feedback from the environment through a reward function. It then chooses the better performing actions for a state while maintaining some level of exploration to detect changes over time.

Diagram showing the relationship between the RL policy, environment and model update. The policy observes the state from the environment and takes an action, which results in a reward that is fed to the model update step, which in turn replaces the current policy with the updated one.
Simplified Reinforcement Learning Flow
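
To make this loop concrete, here is a tiny, self-contained epsilon-greedy sketch against a made-up environment; the actions, contexts and reward logic are invented purely for illustration.

```python
import random
from collections import defaultdict

ACTIONS = ["A", "B", "C"]
EPSILON = 0.2  # share of traffic reserved for exploration

value = defaultdict(float)  # running average reward per (context, action)
count = defaultdict(int)

def choose(context):
    # Explore with probability epsilon, otherwise exploit the best known action.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: value[(context, a)])

def update(context, action, reward):
    # Incrementally update the running mean reward for this context/action pair.
    key = (context, action)
    count[key] += 1
    value[key] += (reward - value[key]) / count[key]

for _ in range(10_000):
    context = random.choice(["weekday", "weekend"])  # observe the state
    action = choose(context)                         # policy takes an action
    best = "B" if context == "weekend" else "A"      # toy environment
    reward = 1.0 if action == best else 0.0          # observe the reward
    update(context, action, reward)                  # update the policy
```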

Applied RL can be divided into three stages of maturity:

  • Multi-Armed Bandits (MABs) identify the variant that performs best globally, without considering features. Their most popular application is efficient A/B testing that splits traffic based on each variant’s performance rather than a static assignment.
  • Contextual Bandits (CB) take context features into consideration for finding the best performing variant. For example, one option might work better during the week while another works better on the weekend, similar to a classification model.
  • Full-RL can reason over multiple steps of sequential decision making to solve more complex problems, such as finding the path through a maze.

We use MABs for A/B tests and simpler applications like testing different messaging copies. While we have built some full-RL models, the main focus of this article is on Contextual Bandits, which are in the sweet spot in terms of expressiveness and complexity for most of our applications.

Motivation

Reinforcement Learning is a powerful approach for decision making problems. Compared to supervised learning, RL does not require a fully labeled data-set for training. This is particularly beneficial for applications whose correct solution is difficult to assess. Instead, we only need to define a reward metric that the model optimizes for. The metric can be fairly high level, for example conversion or revenue.

This highlights another strength of RL models — they optimize for the whole decision making process towards a target metric in potentially changing environments. Supervised models, on the other hand, only make predictions which need to be further processed for making a decision, e.g. a threshold for a regression model or some logic for aligning with business objectives. The reward function is a great control knob to support business priorities, including growth, profits, competitive pricing, and driver pay.

Another benefit is RL’s online learning nature that allows models to incrementally update rather than having to train on the whole history. In the most extreme case, the models can be updated from every observation they make. More typically, we perform batch updates anywhere from every 10 minutes to 24 hours. This keeps models fresh and allows for tracking non-stationary distributions in the environment, like changing customer preferences, a dynamic competitive landscape or accounting for a pandemic.

Promising RL use-cases are therefore applications where no ground truth is available, where efficient exploration of the options is part of the task, or where the problem is too tricky to be sufficiently captured with a mathematical optimization model.

However, there are also downsides to the RL-based approach.

  • Most Contextual Bandit libraries use fairly simplistic linear models that lack the expressiveness of tree or neural network models.
  • Solving a problem without fully labeled ground truth data is inherently more difficult. More data is necessary to learn the problem and it is more challenging to gauge a model’s performance. Off-Policy Evaluation (OPE) that we discuss in a later section is an approach to address this but it comes with its own challenges.
  • Finally, there are few mature libraries and little guidance on best practices available, which requires a lot of trial and error exploration.

Demo

In order to validate the operation of the Contextual Bandit model and our infrastructure, we created a simple recommendation model inspired by this Vowpal Wabbit (VW) use-case. It recommends different news article categories to a user based on the time of day and changing preferences over time.

Model

The model’s action space covers four different news article categories: [“politics”, “sports”, “music”, “food”].

It considers two categorical context features, user and time of day, with two values each to make its recommendation.

The exploration algorithm is epsilon greedy with 20% exploration.
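
A minimal sketch of such a model, closely following the Vowpal Wabbit contextual bandit tutorial that inspired the demo; the exact arguments and feature names are illustrative rather than our production configuration.

```python
import vowpalwabbit

ACTIONS = ["politics", "sports", "music", "food"]

# Epsilon-greedy exploration with 20% of traffic reserved for exploration and
# quadratic interactions between the User and Action feature namespaces.
vw = vowpalwabbit.Workspace("--cb_explore_adf -q UA --epsilon 0.2 --quiet")

def to_vw_format(context, actions, label=None):
    """Translate a context and action set (plus an optional observed label)
    into VW's multi-line ADF text format."""
    lines = [f"shared |User user={context['user']} time_of_day={context['time_of_day']}"]
    for action in actions:
        prefix = ""
        if label is not None and action == label["action"]:
            # VW minimizes cost, so a reward of 1 becomes a cost of -1.
            prefix = f"0:{-label['reward']}:{label['probability']} "
        lines.append(f"{prefix}|Action article={action}")
    return "\n".join(lines)

# Scoring returns one probability per action in ACTIONS.
pmf = vw.predict(to_vw_format({"user": "Tom", "time_of_day": "morning"}, ACTIONS))
```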

Simulation

The simulation encodes the recommendation preference for a user based on the time of day. If the model selects the right arm, it receives a reward of 1, otherwise 0. In order to make the problem more challenging, the preferences change every half hour to test how well the model adapts to the shifting environment.

In a simulation cycle, the context features are randomly chosen and passed in a model scoring request. The recommendation response that is sent back from the model is validated against the current preference and in case of a match, an analytics event is emitted which will be processed in the next training cycle.

The following diagram shows an example query and the resulting analytics events that are consumed in the training job to update the model.

Diagram showing an example request from the client to the hosted model as well as its response and the events logged on both sides which are used in the policy update job
Messages between Client, Model and Training Job
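
Continuing the sketch above (reusing vw, ACTIONS and to_vw_format), one simulation cycle could look roughly like the following; the user names and preference table are made up, and in the real demo the model update happens in a separate training job rather than inline.

```python
import random

# Ground-truth best article per (user, time_of_day); in the actual demo these
# preferences flip every half hour to simulate a distribution shift.
PREFERENCES = {
    ("Tom", "morning"): "politics",
    ("Tom", "afternoon"): "music",
    ("Anna", "morning"): "sports",
    ("Anna", "afternoon"): "food",
}

def run_cycle(n=1000):
    clicks = 0
    for _ in range(n):
        context = {"user": random.choice(["Tom", "Anna"]),
                   "time_of_day": random.choice(["morning", "afternoon"])}
        pmf = vw.predict(to_vw_format(context, ACTIONS))
        # Sample the recommendation from the exploration distribution.
        action = random.choices(ACTIONS, weights=pmf, k=1)[0]
        probability = pmf[ACTIONS.index(action)]
        reward = int(PREFERENCES[(context["user"], context["time_of_day"])] == action)
        clicks += reward
        # In production the reward arrives later as an analytics event and is
        # joined back to the scoring event; here we update the model inline.
        vw.learn(to_vw_format(context, ACTIONS,
                              {"action": action, "reward": reward, "probability": probability}))
    return clicks / n  # click-through rate for this cycle
```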

Deployment

The model is hosted on our serving infrastructure, discussed in the next section, and handles requests made over the network to simulate a realistic production deployment. The incremental retraining runs every 10 minutes on the observations since the last cycle and deploys the updated model on the serving instance.

Chart showing the demo model’s click-through rate over time, annotated with training cycles and distribution shifts
Click Through Rate (Reward)

The diagram shows the click-through rate (CTR) for a slightly more complex variant with seven actions. The CTR starts off at 1/7 as all the actions are evenly explored ①. The first model training cycle runs and the updated model gets synced to the serving instance ②. Performance then climbs close to the optimum of roughly 83%: with 20% of traffic spent on uniform exploration, about six out of seven exploratory picks land on suboptimal variants, costing roughly 17% of the reward.

At 12:45 PM in the diagram, a distribution shift kicks in, tanking the model’s performance ③, and it takes until the next model-update cycle to recover ④. The same happens after the next distribution shift 30 minutes later ⑤.

The frequent, abrupt, strong distribution shifts are more of a stress test than a realistic business application. More subtle performance evaluation is discussed in the OPE section.

Extending Supervised Learning Platform

While there are some fundamental differences between RL and supervised learning, we were able to extend our existing model training and serving systems to accommodate for the new technique. The big advantage of this approach is that it allows us to leverage lots of proven platform components.

Architecture

This blog post gives a more detailed overview of our model hosting solution, LyftLearn Serving. Here we want to focus on the modifications required to support RL models which include:

  • Providing the action space for every scoring request
  • Logging the action propensities
  • Emitting business event data from the application that allows for calculation of the reward, e.g. that a recommendation was clicked on. This step can be skipped if the reward data can be inferred from application metrics that are already logged. However, if those metrics depend on an ETL pipeline, the recency of training data will be limited by that schedule.
Architecture diagram showing how models are registered to the backend, synced to the serving instance where they accept requests from the client. The logged data is fed into a data warehouse from where it’s consumed by the policy update job which registers the updated model to the backend.
RL Platform System Architecture
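
To make the first two modifications concrete, the additional payload fields might look something like the following; the field names are hypothetical and not LyftLearn Serving’s actual schema.

```python
# Hypothetical request/response/event payloads, shown as Python dicts.
scoring_request = {
    "model_id": "news_recommender",
    "features": {"user": "Tom", "time_of_day": "morning"},  # context features
    "actions": ["politics", "sports", "music", "food"],     # action space per request
    "request_id": "4f2c0d4e-8a1b-4c3d-9e2f-000000000000",   # UUID for joining rewards later
}

scoring_response = {
    "request_id": "4f2c0d4e-8a1b-4c3d-9e2f-000000000000",
    "chosen_action": "politics",
    "propensities": {"politics": 0.85, "sports": 0.05, "music": 0.05, "food": 0.05},
}

# Emitted by the client application when the recommendation is acted upon and
# joined to the scoring event in the next training cycle.
business_event = {
    "request_id": "4f2c0d4e-8a1b-4c3d-9e2f-000000000000",
    "event": "article_clicked",
}
```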

There are two entry points for adding models to the system. The Experimentation Interface allows for kicking off an experiment with an untrained bandit model. The blank model only starts learning in production as it observes feedback for the actions it takes and is typically used for more efficient A/B tests. The Model Development way is more suitable for sophisticated models that are developed in source-controlled repositories and potentially pre-trained offline. This flow is very similar to the existing supervised learning model development and deployment.

The models are registered with the Model Database and loaded into the LyftLearn Serving Model Serving instances utilizing the existing model syncing capabilities.

In Policy Update, the events for the model scores and their responses on the client application side are pulled from the Data Warehouse and joined, and a customer-provided reward function is applied. This data is used to incrementally update the latest model instance.

Finally, the retrained model is written back to S3 and promoted in the Model Database. The Policy Update is orchestrated by a Model CI/CD workflow definition which schedules the training job and takes care of promoting the new model.

Library

We leverage open-source libraries like Vowpal Wabbit and RLlib for modeling. In addition, we created our own internal RL library with model definitions, data processing, and bandit algorithm implementations to integrate with our infrastructure and make it easy for model creators to get started.

Vowpal Wabbit

For modeling CBs, we’ve chosen the Vowpal Wabbit library for its maturity with a decade of research and development, currently maintained by Microsoft Research. While it is not the most user-friendly ML library with some odd text-based interfaces, it comes with a wealth of battle-tested features, such as multiple exploration algorithms, policy evaluation methods, and advanced capabilities like conditional bandits. The authors are also prolific researchers in the field. For a comparison of different bandit techniques, the Contextual Bandit Bake-off paper is a great starting point.

Lyft RL Library

In order to integrate the VW Contextual Bandits and other RL models into our ML ecosystem, we created a library with the following components.

Model class hierarchy showing how the RL base model extends the general model class and different RL implementations like Vowpal Wabbit extend the base RL model
Model Class Inheritance Tree

Core

The core layer adapts the RL-specific components to the existing interfaces of supervised learning models. This includes the RL base model class definition, which extends the generic model class and adapts model loading, training and testing to RL patterns. Additionally, it contains the data models for events and the model response, as well as utilities for extracting data from logged events, transforming training data and processing rewards.
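
A rough, hypothetical sketch of that hierarchy; the class and method names are invented for illustration and are not the actual LyftLearn interfaces.

```python
from typing import Any, Dict, List


class LyftLearnModel:
    """Stand-in for the generic supervised-learning model base class."""

    def load(self, path: str) -> None:
        raise NotImplementedError

    def predict(self, features: Dict[str, Any]) -> Any:
        raise NotImplementedError


class RLBaseModel(LyftLearnModel):
    """Adapts RL-specific loading, training and testing to the generic interface."""

    def score(self, features: Dict[str, Any], actions: List[str]) -> Dict[str, Any]:
        """Return the chosen action and the propensities of all actions."""
        raise NotImplementedError

    def update(self, observations: List[Dict[str, Any]]) -> None:
        """Incrementally train on a batch of logged observations."""
        raise NotImplementedError


class VowpalWabbitBandit(RLBaseModel):
    """Wraps a VW workspace: serialization, text-format translation, metrics."""
```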

Library

The library layer adds implementations of the abstract core base classes for particular applications, such as Vowpal Wabbit or our own MAB algorithms. For VW, this includes using the library’s serialization schemes, emitting performance metrics and feature weights as well as translating feature dictionaries to VW’s text-based format.

Evaluation

Another important component is the evaluation tooling for model development. This includes Shapley Values-based feature importance analysis which is helpful for context feature selection as well as customizations for the Coba Off-Policy Evaluation framework discussed in a later section.

Serving

RL models use the same scoring API endpoints as traditional models. The difference is in the additional arguments passed in the request body for RL models. Each model supports a model handler to process input data before passing it to the actual model artifact. This mechanism is used to perform the necessary feature transformations for the VW models and to translate the output back into our expected format.

Diagram showing how the RL models share the same foundation as supervised learning ones except for an RL specific shim
Serving layers in LyftLearn Serving
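
Conceptually, the handler performs roughly the following steps; this is a simplified sketch with an invented interface, reusing a text-format helper like the one from the demo sketch.

```python
import random


class VWModelHandler:
    """Hypothetical handler: pre-processes the request into VW's text format,
    scores it, and post-processes VW's output into the platform's response
    schema. The real handler interface is not shown here."""

    def __init__(self, vw_workspace, actions, to_vw_format):
        self.vw = vw_workspace
        self.actions = actions
        self.to_vw_format = to_vw_format  # e.g. the helper from the demo sketch

    def handle(self, request: dict) -> dict:
        actions = request.get("actions") or self.actions
        pmf = self.vw.predict(self.to_vw_format(request["features"], actions))
        # Sample from the exploration distribution rather than always taking
        # the arg-max, so exploration also happens at serving time.
        chosen = random.choices(actions, weights=pmf, k=1)[0]
        return {
            "chosen_action": chosen,
            "propensities": dict(zip(actions, pmf)),
        }
```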

Training

There are two phases for training a model: warm-starting before the actual launch and continuous updates during the lifetime of the model.

Warm Start

The bandit model can be pre-trained offline on log data of an existing policy. The logging policy does not have to be a bandit but can be some heuristic that the CB model is supposed to replace. Ideally, the logged actions include their propensities; however, learning works without them as well, as long as reward data can be associated with the selected actions.

Warm-starting avoids the costly exploration phase and kick-starts the model performance without preventing it from adapting to changes in the environment over time. While warm-starting helps with reducing regret, it is not necessary, and models can start out with exploring all actions evenly and then adjusting their exploration based on recurring training cycles.
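
A sketch of what warm-starting from historical logs could look like, using VW’s cb label format; the file name, column names and the uniform-propensity fallback are assumptions for illustration.

```python
import pandas as pd
import vowpalwabbit

NUM_ACTIONS = 4
vw = vowpalwabbit.Workspace(f"--cb_explore {NUM_ACTIONS} --epsilon 0.2 --quiet")

# Hypothetical log of the heuristic policy: context features, the chosen action
# (as an index 1..NUM_ACTIONS), the observed reward and, if available, the
# propensity with which the action was taken.
logs = pd.read_parquet("heuristic_policy_logs.parquet")

for row in logs.itertuples():
    # Fall back to a uniform propensity if the logging policy didn't record one.
    probability = getattr(row, "propensity", None) or 1.0 / NUM_ACTIONS
    cost = -row.reward  # VW minimizes cost, so reward becomes negative cost
    # VW's cb label format: "<action>:<cost>:<probability> | <features>"
    vw.learn(f"{row.action}:{cost}:{probability} | user={row.user} time_of_day={row.time_of_day}")

vw.save("warm_started_model.vw")
```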

Continuous Updates

For continuous training, we use the same Model CI/CD pipeline that is used for the automatic retraining of supervised learning models. The model update queries join all model scoring events since the last training cycle with the relevant reward data. The model scoring event includes the context features, selected action, and the probabilities of all actions as well as a UUID.

Reward data can either be emitted by the business application explicitly for the purpose of training the model, or it can be derived from business metrics that are captured anyway. We just have to make sure to link the model action to a particular outcome, typically by joining on a session ID that’s used as the model scoring UUID. Ideally, the reward is not emitted directly by the business application; instead, the metrics for calculating the reward are logged, e.g. an article being selected. This allows different reward functions to be evaluated and tweaked over time, e.g. giving higher rewards for newer or longer articles. For applications that receive their rewards delayed, the data joining logic needs to be a bit more sophisticated.

Once all the necessary data is joined, we extract the training fields into a data frame, perform some processing including data cleaning and normalization, and then update the model. To evaluate the training progress, we emit VW’s internal training loss as well as the changes in feature weights.
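
Putting the pieces together, a simplified version of one training cycle might look like this; the table names, columns and reward function are hypothetical.

```python
import pandas as pd
import vowpalwabbit

# Hypothetical tables and columns; in production these come from the data
# warehouse as part of the scheduled Model CI/CD training job.
scores = pd.read_parquet("model_scoring_events.parquet")      # request_id, user, time_of_day, action, propensity
outcomes = pd.read_parquet("client_business_events.parquet")  # request_id, clicked

batch = scores.merge(outcomes, on="request_id", how="left")
batch["clicked"] = batch["clicked"].fillna(False)

# Customer-provided reward function; because the raw outcome is logged rather
# than a pre-computed reward, this can be tweaked without changing the client.
batch["reward"] = batch["clicked"].astype(float)

# Load the latest model artifact and update it incrementally on the new batch.
vw = vowpalwabbit.Workspace("-i latest_model.vw --quiet")
for row in batch.itertuples():
    vw.learn(f"{row.action}:{-row.reward}:{row.propensity} | user={row.user} time_of_day={row.time_of_day}")
vw.save("updated_model.vw")
```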

Upon completion of the training cycle, the new model artifact is registered as the latest version and loaded into the model serving instance with a zero-downtime hot-swap.

Model Development using Off-Policy Evaluation

Figuring out the development and evaluation of applied RL models has had the steepest learning curve of the whole process.

Compared to supervised learning, there is no fully labeled data-set that can be used for training and performance evaluation. The model only receives a reward for the action it took and doesn’t know if another action would have been better for that context. Most RL research applications use a simulator or physics engine for evaluating the model, which gives virtually an unlimited amount of consistent observations. In that setting, the models can learn incrementally on-policy from every observation.

Most real-world problems are too complex to simulate sufficiently. Therefore, we focus on learning from logged observations. For a green-field application, that might require initially deploying a simple model that primarily serves as a data logger.

Mini-batch updates are also favored over learning from every single observation: they give better model versioning control for roll-backs and keep models stateless, which makes distributed serving easier.

Off-Policy Evaluation

Gauging model performance is one of the most difficult aspects of Reinforcement Learning and deserves an article on its own. Here we only want to touch on some of the strategies and tools for addressing this challenge.

Contrary to supervised learning there’s no labeled ground truth that can be used to compute performance metrics like RMSE or accuracy. Partial feedback describes the problem of only receiving a potential reward for the selected action without knowing if another action would have received a higher reward. OPE addresses this issue by filling in the rewards for untaken actions through estimators to create a full-information dataset, similar to the training data in supervised learning. Candidate models are then evaluated on this data by running counterfactual analyses.

The most popular estimators are Inverse Propensity Score (IPS), Direct Method (DM) and Doubly Robust (DR), which make different trade-offs between variance and bias. IPS, for example, weights each observed reward by the ratio of the candidate policy’s propensity for the logged action to the logging policy’s propensity, and averages these weighted rewards across observations; the weighting factor corrects for the sampling bias in the logged data. Estimators are an active field of research, with various sub-variants and improvements still being developed.
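
For intuition, a bare-bones IPS computation over logged bandit data could look like this; a simplified sketch, as real estimators add refinements such as clipping and normalization.

```python
def ips_estimate(logged, candidate_propensity):
    """Inverse Propensity Score estimate of a candidate policy's average reward.

    `logged` is an iterable of (context, action, reward, logging_propensity)
    tuples and `candidate_propensity(context, action)` returns the probability
    the candidate policy would assign to the logged action. Names are illustrative.
    """
    total, n = 0.0, 0
    for context, action, reward, logging_propensity in logged:
        weight = candidate_propensity(context, action) / logging_propensity
        total += weight * reward  # re-weight the observed reward to correct sampling bias
        n += 1
    return total / n
```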

Coba

Coba is an open-source Contextual Bandit benchmark framework that we contribute to and which has been instrumental for analyzing the performance of our models. While VW natively supports learning from logs, its evaluation and visualization capabilities are fairly limited. This is where Coba comes in with features including:

  • Data sourcing, either by reading bandit logs from file, converting a classification dataset, or by defining a simulation
  • Implementation of different learners as well as adapters for Vowpal Wabbit models
  • Experiment configurations for the way learners process data, e.g. batched, shuffled, multi-pass, etc.
  • OPE reward estimator implementations for IPS, DM and DR
  • Evaluator implementations for on- and off-policy evaluation as well as rejection sampling
  • Rich result evaluation metrics and visualizations
Coba architecture diagram showing the different components of environment, learners, evaluator, experiment and result
Coba Workflow

Workflow

The diagram shows a typical model development workflow that can be implemented in a Jupyter notebook; a code sketch of these steps follows the list below.

  • The workflow starts with cleaning and scaling the logged data.
  • Next, the Environment is configured which includes selecting the data-source and any desired transformations as well as the reward estimator.
  • Then, the set of Learners and their hyperparameters are selected. It’s good practice to include a random policy and a simplistic learner for baseline performance. Hyperparameter optimization can be performed by creating learners from the combinations of different hyperparameter ranges.
  • The Evaluator contains the logic for training, scoring as well as reward generation, and is discussed in more detail in the next section.
  • The three components above are plugged into an Experiment with configs such as execution parallelism and file names.
  • The output is a Result object for which there is rich functionality for plotting and slicing the results.
  • Finally, additional data analysis can be performed such as evaluating model convergence by inspecting the context-specific arm entropy.
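
Putting the steps above together, a minimal Coba experiment might look roughly like the following; the class names follow Coba’s public examples, but exact signatures vary between versions, so treat this purely as a sketch.

```python
import coba as cb

# Illustrative environment: a synthetic bandit problem. In practice this would
# be our cleaned bandit logs or a converted classification dataset.
environments = cb.Environments.from_linear_synthetic(1000)

learners = [
    cb.RandomLearner(),         # random baseline policy
    cb.VowpalEpsilonLearner(),  # VW epsilon-greedy candidate
]

result = cb.Experiment(environments, learners).run()
result.plot_learners()  # average reward per learner over the interactions
```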

Evaluators

Evaluators are the core component of the benchmark. They feed the observations to the model being evaluated, execute training and scoring as well as calculate the rewards. There are on- and off-policy forms of learning and evaluating the model using different estimators.

Diagram showing the differences between on- and off-policy learning in how they process logged observations
Off- vs. On-Policy Evaluation Flows

The diagram compares the two approaches. In off-policy evaluation, the model is trained directly on the log data. The actions, propensities, and rewards marked with (on) for on-policy are only used for additional evaluation. In contrast, for on-policy evaluation, the model outputs are used to train the candidate model. For either approach, the Reward Estimator is built from the logged observations and will return a reward estimate for the on-policy action.

Off-Policy

Off-policy evaluation means that the behavior policy that logged the data and the target policy that’s being evaluated differ. That’s the case when training a model on logged observation data with the goal of finding a better-performing policy through counterfactual analysis. The context features, action, its propensity, and reward for each observation are passed to Vowpal Wabbit’s training function. VW uses its own internal estimator to reduce the problem to a supervised learning one. Coba adds its own reward estimations and logs the selected actions of the candidate model for evaluating its performance.

On-Policy

In on-policy evaluation, the policy selecting the action is being evaluated. Rather than using the action and reward of the logging data, the candidate model is trained on its own scores. Rewards for the actions that the logging policy didn’t take are imputed by the selected Coba reward estimator. It generally doesn’t work as well for the offline learning use-case but is a good option to cross-check the off-policy results, especially if the action distribution is skewed or for testing different exploration algorithms.

Rejection Sampling

This methodology assesses the performance of different exploration algorithms on the logged data. It does so by filtering out samples that the candidate model would have been unlikely to take, by comparing the logged propensity with the candidate model’s action propensity. The challenge with this approach is that a large fraction of the samples, as much as 95%, is rejected when the behavior and candidate policies differ significantly.
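
In spirit, the filtering step works like this; a simplified sketch, not Coba’s actual implementation.

```python
import random

def rejection_sample(logged, candidate_propensity, m=1.0):
    """Keep only logged interactions the candidate policy plausibly would have
    generated; `m` bounds the propensity ratio. Simplified illustration only."""
    kept = []
    for context, action, reward, logging_propensity in logged:
        ratio = candidate_propensity(context, action) / logging_propensity
        # Accept with probability proportional to how likely the candidate was
        # to take the logged action; most samples are rejected when the two
        # policies differ a lot.
        if random.random() < min(1.0, ratio / m):
            kept.append((context, action, reward))
    return kept
```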

For more details on the evaluators, check out the implementation in Coba and this notebook which compares their performance.

Discussion

We’ve covered a lot of ground. Here, we want to reflect on the learnings of going through this process as well as give a sneak peek of what’s next for RL at Lyft.

Lessons Learned

Supporting RL models on a machine learning platform that’s extensible is actually fairly straightforward but getting the models to perform well is hard.

  • The evaluation of RL models is inherently more difficult due to the lack of labeled ground truth data, but the tooling is also nowhere near as mature as that for supervised learning. Therefore, it requires a lot of trial and error to build an intuition for how things work. A great way to do this is by working with data for which the underlying distribution is known. This can either be a classification dataset or a simulation where the reward distribution is controllable. Both are supported in Coba.
  • When moving to a real-world application, it’s important to normalize the reward as well as to evaluate different hyperparameters such as learning rate and interaction terms.
  • For evaluating a candidate model, relying solely on the total accumulated reward is risky as the measure is sensitive to estimator errors. Other metrics like context-specific convergence provide additional insights. Overlaying the candidate model’s arm selection frequencies with the average arm rewards in the log data is also helpful to get a sense of how well the model adapts to changes in the environment.
  • Running multiple evaluation passes over slightly shuffled versions of the data (just jiggling the order of the rows a bit to avoid scrambling longer term patterns) is helpful to gain confidence intervals.

Next steps

Our main focus is currently on leveraging the investments in the platform and tooling for more use-cases across our products. While building out these new models we are also investigating non-linear Contextual Bandits for improved model performance as well as better evaluation techniques for non-stationary problems. Additionally, we continue to evaluate full-RL solutions for more complex problems.

Acknowledgements

Big thanks go out to my teammates on the ML Platform team supporting this work-stream: Alex Jaffe, Mihir Mathur, Martin Liu and Konstantin Gizdarski.

Further kudos are in order for our early adopters who closely collaborated with us to get this initiative off the ground: Kedar Thakkar, Andrii Omelianenko, Akshay Sharma, Yanqiao Wang and Shaswat Shah.

A final thank you goes out to Mark Rucker for creating Coba and being a generous advisor.

If you’re interested in working with us on building state-of-the-art ML systems or solving other complex challenges to create the world’s best transportation service, we’d love to hear from you! Visit www.lyft.com/careers to see our openings.
