Full-Spectrum ML Model Monitoring at Lyft


Machine Learning models at Lyft make millions of high-stakes decisions per day, from physical safety classification to fraud detection to real-time price optimization. Since these model-driven actions affect the real-world experiences of our riders and drivers as well as Lyft’s top and bottom line, it is critical to prevent models from degrading in performance and to alert on malfunctions.

However, identifying and preventing model problems is hard. Unlike problems in deterministic systems whose errors are easier to spot, models’ performance tends to gradually decrease, which is more difficult to detect. Model problems stem from diverse root-causes including:

Bugs in the caller service which passes wrong features or incorrect units to the model (garbage in results in garbage out)
Unexpected change in an upstream feature definition
Distribution changes of input features (Covariate Shift)
Distribution changes of output labels (Label Shift)
Conditional distribution changes for output given an input (Concept Drift)
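For readers who prefer notation, the three distribution changes above can be written down using the common textbook definitions. This is a sketch, not notation taken from Lyft's system: x denotes the input features, y the label, and the "train" and "prod" subscripts (our naming) denote the training-time and production-time distributions.

```latex
\begin{align*}
\text{Covariate shift:} &\quad P_{\text{train}}(x) \neq P_{\text{prod}}(x), \quad P_{\text{train}}(y \mid x) = P_{\text{prod}}(y \mid x) \\
\text{Label shift:} &\quad P_{\text{train}}(y) \neq P_{\text{prod}}(y), \quad P_{\text{train}}(x \mid y) = P_{\text{prod}}(x \mid y) \\
\text{Concept drift:} &\quad P_{\text{train}}(y \mid x) \neq P_{\text{prod}}(y \mid x)
\end{align*}
```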

One example of a model problem happened when Estimated Times of Arrival (ETAs) fell in response to declined demand due to COVID. The ETA models were retrained since they were over-predicting ride times due to being trained on historical demand. This, however, caused other models which took ETAs as input to dramatically under-predict on pricing, revealing one of the many challenges we hadn’t originally anticipated.

In early 2020, motivated by examples like the one above and by the rapid influx of production ML models at Lyft, we embarked on a journey to build a robust system for identifying and preventing model degradation. In this post, we will discuss the variety of model monitoring approaches we developed and the cultural change needed to get ML practitioners to effectively monitor the models powering Lyft. We hope our learnings help other ML platform teams navigate the space as they consider building (or buying) a model monitoring system.

The Challenges We Faced

Prioritization among different monitoring techniques: There are several factors that affect the design of a model monitoring system: type of model (online or offline), what to monitor (features or predictions), when to monitor (during training, during real-time inference or on a schedule post inference), and whether to consider samples by themselves or in a time-series context. Given a limited engineering bandwidth, what exactly should we prioritize building to deliver the most impact?
Technical challenges: The technical challenges included extremely tight latency requirements, long delays in getting the ground truth for some models, and building in a way that lets ML modelers configure and fine-tune alarms.
Lack of standardization: Before introducing a platform solution, a few ML modelers at Lyft had built some form of custom monitoring for their models. Building one-off monitoring systems per model in a company with hundreds of models is akin to building dams on minor tributaries — there is duplication of work and no centralized visibility and control. By building a centralized model monitoring system we wanted to build a dam at the source — performant, flexible, and cost-effective.
Adoption: A challenge we faced was getting all ML modelers at Lyft to effectively use the suite of model monitoring approaches in our centralized system as well as instrumenting existing models.

Strategy

Our general strategy for creating a comprehensive model monitoring system was to build in two phases. The focus of the first phase was on monitoring techniques that are quick to onboard models and agile enough to catch the most obvious model problems, i.e. Feature Validation and Model Score Monitoring. The second phase focused on building more powerful offline monitoring techniques, such as Performance Drift Detection and Anomaly Detection, which are capable of diagnosing complex problems.

Spectrum of Model Monitoring Techniques

(Note: We have a separate system for Data Quality monitoring, Verity, that is complementary to our model monitoring system, and is used for offline semantic correctness checks across column values of data tables.)

Model Monitoring Techniques

In this section, we’ll discuss the implementation of our different monitoring modes.

Model Score Monitoring

For every model scoring request made to our online model serving solution, LyftLearn Serving, the system emits the model output to our metrics system. We can then define time-series based alerts on this data.

Model Score Monitoring System Components

Out of the box, the system checks that models are not stuck emitting the same score over a period of time, which would indicate potential upstream issues.

In order to reduce false-positives, this query requires a minimum number of requests made to the model (by default, ten requests over a period of an hour). Each team can parametrize the alert to their requirements and add their own alerts on the data, e.g. checking that the average model score is within a certain value range.
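The stuck-score check and a team-defined range check can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the query that runs in Lyft's metrics system: the names `StuckScoreConfig`, `is_stuck`, and `mean_score_in_range` and the `tolerance` parameter are hypothetical, and only the default of ten requests per hourly window comes from the text.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Sequence


@dataclass(frozen=True)
class StuckScoreConfig:
    """Hypothetical alert parameters that a team could override."""
    min_requests: int = 10   # default from the text: ten requests per one-hour window
    tolerance: float = 0.0   # assumption: scores within this spread count as identical


def is_stuck(scores: Sequence[float],
             config: StuckScoreConfig = StuckScoreConfig()) -> bool:
    """Return True when the window has enough traffic and every score is the same."""
    if len(scores) < config.min_requests:
        # Too little traffic to conclude anything; staying silent avoids false positives.
        return False
    return max(scores) - min(scores) <= config.tolerance


def mean_score_in_range(scores: Sequence[float], low: float, high: float) -> bool:
    """Example of a team-defined check: the average score stays within [low, high]."""
    return low <= fmean(scores) <= high
```

A window of twelve identical scores would trip `is_stuck`, while five identical scores would not, because the window falls below the minimum request count.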

The benefits of this approach are that it allows for checks to be online and on time-series data, while the approaches discussed next only support one or the other. Another strength is that a base level of protection is provided without requiring any effort from the model owner. The downsides are that only data in the metrics system can be referenced, e.g. no ground-truth data, and that the data processing capabilities are limited.

Feature Validation

This technique validates the features of every prediction request online against a set of expectations on that data.

Feature Validation System Components

Typical data tests include:

Type checks — passing a string of “1” instead of the integer 1
Value ranges — -10 for a distance or 1000 for an age
Missing values for a required feature — “None” for city
Set membership — an unknown label for a categorical feature
Table expectations — checking that all the expected feature names are present

For defining those expectations we use the Great Expectations open-source library. A basic set of expectations can be automatically generated by running a profiler on the feature data set of the model. Further, expectations can be conveniently created with the library, typically in a notebook environment.

Here’s an example of a simple expectation set:

suite.expect_column_values_to_not_be_null('city', meta={'severity': 'ALERT'})
suite.expect_column_values_to_be_in_set('ride_type', {'regular', 'shared', 'xl'})
suite.expect_column_values_to_be_between('distance', 0, 100, meta={'condition': ['distance', '!=', None]})

The expectation suite is registered with our backend, synced to the model serving system, and then applied within our model monitoring library to validate every incoming request.

Model Monitoring Library

On top of Great Expectations we have built our own model monitoring library to tie the validation into our infrastructure and add features geared towards online feature validation including:

Tagging expectations with the severity of a violation which is used to determine if model owners should be paged right away or only after exceeding a threshold
Conditional expectation support — a common scenario is to only perform range checks on a feature value when the feature has a numerical value and otherwise ignore null values
Integration with Lyft’s stats and logs systems
A data profiler geared towards the type of expectations relevant for our application
Abstractions for using other data validation libraries
A low-latency data validator: Great Expectations typically validates large datasets against expectation suites. For our online use case, in which we only validate a single feature set (row) at a time, an implementation with less overhead was required. Our lightweight validator reduced the latency by over 500x to 0.1ms for a typical feature set. The validation is further performed asynchronously to make the latency impact on model scoring negligible.
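To make the ideas in this list concrete, here is a minimal sketch of a single-row validator with severity tags and conditional expectations. It is an illustration only and does not reflect the API of Lyft's library or of Great Expectations: the `Expectation` class, `validate_row`, `should_page`, the "WARN" level, and the paging threshold are all hypothetical names and values, and asynchronous execution is left out.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Mapping, Optional

Row = Mapping[str, Any]


@dataclass(frozen=True)
class Expectation:
    name: str
    check: Callable[[Row], bool]                       # True when the row satisfies it
    severity: str = "WARN"                             # hypothetical levels: "WARN", "ALERT"
    condition: Optional[Callable[[Row], bool]] = None  # evaluate only when this holds


def validate_row(row: Row, expectations: List[Expectation]) -> List[Expectation]:
    """Return the expectations violated by a single feature set (row)."""
    violations = []
    for exp in expectations:
        if exp.condition is not None and not exp.condition(row):
            continue  # conditional expectation does not apply to this row
        if not exp.check(row):
            violations.append(exp)
    return violations


def should_page(violations: List[Expectation], warn_count: int,
                warn_threshold: int = 50) -> bool:
    """Page right away on ALERT; page on WARN only after a (made-up) threshold."""
    return any(v.severity == "ALERT" for v in violations) or warn_count > warn_threshold


EXPECTED_FEATURES = {"city", "ride_type", "distance"}


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float))


SUITE = [
    # Table expectation: all expected feature names are present.
    Expectation("all_features_present", lambda r: EXPECTED_FEATURES.issubset(r.keys())),
    # Missing value for a required feature.
    Expectation("city_not_null", lambda r: r.get("city") is not None, severity="ALERT"),
    # Set membership for a categorical feature.
    Expectation("ride_type_in_set", lambda r: r.get("ride_type") in {"regular", "shared", "xl"}),
    # Type check, skipped when the value is null.
    Expectation("distance_is_numeric", lambda r: _is_number(r["distance"]),
                condition=lambda r: r.get("distance") is not None),
    # Range check, applied only when the value is numeric.
    Expectation("distance_between_0_and_100", lambda r: 0 <= r["distance"] <= 100,
                condition=lambda r: _is_number(r.get("distance"))),
]
```

Because each expectation is a plain callable evaluated against one dictionary, there is no per-request dataset construction, which is the kind of overhead a single-row validator is meant to avoid.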

Anomaly Detection

Anomaly detection is a technique that identifies potential model problems by analyzing statistical deviations of logged features and predictions over long periods of time. The calculation of aggregate metrics and the evaluation of deviations run on a schedule, typically daily. In our experience, the most indicative signals for potential problems are z-scores of over 2 for:

Call volume
Model score mean
Feature value mean
Feature null percentage

Anomaly Detection System Components

The biggest benefit of this approach is that statistical checks for all numerical features and model scores are performed automatically and require no onboarding from users. On the flip side, there tend to be many false positives because features and predictions can deviate statistically without implying a problem with the models (e.g. on New Year’s Eve the rider_intents feature might be several standard deviations higher than in the week leading up to it). For this reason, we mostly consume the results in the form of reports rather than alerting the model owners.
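As a rough illustration of the z-score signal described in this section, the sketch below compares one day's aggregate (for example call volume or a feature's null percentage) against a trailing history. The function names, the use of the sample standard deviation, and flagging deviations in both directions are assumptions on our part; only the threshold of 2 comes from the text.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def z_score(history: Sequence[float], today: float) -> Optional[float]:
    """Z-score of today's aggregate against the trailing history, or None if undefined."""
    if len(history) < 2:
        return None  # a standard deviation needs at least two points
    sigma = stdev(history)  # sample standard deviation (assumption)
    if sigma == 0:
        return None  # constant history: the z-score is undefined
    return (today - mean(history)) / sigma


def is_anomalous(history: Sequence[float], today: float, threshold: float = 2.0) -> bool:
    """Flag the value when its z-score exceeds the threshold in either direction."""
    z = z_score(history, today)
    return z is not None and abs(z) > threshold
```

With a daily call-volume history of 100, 102, 98, 101, and 99, a day with 110 calls would be flagged, while a day with 101 calls would not.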

Performance Drift Detection

Our most powerful model monitoring technique for catching performance degradation over time is Performance Drift Detection. In this offline monitoring approach we retrieve arbitrary data, perform transformations on it, and then validate the output against expectations. This system is built using Lyft-Distributed powered by Kubernetes Spark and Fugue, which were discussed in our last blog post. The most popular use-case is to join model scores with their ground-truth data and calculate a performance metric on it. We use the same model monitoring library as in Feature Validation to evaluate the output against expectation suites and emit relevant metrics and logs.

Performance Drift Detection System Components

The user-provided parts for Performance Drift Detection are:

SQL query for retrieving the ground truth and predictions from the data store
Post-processing steps on the query output for computing the performance metrics
A collection of expectations for those metrics

The following JSON snippet shows an example configuration with the mean error metrics generated in the SQL query and the squared error ones generated through post-processing:

{
  "drift_detection": {
    "associated_model_uuids": ["model_uuid_1"],
    "metrics": [
      "frac_dist_mean_error",
      "frac_time_mean_error"
    ],
    "data_gen": {
      "sql_path": "data.sql",
      "sql_type": "presto",
      "parameters": {
        "days_back": "7",
        "regions": "('LAX', 'SFO', 'SEA', 'CHI')"
      },
      "postprocess": {
        "frac_time_mean_square_error": {
          "processor": "mean_squared_error",
          "arguments": {
            "y_true": "time",
            "y_pred": "predicted_time"
          }
        },
        "frac_dist_mean_square_error": {
          "processor": "mean_squared_error",
          "arguments": {
            "y_true": "dist",
            "y_pred": "predicted_dist"
          }
        }
      }
    },
    "validation": {
      "rule_path": "drift_detection_rule.json"
    }
  }
}

Similar to the anomaly detection workflow, the performance drift detection workflow is run on a schedule, typically daily. Model owners can be alerted about violations or check the performance of their production models on a dashboard.
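The post-processing and validation steps can be pictured with a small Python sketch. It assumes the SQL query has already returned rows that join ground truth with predictions. The `PROCESSORS` registry, the `postprocess` and `violated_metrics` functions, and the upper-bound rule format are hypothetical stand-ins, since the actual processors and the contents of the rule file are not shown here.

```python
from typing import Callable, Dict, List, Mapping, Sequence

Rows = Sequence[Mapping[str, float]]


def mean_squared_error(rows: Rows, y_true: str, y_pred: str) -> float:
    """Mean of squared differences between a ground-truth and a prediction column."""
    return sum((r[y_true] - r[y_pred]) ** 2 for r in rows) / len(rows)


# Hypothetical registry mapping the "processor" name in the config to a function.
PROCESSORS: Dict[str, Callable[..., float]] = {"mean_squared_error": mean_squared_error}


def postprocess(rows: Rows, spec: Mapping[str, Mapping]) -> Dict[str, float]:
    """Compute each configured metric from the joined rows."""
    return {
        metric: PROCESSORS[cfg["processor"]](rows, **cfg["arguments"])
        for metric, cfg in spec.items()
    }


def violated_metrics(metrics: Mapping[str, float],
                     max_allowed: Mapping[str, float]) -> List[str]:
    """Return the metrics that exceed their (assumed) upper-bound expectation."""
    return sorted(name for name, limit in max_allowed.items() if metrics[name] > limit)


# Mirrors the "postprocess" section of the configuration above.
SPEC = {
    "frac_time_mean_square_error": {
        "processor": "mean_squared_error",
        "arguments": {"y_true": "time", "y_pred": "predicted_time"},
    },
    "frac_dist_mean_square_error": {
        "processor": "mean_squared_error",
        "arguments": {"y_true": "dist", "y_pred": "predicted_dist"},
    },
}
```

In a scheduled run, the rows would come from the configured query, and the returned list of violated metrics would feed alerts and the dashboard.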

Performance Drift Detection Visualization for a Model

This monitoring technique is the most effective at detecting intricate model performance issues. However, it requires high involvement from the model owner to provide the data query, transformation logic, and expectations on the output. An additional challenge to this approach is having reliable, timely ground-truth data in the first place.

Why Build instead of Buy?

When we started to develop a platform solution for monitoring all ML models in early 2020, there weren’t third-party solutions available that met all of our needs. The main drivers for building our own solution and leveraging open-source libraries were deep integration with our existing ML platform, e.g. ensuring a model is instrumented before we allow deployment, and avoiding lock-in with a commercial offering in this nascent field. As monitoring models was of high urgency and impact for Lyft, given how critical the models’ sustained performance is to our business, the engineering effort of building a robust solution was easy to justify.

Results

In this section, we discuss some of the success stories and learnings from building and rolling out our Full-Spectrum Model Monitoring system.

Wins

Catching bugs: With our model monitoring system, many issues have been prevented before they hit production. Feature validation examples include a required feature being omitted from the model scoring request due to a client code update, a high null-rate for a feature because of an upstream data-pipeline issue, and a mismatch in a feature’s name during training and serving due to a typo. A lot of these problems can be caught in staging before deploying the model to production unless the issue occurs while the model is already live.
Deprecating or retraining old models: The push to have all models monitored also shed light on older models that were not being actively used or which had performance deterioration. These models were either deprecated or retrained.
Adoption: Today, over 90% of our production models have Feature Validation and Model Score Monitoring and 75% have Performance Drift Detection or Anomaly Detection. We have fired hundreds of alarms and caught over 15 high-impact issues in the nine months following the internal general availability of this system.
Testimonials: One of the most important aspects for a platform solution is user satisfaction and buy-in. Here are a few comments we received:

“MLP’s model monitoring made it a snap to set up input validation for our recommendation models. The documentation and example notebook were very helpful in onboarding, and we were able to register validation checks within four hours.” (Data Scientist)

“… I forgot to add new features to the feature list when registering the model and because I set null checks to those features it triggered the alert” (Data Scientist)

“We were able to deploy feature validation for three models in about two days. The proportional observability and reliability benefits are immense, though! Kudos to the MLP for making this such a smooth experience.” (Software Engineer)

“This feature provides exactly what our team needed for a long time. Now, all of our models are enabled in this framework which adds great observability for our systems.” (Data Scientist)

Learnings

Building great tools is only half the story. In order to drive adoption, we had to make significant cultural changes. For most data scientists, operational concerns of running their model in production are typically not top of mind. Data scientists, however, do know the expectations of their model’s features and predictions best and need to be closely involved in setting up monitoring. Therefore, we invested significant effort in making onboarding as smooth as possible as well as selling the monitoring offerings internally through brown bags and partnering with product teams directly. The last step, after seeing healthy organic adoption, was to make monitoring mandatory for all models going forward and to enforce it programmatically so that best practices are followed.
Automate as much as possible. Even better than providing a seamless onboarding experience is to remove it entirely. We auto-generated Feature Validation expectation suites by profiling logged data for old models since it was particularly difficult to get owners to go back and instrument them. Model Score Monitoring and Anomaly Detection are also automatically provided to all models.
Monitoring is a defensive investment. Trying to estimate the impact of a model monitoring system is akin to quantifying the value of a smoke detector. A fire is needed to realize the impact of a monitoring system: if the system is working well, it alerts on the first flare before it becomes an expensive problem. This can make it difficult to prioritize over initiatives that more immediately and demonstrably affect the bottom line.
Monitoring will reduce shipping velocity slightly. By enforcing monitoring for all models there is an additional prerequisite in the productionization of models. The effort of setting up the model’s data expectations is marginal compared to the whole development cycle but still affects the shipping velocity. We are convinced that this is a worthwhile trade-off between rigor and agility, especially at the scale of Lyft’s ML usage and business overall.

We would like to thank the following team members of the Machine Learning Platform team for their contributions to building Model Monitoring at Lyft:

Jonas Timmermann, Yang Zhang, Han Wang, Shiraz Zaman, Craig Martell, Willie Williams, Mihir Mathur, Anindya Saha, Konstantin Gizdarski, Sallie Walecka, Vinay Kakade, Drew Hinderhofer, Hakan Baba