Building a large scale unsupervised model anomaly detection system — Part 2

By Rajeev Prabhakar, Han Wang, Anindya Saha

A camera lens looking at a city downtown

Photo by Octavian Rosca on Unsplash

In our previous blog we discussed the challenges we faced with model monitoring and our strategy for addressing some of them. We briefly mentioned using z-scores to identify anomalies. In this post, we dive deeper into anomaly detection and building a culture of observability.

Model observability is often neglected, yet it is critical to the machine learning model lifecycle. A good observability strategy helps narrow problems down to their root cause quickly and take appropriate action, such as retraining the model, improving feature selection, or troubleshooting feature drift.

The example below is what our finished product looks like. The highlighted regions are the timeframes where anomalies were detected. With a dashboard that contains the corresponding features, the root cause of an anomaly can be diagnosed quickly. This blog post discusses our approach to building a fully automated solution that finds and explains anomalies.

Utilizing Data Profiling

In our part-1 blog, we talked about the importance of data profiling.

Although it is common practice to monitor anomalies based on specific aggregated metrics computed on raw data, the question remains: which metrics are helpful? For outliers, the minimum, maximum, and 99th percentile are very useful. For numerical distribution drift, the mean and median are effective. For categorical data, frequent items help detect categorical drift, and cardinality reflects overall data quality. Clearly, a variety of metrics is needed, and these requirements can change over time. Recomputing metrics for large datasets can be very slow and cost prohibitive.

This is the main reason to leverage data profiling. We chose whylogs because it can compute a wide range of metrics, such as counts, distributions, cardinality, and frequent items, in a single pass over the data. Another reason we chose whylogs is its low latency and its ability to fit into a MapReduce framework.

After data profiling, the anomaly detection problem is converted to a standard time-series problem on the smaller profiles. We can use general approaches and tools that can be applied to other business contexts.
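As a rough illustration, here is a minimal sketch of this step using whylogs; the DataFrame name `prediction_logs` is a placeholder, and the metric names are whylogs' standard ones, including the "distribution/mean" metric referenced later in this post.

```python
import whylogs as why

# Profile one batch of raw data (e.g., an hour or a day of prediction logs)
# in a single pass. `prediction_logs` is a placeholder Pandas DataFrame.
profile_view = why.log(prediction_logs).view()

# Flatten the profile into one row of metrics per column: counts, cardinality,
# frequent items, and distribution statistics (mean, stddev, quantiles, ...).
metrics = profile_view.to_pandas()

# The "distribution/mean" metric is the one used in the examples below.
print(metrics["distribution/mean"])
```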

Anomaly Detection Design Principles

Below are the main factors we considered when building an anomaly detection solution. Because we already created a time series of profiles in the previous step, the solution focuses on leveraging the forecasted confidence intervals to find anomalies.

Versatility

To ensure adoption across the broader organization, the solution needs to be general and flexible enough to plug into most, if not all, business use cases. Historically, anomaly detection adoption lagged because each implementation required a large amount of domain-specific logic. Our solution aims to be general purpose and serve as a first line of defense; for more critical applications, more detailed business rules can be layered on top of it.

Balancing Accuracy and Speed

Striking a balance between accuracy and speed is crucial in time-series forecasting. To ensure we were using the best tool for the job, we evaluated several popular forecasting libraries, including Facebook Prophet, LinkedIn Greykite, and Nixtla StatsForecast. After careful consideration, we decided to adopt StatsForecast for time-series anomaly detection due to its exceptional performance.

Regarding accuracy, StatsForecast provides a wide range of statistical and econometric models for forecasting univariate time series. With this package, we can easily choose from models such as AutoARIMA, ETS (exponential smoothing), and MSTL, all wrapped behind the same caller interface, which lets us evaluate and generate forecasts from multiple models with just a few lines of Python. The models implemented in StatsForecast are written from scratch and have shown strong performance in recent forecasting competitions. StatsForecast also publishes several experiments benchmarking the performance of its models.

When it comes to speed, StatsForecast really stands out. Its models run impressively fast thanks to effective use of Numba and parallel computing, which means we can generate forecasts quickly and efficiently without compromising on accuracy. With this combination of accuracy and speed, StatsForecast is a great fit for our time-series anomaly detection needs.

Scalability

A general solution should work on both small and big datasets. The problem is that this would normally mean maintaining both a Pandas solution and a Spark solution. Fugue, an open-source abstraction layer for distributed computing that brings Python and Pandas code to Spark, allows users to define their logic with the local packages they are comfortable with and scale it with minimal wrappers. Abstracting away the execution engine lets us focus on defining the logic once.
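As a sketch of what this looks like with Fugue's transform(): the detection function below is a trivial placeholder, and `profiles_df`, the `spark` SparkSession, and the unique_id/y column names are assumptions borrowed from the forecasting example later in the post.

```python
import pandas as pd
from fugue import transform

def detect_anomalies(df: pd.DataFrame) -> pd.DataFrame:
    # Plain Pandas logic, written and tested locally.
    # (Placeholder body; the real logic is the forecast-based detection below.)
    df["is_anomaly"] = (df["y"] - df["y"].mean()).abs() > 3 * df["y"].std()
    return df

# Run locally on a Pandas DataFrame of profile metrics ...
local_result = transform(profiles_df, detect_anomalies,
                         schema="*, is_anomaly:bool",
                         partition={"by": "unique_id"})

# ... or distribute the exact same logic on Spark by only changing the engine.
# spark_result = transform(profiles_df, detect_anomalies,
#                          schema="*, is_anomaly:bool",
#                          partition={"by": "unique_id"}, engine=spark)
```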

Identifying Potential Anomalies

For time-series anomalies, the z-score based approach is effective and SQL friendly, but it generates too many false positives unless the thresholds are tuned for every scenario. We therefore also employ a forecasting-based machine learning approach that keeps the false positive rate low without per-scenario tuning.
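For reference, here is a minimal sketch of that z-score baseline; the threshold of 3 is an assumption, and it is exactly the knob that needs per-scenario tuning.

```python
import pandas as pd

def zscore_flags(metric: pd.Series, threshold: float = 3.0) -> pd.Series:
    # Flag points more than `threshold` standard deviations from the mean.
    z = (metric - metric.mean()) / metric.std()
    return z.abs() > threshold
```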

Generating Forecasts

With StatsForecast's models, we have plenty of options to choose from based on the characteristics of the data. StatsForecast also lets us run multiple models at once on the same time series with no noticeable additional runtime, which gives great flexibility to teams that are early in their anomaly detection journey. Below is an example choosing AutoARIMA as the model to generate forecasts.
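Here is a minimal sketch of that step, assuming the profile metric has already been reshaped into the unique_id/ds/y column layout StatsForecast expects; hourly seasonality (season_length=24), a 48-hour horizon, and a 99% interval are assumptions.

```python
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA

# `train_df` columns: unique_id (model/metric id), ds (timestamp), y (metric value).
sf = StatsForecast(models=[AutoARIMA(season_length=24)], freq="H")

# Forecast the next 48 hours with a 99% prediction interval, and keep the
# in-sample (fitted) values so historical anomalies can be analyzed too.
forecasts = sf.forecast(df=train_df, h=48, level=[99], fitted=True)
insample = sf.forecast_fitted_values()
```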

AutoARIMA forecasts using StatsForecast

In addition to obtaining forecasts, we can access each model's in-sample prediction values, which we use to analyze anomalies in historical data.

Isolating Anomalies

An anomaly is any data point that falls outside the forecast confidence interval. In the graph below, we can see the anomalies found in the model's predictions based on the "distribution/mean" metric. Only the days where anomalies were reported are highlighted.

Anomalies highlighted on model predictions
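A minimal sketch of that comparison, assuming the in-sample output from the AutoARIMA example above carries the observed value y together with the AutoARIMA-lo-99 / AutoARIMA-hi-99 interval columns (the same comparison applies to the forecast output):

```python
# `insample` comes from sf.forecast_fitted_values() above.
insample = insample.reset_index()  # unique_id may live in the index, depending on version

insample["is_anomaly"] = (
    (insample["y"] < insample["AutoARIMA-lo-99"])
    | (insample["y"] > insample["AutoARIMA-hi-99"])
)

# The highlighted regions in the chart correspond to these rows.
anomalous_hours = insample.loc[insample["is_anomaly"], ["unique_id", "ds", "y"]]
```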

Exploring Anomaly Root Cause

Anomalies by themselves seldom offer actionable insights. For an ML model, explaining that a prediction anomaly results from input feature drift provides valuable context for understanding it.

Now that we have found the anomalies, let's see whether we can identify their cause. The graph above does not give any insight into why an anomaly occurred. In this section, we investigate how the anomalies relate to drift in the input features.

Creating Feature Profiles

In a stable machine learning system, the aggregation metrics of features often remain predictable (with clear seasonality and trend) over time. In this section, let’s examine how feature drift over time impacts the predictions of the models.

Just like the profile of the prediction values, we have all of the profile metrics available for every feature. Below is an example of the time series generated from the "distribution/mean" values of the feature profiles.

Distribution/Mean of features of the model
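A sketch of how such a time series can be assembled, assuming `daily_profiles` is a mapping from date to the whylogs profile view of that day's feature data (a hypothetical structure; the profiles themselves are produced as in the profiling example earlier):

```python
import pandas as pd

# One "distribution/mean" value per feature per day.
rows = {
    day: view.to_pandas()["distribution/mean"]
    for day, view in daily_profiles.items()
}

# Time series with one row per day and one column per feature.
feature_means = pd.DataFrame(rows).T.sort_index()
```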

Feature Drift Impact

Since we are trying to identify feature drift that corresponds to a change in prediction values, we train a regressor on the changes in features over consecutive time periods against the change in predictions over the same interval. We then use Shapley (SHAP) values to explain the model.
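A minimal sketch of this step, assuming `feature_means` from the previous section and a `prediction_means` Series (the prediction profile's "distribution/mean" over the same dates, a hypothetical name); the random forest regressor is an illustrative choice, not necessarily what runs in production.

```python
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

# Period-over-period changes in the feature metrics and in the prediction metric.
X = feature_means.diff().dropna()
y = prediction_means.diff().dropna().loc[X.index]

# Regressor mapping feature drift to prediction drift.
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Shapley values rank which feature drifts best explain the prediction changes;
# averaging their absolute values per feature yields the bar chart below.
shap_values = shap.TreeExplainer(model).shap_values(X)
drift_importance = (
    pd.DataFrame(shap_values, columns=X.columns).abs().mean().sort_values(ascending=False)
)
```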

Importance of feature drift relative to the prediction

The bar chart above gives us a stack ranking of feature drift importance with respect to the change in prediction values. We can see that a change in request_latency had a profound impact on the model's prediction values. Using this information, plotting the identified features enables us to examine them.

Analyzing the Features

The highlighted red lines (anomalous hours from the prediction data) coincide with a spike and drop in request latency, which caused the anomalies in the model's predictions.

Monitoring and Alerting for Anomalies

It is important to have an effective communication channel for the discovered anomalies. We built a simple dashboard using Mode Analytics. Although these dashboards provide good insights into the model, the benefit is only realized with timely action on detected anomalies.

To address this, we send soft alerts through Slack instead of integrating with services like PagerDuty. Each model has a model owner and team with a Slack handle. Upon detection of any anomaly, the corresponding Slack channel gets notified.
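As an illustration, a soft alert can be as simple as a message posted to the owning team's Slack incoming webhook; the mapping and webhook URL below are hypothetical.

```python
import requests

# Hypothetical mapping from model id to the owning team's Slack webhook URL.
SLACK_WEBHOOKS = {"example_model": "https://hooks.slack.com/services/..."}

def notify_owner(model_id: str, anomalous_hours) -> None:
    # Soft alert: a channel message for the owning team, not a page.
    text = f":warning: Anomalies detected for `{model_id}` at {list(anomalous_hours)}"
    requests.post(SLACK_WEBHOOKS[model_id], json={"text": text}, timeout=10)
```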

Note that the identified anomalies are based on the past behavior of the metric. While we can choose a confidence interval that minimizes false positives, some are inevitable. In our experience, using paging systems like PagerDuty for these can lead to alert fatigue and, eventually, a loss of interest or trust in the system generating the alerts.

Applications

In this section, we discuss the different scenarios where the anomaly detection system has been used.

Taking Timely Action on ML Models

Each model is automatically onboarded onto the anomaly detection system since we rely on system logs rather than user setup. This has drastically reduced the turnaround time for acting on broken models, without requiring users to explicitly set anything up for anomaly detection.

Monitoring of Business Data/Metrics

Prediction anomaly detection for ML models relies on hourly or daily profiles, which are ultimately just time series. Building out the anomaly detection platform therefore allowed us to plug in any business metric with a defined time interval. Our Operations team uses anomaly detection on some of the most critical business metrics to get timely alerts and to review historical trends for forecasting corrections.

Real-Time Anomaly Detection

We are also experimenting with using StatsForecast to generate forecasts for future time horizons (such as the next two days) and then comparing real-time values against the forecasted values to determine whether they are anomalous. A real-time value is considered anomalous if it falls outside the confidence bounds of the forecast, and we notify users in real time when such deviations occur. This allows us to catch anomalous predictions within a few minutes.
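A sketch of that real-time check, reusing the forecast (with its 99% interval columns) from the AutoARIMA example above; the function and column names are assumptions.

```python
def is_realtime_anomaly(model_id: str, ts, value: float, forecasts) -> bool:
    # Look up the pre-computed forecast row for this model/metric and timestamp.
    row = forecasts[(forecasts["unique_id"] == model_id) & (forecasts["ds"] == ts)]
    if row.empty:
        return False  # no forecast available for this horizon yet
    lo = row["AutoARIMA-lo-99"].iloc[0]
    hi = row["AutoARIMA-hi-99"].iloc[0]
    return not (lo <= value <= hi)
```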

Acknowledgments

Special thanks to Shiraz Zaman and Mihir Mathur for Engineering and Product Management support behind this work.
