The Quest to Understand Metric Movements
Charles Wu, Software Engineer | Isabel Tallam, Software Engineer | Franklin Shiao, Software Engineer | Kapil Bajaj, Engineering Manager
Overview
Suppose you just saw an interesting rise or drop in one of your key metrics. Why did that happen? It’s an easy question to ask, but much harder to answer.
One of the key difficulties in finding root causes for metric movements is that these causes can come in all shapes and sizes. For example, if your metric dashboard shows users experiencing higher latency as they scroll through their home feed, that could be caused by anything from an OS upgrade to a logging or data pipeline error, an unusually large increase in user traffic, or a recently landed code change. The possible reasons go on and on.
At Pinterest, we have built different quantitative models to understand why metrics move the way they do. This blog outlines the three pragmatic approaches that form the basis of the root-cause analysis (RCA) platform at Pinterest. As you will see, all three approaches try to narrow down the search space for root causes in different ways.
Figure 1: Narrowing down the search space for root causes.
Slice and Dice
This approach finds clues for a metric movement by drilling down on specific segments within the metric; it has found success at Pinterest, especially in diagnosing video metric regressions.
For example, suppose we are monitoring video view rate (i.e., number of views over impressions). At Pinterest, a metric like video view rate is multidimensional: it has many dimensions like country, device type, Pin type, surface, streaming type, etc., that specify which subset of users the metric is describing. Using the different dimensions, we can break down the top-line metric into finer metric segments, each segment corresponding to a combination of dimension values. We are interested in identifying the most significant segments: those that have either contributed significantly to a top-line metric movement or have exhibited very unusual movements of their own that are not reflected in the top-line.
Our analysis of the metric segments takes inspiration from the algorithm in LinkedIn's ThirdEye. We organize the different metric segments into a tree structure, ordered by the dimensions we use to segment the metric. Each node in the tree corresponds to a possible metric segment.
Figure 2: Example of a segment tree.
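To make the tree structure concrete, here is a minimal sketch (in Python, with a toy dataset and hypothetical column names) of how per-segment aggregates could be organized into such a tree, ordered by a chosen list of dimensions:

```python
import pandas as pd

# Hypothetical per-segment aggregates: one row per combination of dimension
# values, with the raw counts needed to compute video view rate.
df = pd.DataFrame({
    "country":     ["US", "US", "US", "US", "FR", "FR"],
    "device_type": ["iOS", "iOS", "Android", "Android", "iOS", "iOS"],
    "pin_type":    ["video", "image", "video", "image", "video", "image"],
    "views":       [120, 40, 80, 30, 25, 10],
    "impressions": [400, 200, 350, 150, 90, 60],
})

def build_segment_tree(rows, dimensions):
    """Recursively group rows by the next dimension; each node is a segment."""
    node = {
        "views": int(rows["views"].sum()),
        "impressions": int(rows["impressions"].sum()),
        "children": {},
    }
    node["view_rate"] = node["views"] / node["impressions"]
    if dimensions:
        dim, rest = dimensions[0], dimensions[1:]
        for value, group in rows.groupby(dim):
            node["children"][f"{dim}={value}"] = build_segment_tree(group, rest)
    return node

# The dimension order defines the hierarchy of the tree (root = top-line metric).
tree = build_segment_tree(df, ["country", "device_type", "pin_type"])
print(tree["view_rate"])                            # top-line view rate
print(tree["children"]["country=US"]["view_rate"])  # one metric segment
```

The root node is the top-line metric, and every other node is the segment obtained by fixing the dimension values along the path from the root.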
Depending on your use-case, you can then define your own heuristics in terms of the different factors that determine the significance of a metric segment, in the context of its parent segment and/or the top-line metric, and synthesize those factors into an overall significance score.
The LinkedIn blog already lists several factors that we found useful, including how many data points a metric segment represents and how "unexpected" the metric segment's movement is, i.e., the gap between its observed and expected values, especially compared to its parent segment in the tree.
Here are some additional suggestions based on our experience that you could try:
- Try tweaking how the factors are calculated; e.g., for each metric segment, what are the “observed” and “expected” values? Are they values taken at two discrete points in time or averages/percentiles of data from two time windows (i.e., one baseline window and one window in which the anomalous top-line metric movement happened)? Similarly, the metric segment size factor could also be aggregated from a time window.
- Add new factors that make sense for your use-case; e.g., a factor like how well a metric segment correlates with the parent segment / top-line metric in the time window of interest.
- Adjust the weights of the different factors over time based on continued evaluations.
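Putting the factors and suggestions above together, here is a minimal sketch of synthesizing them into a single score. The factor definitions and weights are hypothetical choices for illustration, not the exact heuristics we use:

```python
import numpy as np

def significance_score(seg_baseline, seg_current,
                       parent_baseline, parent_current,
                       segment_share,
                       weights=(0.4, 0.4, 0.2)):
    """Combine a few illustrative factors into one significance score.

    seg_* / parent_* are 1-D arrays of metric values over the baseline window
    and the window containing the anomalous top-line movement; segment_share
    is the fraction of the parent's data points (e.g., impressions) that the
    segment accounts for in the current window.
    """
    # Factor 1: segment size.
    size = segment_share

    # Factor 2: "unexpectedness" -- how much the segment's relative change
    # between the two windows deviates from its parent's relative change.
    seg_change = (seg_current.mean() - seg_baseline.mean()) / max(abs(seg_baseline.mean()), 1e-9)
    par_change = (parent_current.mean() - parent_baseline.mean()) / max(abs(parent_baseline.mean()), 1e-9)
    unexpectedness = abs(seg_change - par_change)

    # Factor 3: how closely the segment tracks its parent in the current window.
    correlation = abs(np.corrcoef(seg_current, parent_current)[0, 1])

    w_size, w_unexp, w_corr = weights
    return w_size * size + w_unexp * unexpectedness + w_corr * correlation
```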
Figure 3: Analyzing a metric segment.
Note that for each metric segment (i.e., each node in the tree) you need to select enough data to calculate all the factors. Many OLAP databases support SQL features (e.g., GROUP BY ROLLUP) that can fetch the data for all metric segments in a single query. Once the segment tree is constructed, you can also choose to drill down starting from any metric segment as the top-line.
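For instance, a single ROLLUP query can return one row per metric segment across every level of the dimension hierarchy, including the all-dimensions-rolled-up row for the top-line metric. The sketch below uses DuckDB on a toy DataFrame purely to keep the example runnable; the exact ROLLUP syntax varies slightly across OLAP engines:

```python
import duckdb
import pandas as pd

# Toy aggregates; in practice this data would live in your OLAP store.
video_events = pd.DataFrame({
    "country":     ["US", "US", "US", "FR", "FR"],
    "device_type": ["iOS", "iOS", "Android", "iOS", "Android"],
    "pin_type":    ["video", "image", "video", "video", "image"],
    "views":       [120, 40, 80, 25, 10],
    "impressions": [400, 200, 350, 90, 60],
})

# ROLLUP returns one row per metric segment along the chosen dimension
# hierarchy; NULL dimension values mark the rolled-up ("all") levels,
# with the all-NULL row being the top-line metric.
segments = duckdb.sql("""
    SELECT country, device_type, pin_type,
           SUM(views)       AS views,
           SUM(impressions) AS impressions,
           SUM(views) * 1.0 / SUM(impressions) AS view_rate
    FROM video_events
    GROUP BY ROLLUP (country, device_type, pin_type)
""").df()
print(segments)
```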
Lastly, note that the tree structure implies an order or hierarchy in the dimensions we are slicing each time. While some dimensions can indeed relate to one another in clear hierarchical order (e.g., dimensions country and state), others cannot (e.g., dimensions country and device type). Look at it this way: if this drill-down investigation were manual, the investigator would still have to choose an order of dimensions to slice along each time, from context or experience. The hierarchy in the tree structure captures that.
General Similarity
In this approach, we look for clues about why a metric movement happened by scanning through other metrics and finding ones that moved very "similarly" in the same time period, whether in the same direction (positive association) or in the opposite direction (negative association).
Figure 4: Positive and negative associations between metrics.
To measure the similarity of metric movements, we use a synthesis of four different factors (sketched in code after the list):
- Pearson correlation: measures the strength of the linear relationship between two time-series
- Spearman’s rank correlation: measures the strength of the monotonic relationship (not just linear) between two time-series; in some cases, this is more robust than Pearson’s correlation
- Euclidean similarity: outputs a similarity measure based on inverting the Euclidean distance between the two (standardized) time-series at each time point
- Dynamic time warping: while the above three factors measure similarities between two time-series in time windows of the same length (usually the same time window), this factor supports comparing metrics from time windows of different lengths, based on the distance along the path on which the two time-series best align
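Here is a rough sketch of how these four factors could be computed for a pair of time-series, using SciPy for the two correlations and a small textbook dynamic-time-warping implementation; the normalization and weighting needed to combine them into one similarity score are left out:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def euclidean_similarity(x, y):
    """Similarity from the Euclidean distance between standardized series."""
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    return 1.0 / (1.0 + np.linalg.norm(xs - ys))

def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping distance."""
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def similarity_factors(x, y):
    pearson_r, pearson_p = pearsonr(x, y)
    spearman_r, spearman_p = spearmanr(x, y)
    return {
        "pearson": (pearson_r, pearson_p),     # linear association + p-value
        "spearman": (spearman_r, spearman_p),  # monotonic association + p-value
        "euclidean_similarity": euclidean_similarity(x, y),
        "dtw_distance": dtw_distance(x, y),    # smaller = more similar
    }

# Example: a latency metric vs. the share of video Pins shown (hypothetical data).
latency = np.array([100, 102, 101, 108, 115, 117, 116], dtype=float)
video_share = np.array([0.20, 0.21, 0.20, 0.26, 0.31, 0.32, 0.31])
print(similarity_factors(latency, video_share))
```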
In practice, we have found that the first two factors, Pearson and Spearman’s rank correlations, work best because:
- p-values can be computed for both, which help to gauge statistical significance
- both have more natural support for measuring negative associations between two time-series
- non-monotonic (e.g., quadratic) relationships, which Pearson and Spearman's rank correlations cannot capture, have so far rarely arisen in our use-cases and time windows of analysis
At Pinterest, one of the notable uses for this RCA functionality has been to discover the relationship between performance metrics and content distribution. Some types of Pins are more "expensive" to display, resource-wise, than others (e.g., video Pins are more expensive than static image Pins), so could it be that the latency users experienced increased because they saw more expensive Pins and fewer inexpensive ones as they scrolled through their home feed or search feed? RCA has provided the initial statistical signals that performance regressions and content shifts could indeed be linked, motivating further investigations to estimate the exact causal effects.
Figure 5: Content shifts and latency.
It’s important to keep in mind that this RCA approach is based on analyzing correlations and distances, which do not imply causation. The stronger statistical evidence for causation is of course established through experiments, which we will turn our attention to next.
Experiment Effects
This third approach looks for clues about why metric movements happened by turning to something that a lot of internet companies have: experiments.
An experiment performs A/B testing to estimate the effect of a new feature. In an experiment, a portion of the users are randomly assigned to either a control or a treatment group, and the ones in the treatment group experience a new feature (e.g., a new recommendation algorithm). The experimenter sees if there is a statistically significant difference in some key metrics (e.g., increased user engagement) between the control and the treatment group.
In RCA, we perform the above in reverse: given a metric, we want to see which experiments have shifted that metric the most, whether intended or not.
Figure 6: RCA and experiments.
Each user request to RCA specifies the metric, segment, and time window the user is interested in. RCA then calculates each experiment's impact on that metric segment over the time window and ranks the top experiments by impact. The calculation and ranking are carried out dynamically per user request rather than in a pre-computation pipeline (although the process may rely on some pre-aggregated data); this supports analyzing impacts for as many metrics as possible, often on an ad-hoc basis, without a systematic increase in computation or storage cost.
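A simplified sketch of that per-request flow might look like the following; the `estimate_impact` callable and the experiment metadata are hypothetical stand-ins for Pinterest-internal systems:

```python
from dataclasses import dataclass

@dataclass
class ExperimentImpact:
    experiment_id: str
    impact: float   # estimated effect of the experiment on the metric segment
    p_value: float  # significance of that effect

def rank_experiments(metric, segment, start, end, experiments,
                     estimate_impact, p_value_threshold=0.05, top_k=20):
    """Rank experiments by their estimated impact on one metric segment.

    `experiments` is the list of experiments active in [start, end], and
    `estimate_impact` is a callable (hypothetical here) that returns an
    ExperimentImpact computed from pre-aggregated per-group metric data.
    """
    impacts = []
    for exp in experiments:
        result = estimate_impact(exp, metric, segment, start, end)
        if result is not None and result.p_value < p_value_threshold:
            impacts.append(result)
    # Largest absolute impact first, whether the shift was intended or not.
    impacts.sort(key=lambda r: abs(r.impact), reverse=True)
    return impacts[:top_k]
```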
Figure 7: RCA Experiment Effects workflow.
For each control and treatment group pair in an experiment, we perform a Welch's t-test on the treatment effect, which is robust in the sense that it supports unequal variances between the control and treatment groups. To further combat noise in the results, we filter experiments by the harmonic mean p-value of each experiment's daily treatment effects over the given time period, which helps limit the false positive rate. We also detect imbalances between control and treatment group sizes (i.e., when they are being ramped up at different rates) and filter out those cases.
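To make those filters concrete, here is a rough sketch using SciPy. Note two simplifications: the harmonic mean below omits the asymptotic correction of the full harmonic-mean p-value procedure, and the imbalance check is a simple ratio guardrail rather than our exact rule:

```python
import numpy as np
from scipy.stats import ttest_ind

def daily_welch_pvalues(control_daily, treatment_daily):
    """Welch's t-test (unequal variances) per day; each element of the inputs
    is an array of per-user metric values for that day."""
    return [
        ttest_ind(t, c, equal_var=False).pvalue
        for c, t in zip(control_daily, treatment_daily)
    ]

def harmonic_mean_pvalue(pvalues):
    """Harmonic mean of the daily p-values, used as a noise filter."""
    p = np.asarray(pvalues, dtype=float)
    return len(p) / np.sum(1.0 / p)

def groups_imbalanced(n_control, n_treatment, tolerance=0.1):
    """Flag experiments whose control/treatment sizes diverge by more than
    `tolerance` (e.g., because the groups were ramped up at different rates)."""
    larger = max(n_control, n_treatment)
    return abs(n_control - n_treatment) / larger > tolerance

# Example with synthetic data for a 3-day window.
rng = np.random.default_rng(0)
control = [rng.normal(100, 10, 5000) for _ in range(3)]
treatment = [rng.normal(101, 10, 5000) for _ in range(3)]
pvals = daily_welch_pvalues(control, treatment)
keep = harmonic_mean_pvalue(pvals) < 0.05 and not groups_imbalanced(15000, 15200)
```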
We have integrated RCA Experiment Effects with the experimentation platform at Pinterest. With extensive application-level caching, as well as some query optimizations, we are able to have RCA dynamically find the top experiments affecting all metrics covered by the experimentation platform — close to 2000 of them at the time of writing, including a variety of system, user engagement, and trust and safety metrics.
Using It All Together
All three RCA services can be used together iteratively, as illustrated below.
Figure 8: Using all RCA services together.
Next Steps
What we presented here are just three approaches to narrowing down the search space of root causes of metric movements. There are other ways of doing this, which we will explore and add as demand arises.
For analytics tools like anomaly detection or root-cause analysis, the results are often mere suggestions for users who may not have a clear idea of the algorithms involved or how to tune them. Therefore, it would be helpful to have an effective feedback mechanism in which users can label the results as helpful or not, with that feedback automatically taken into account by the algorithms going forward.
Another potential area of improvement that we are looking into is leveraging causal discovery to learn the causal relationships between different metrics. This would hopefully provide richer statistical evidence for causality with less noise, compared to the current RCA General Similarity.
As we improve the RCA services’ algorithms, we would also like to integrate them with more data platforms within Pinterest and make RCA readily accessible through the platforms’ respective web UIs. For example, we are exploring integrating RCA into the data exploration and visualization platforms at Pinterest.
Acknowledgments
We are incredibly grateful to the engineers and data scientists at Pinterest, who have been enthusiastic in trying and adopting the different RCA services and offering their valuable feedback.