Web Performance Regression Detection (Part 2 of 3)

Michelle Vu | Web Performance Engineer

Fighting regressions has been a priority at Pinterest for many years. In part one of this article series, we provided an overview of the performance program at Pinterest. In this second part, we discuss how we monitor and investigate regressions in our Pinner Wait Time and Core Web Vital metrics for desktop and mobile web using real time metrics from real users. These real time graphs have been invaluable for regression alerting and root cause analysis.

Alerts

Previously, our alerts and Jira tickets were based on a seven-day moving average of daily aggregations. Migrating our alerts and regression investigation process to our real time graphs paved the way for faster resolution of regressions for a few reasons:

  1. Immediately available data with more granular time intervals means regressions are detected more quickly and accurately.
  • More granular time intervals allow us to see spikes more clearly, as they typically occur over the short time span it takes for an internal change to roll out (usually less than 30 minutes).
  • Additionally, regressions are easier to detect when the previous two weeks of data are used as a comparison baseline. Spikes and dips from normal daily and weekly patterns don’t trigger alerts, as the delta between the current value and the same period in previous weeks doesn’t change. An alert only triggers when a regression spikes beyond the max value from the previous two weeks for that same time of day and day of the week. Warning alerts are triggered after the regression is sustained for 30 minutes, while critical alerts, accompanied by a Jira ticket, are triggered after the regression is sustained for several hours (this thresholding is sketched in code after the list below).

Figure 2: Alerts are based on the delta between the current value and the max value from the previous two weeks for that same time period

2. A clear start time for the regression significantly increases the likelihood of finding the root cause (more details on this below under “Root Cause Analysis”).

3. It is much easier to revert or alter the offending change right after it ships. Once a change has been out for a while, other code builds on top of it, which can make reverts or alterations trickier.
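
To make the alerting rule concrete, here is a minimal sketch of the thresholding described in item 1, assuming the metric series is bucketed at a fixed interval so timestamps align week over week. The function names and the exact “several hours” threshold are illustrative, not our production implementation.

```typescript
// Hypothetical sketch of the alerting rule: compare the current value against the
// max value at the same time of day and day of week from the previous two weeks.
type Series = Map<number, number>; // timestamp (ms, fixed-interval buckets) -> metric value

const MINUTE = 60_000;
const WEEK = 7 * 24 * 60 * MINUTE;

// Baseline: max of the values one and two weeks ago at the same time slot.
function baselineMax(series: Series, now: number): number {
  const prior = [series.get(now - WEEK), series.get(now - 2 * WEEK)]
    .filter((v): v is number => v !== undefined);
  return prior.length ? Math.max(...prior) : Infinity; // no baseline -> never alert
}

function isRegressed(series: Series, now: number): boolean {
  const current = series.get(now);
  return current !== undefined && current > baselineMax(series, now);
}

// Warning after 30 sustained minutes; critical (plus a Jira ticket) after several
// hours (approximated here as 4 hours; an assumption for this sketch).
// intervalMs must match the series bucket size.
function alertLevel(series: Series, now: number, intervalMs = 5 * MINUTE): "ok" | "warning" | "critical" {
  let sustainedMs = 0;
  for (let t = now; isRegressed(series, t); t -= intervalMs) {
    sustainedMs += intervalMs;
  }
  if (sustainedMs >= 4 * 60 * MINUTE) return "critical";
  if (sustainedMs >= 30 * MINUTE) return "warning";
  return "ok";
}
```

Because the baseline is the same time slot from the prior two weeks, normal daily and weekly cycles cancel out rather than triggering alerts.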

Root Cause Analysis

For regressions, our real time graphs have been pivotal in root cause analysis, as they enable us to narrow the start time of a production regression down to the minute.

Our monitoring dashboard is built to be a live investigation runbook, progressing the investigator from Initial Investigation steps (done by the surface-owning team) to an Advanced Investigation (done by the Performance team).

Initial Investigations

Steps for the Initial Investigation include:

  1. Check if there are any other surfaces that started regressing at the same time (any app-wide regression investigations are escalated to the Advanced Investigation phase done by the Performance team)
  2. Identify the start time of the regression
  3. Check deploys and experiments that line up to the start time of the regression

Determining the exact start time of the regression cuts down on the possible internal changes that could cause the regression. Without this key piece of information, the likelihood of root-causing the regression drops significantly as the list of commits, experiment changes, and other types of internal changes can become overwhelming.

Internal changes are overlaid on the x-axis, allowing us to identify whether a deploy, experiment ramp, or other type of internal change lines up with the exact start time of the regression:

Figure 3: Internal changes are displayed on the x-axis, making it easy to see which changes occurred at the start time of the regression
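
Conceptually, lining changes up with the regression amounts to a time-window filter over the internal change log. A hypothetical sketch, with illustrative data shapes:

```typescript
// Hypothetical sketch: given the regression start time (known to the minute), list
// the internal changes whose rollout window overlaps it. Shapes are illustrative.
interface InternalChange {
  kind: "deploy" | "experiment_ramp" | "config";
  name: string;
  startedAt: number; // epoch ms when the change began rolling out
  rolloutMs: number; // how long the rollout took (usually under 30 minutes)
}

function candidateChanges(
  changes: InternalChange[],
  regressionStart: number,
  toleranceMs: number = 15 * 60_000, // slack for logging and aggregation lag
): InternalChange[] {
  return changes.filter(
    (c) =>
      c.startedAt <= regressionStart + toleranceMs &&
      c.startedAt + c.rolloutMs >= regressionStart - toleranceMs,
  );
}
```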

Knowing the start time of the regression is often sufficient for identifying the root cause. Typically the regression is due to either a web deploy or an experiment ramp. If it’s due to a web deploy, the investigator looks through the deployed commits for anything affecting the regressed surface or a common component. Generally, the list of commits in a single deploy is short, as we deploy continuously and ship 9–10 deploys a day.

Occasionally, it is difficult to identify which internal change caused the regression, especially when a large number of internal changes occurred at the same time as the regression (for example, an unusually large deploy after a code freeze or after deploys were blocked due to an issue). In these situations, the investigation is escalated to the Performance team’s on-call, who will conduct an Advanced Investigation.

Advanced Investigations

Investigating submetrics and noting all the symptoms of the regression helps to narrow down the type of change that caused the regression. The submetrics we monitor include homegrown stats as well as data from most of the standardized web APIs related to performance.
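
As a rough illustration of where the standardized submetrics come from, the sketch below collects a few of them in the browser with PerformanceObserver; the logging endpoint and payload shape are assumptions, not our actual pipeline.

```typescript
// Illustrative client-side collection of standardized performance submetrics.
// The "/perf-log" endpoint and payload shape are hypothetical.
function logMetric(name: string, payload: Record<string, unknown>): void {
  navigator.sendBeacon("/perf-log", JSON.stringify({ name, ...payload }));
}

// Largest Contentful Paint (a Core Web Vital): log the latest candidate's time.
new PerformanceObserver((list) => {
  const entries = list.getEntries();
  logMetric("lcp", { value: entries[entries.length - 1].startTime });
}).observe({ type: "largest-contentful-paint", buffered: true });

// Long tasks that block the main thread.
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    logMetric("long-task", { start: entry.startTime, duration: entry.duration });
  }
}).observe({ type: "longtask", buffered: true });

// Resource timings, aggregated later by request type (image, script, css, ...).
new PerformanceObserver((list) => {
  for (const entry of list.getEntries() as PerformanceResourceTiming[]) {
    logMetric("resource", {
      url: entry.name,
      initiatorType: entry.initiatorType,
      duration: entry.duration,
      transferSize: entry.transferSize,
    });
  }
}).observe({ type: "resource", buffered: true });
```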

Steps for the Advanced Investigation include:

  1. Check for changes in log volume and content distribution

Figure 4: In an Advanced Investigation, we first double-check surface volume metrics to identify whether log volume or content distribution changes are causing the regression

2. Determine where in the critical path the regression is starting

Figure 5: The next step is to check the surface timing metrics, which can show where in the critical path the regression is starting

3. Check for changes in network requests

Figure 6: We also check if there are any changes in network requests that may indicate the source of the regression

The real time investigation dashboard shown in the above images is limited to our most useful graphs. Depending on the findings from the above steps, the Performance team may check additional metrics kept in an internal Performance team dashboard, but most of those metrics (e.g., memory usage, long tasks, server middleware timings, and page size) are used more often for other types of performance analysis.

Last year we added two new types of metrics that have been invaluable in regression investigations for several migration projects:

HTML Streaming Timings

Most of our initial page loads are done through server-side rendering with the HTML streamed out in chunks as they are ready. We instrumented timings for when critical chunks of HTML, such as important script tags, preload tags, and the LCP image tag, are yielded from the server. These timings helped root cause several regressions in 2023 when changes were made to our server rendering process.
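
As an illustration of this kind of instrumentation, the sketch below wraps a Node.js render stream and records when the first chunk containing each critical marker is flushed. The marker patterns and function names are hypothetical, and for simplicity the sketch ignores markers split across chunk boundaries.

```typescript
// Hypothetical sketch: record how long after render start each critical HTML
// marker (e.g. the LCP image preload tag) is yielded from the server.
import { Transform, TransformCallback } from "stream";

const MARKERS: Record<string, RegExp> = {
  lcpPreload: /<link[^>]+rel="preload"[^>]+as="image"/, // preload tag for the hero image
  mainScript: /<script[^>]+src="[^"]*main[^"]*\.js"/,   // an important script tag
};

export function instrumentHtmlStream(onTiming: (marker: string, ms: number) => void): Transform {
  const start = Date.now();
  const seen = new Set<string>();
  return new Transform({
    transform(chunk: Buffer, _enc: BufferEncoding, callback: TransformCallback) {
      const html = chunk.toString("utf8");
      for (const [marker, pattern] of Object.entries(MARKERS)) {
        if (!seen.has(marker) && pattern.test(html)) {
          seen.add(marker);
          onTiming(marker, Date.now() - start); // time from render start to chunk yield
        }
      }
      callback(null, chunk); // pass the chunk through unmodified
    },
  });
}

// Usage (hypothetical): renderStream.pipe(instrumentHtmlStream(logTiming)).pipe(response)
```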

For example, we ran an experiment testing out web streams, which significantly changed the number of HTML chunks yielded and how the HTML was streamed. As a result, we saw that the preload link tag for the LCP image was streamed out earlier than in the other treatment (this is just an example of the analysis conducted; we did not ship the web streams treatment):

Figure 7: Real time metrics timing how long it took to stream out the HTML chunk containing the preload tag for the hero image for the different streaming treatments tested

Network Congestion Timings

We had critical path timings on the server and client, as well as aggregations of network requests (request count, size, and duration) by request type (image, video, XHR, CSS, and script), but we didn’t have visibility into when network requests were starting and ending.

This led us to instrument Network Congestion Timings. For all the requests that occur during our Pinner Wait Time, we log when batches of requests start and end (sketched in code after the list below). For example, we log the time when:

  • The 1st script request starts
  • 25% of script requests are in flight
  • 50% of script requests are in flight
  • 25% of script requests completed
  • 50% of script requests completed
  • etc.
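
As an illustration, these batch milestones can be derived on the client from Resource Timing entries; the helper below and the choice of percentiles are assumptions for the sketch.

```typescript
// Hypothetical sketch: derive network congestion milestones for one request type
// (e.g. scripts) from Resource Timing entries collected during the page load.
function congestionMilestones(requestType: string): Record<string, number> {
  const entries = (performance.getEntriesByType("resource") as PerformanceResourceTiming[])
    .filter((e) => e.initiatorType === requestType);
  if (entries.length === 0) return {};

  const starts = entries.map((e) => e.startTime).sort((a, b) => a - b);
  const ends = entries.map((e) => e.responseEnd).sort((a, b) => a - b);
  // Time by which the given fraction of requests has started (or completed).
  const at = (sorted: number[], fraction: number): number =>
    sorted[Math.min(sorted.length - 1, Math.ceil(fraction * sorted.length) - 1)];

  return {
    firstStart: starts[0],          // 1st request starts
    start25: at(starts, 0.25),      // 25% of requests are in flight
    start50: at(starts, 0.5),       // 50% of requests are in flight
    complete25: at(ends, 0.25),     // 25% of requests have completed
    complete50: at(ends, 0.5),      // 50% of requests have completed
    lastComplete: ends[ends.length - 1],
  };
}

// Example: comparing congestionMilestones("script").complete25 against the LCP
// preload request's responseEnd shows whether scripts finish before that image.
```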

This has been invaluable in root-causing many regressions, including ones in which:

  • The preload request for the LCP image is delayed
  • Script requests start before the LCP preload request finishes, which we found to be correlated with the LCP image taking longer to load
  • Script requests complete earlier, which can cause long compilation tasks to start
  • Changes in other image requests starting or completing earlier or later

Figure 8: Real time metrics timing how long it took for 25% of script requests to finish vs. how long it took for the preload image request to finish

These metrics, along with other real time submetrics, have been helpful in investigating tricky experiment regressions whose root cause is not obvious from just the default performance metrics shown in our experiment dashboards. By updating our logs to tag the experiment and experiment treatment, we can compare the experiment groups for any of our real time submetrics.
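
For example, one hypothetical shape for that tagging is to attach the session’s active experiment treatments to each submetric log:

```typescript
// Hypothetical sketch: tag each real time submetric log with experiment treatments
// so submetrics can be grouped and compared across experiment groups.
interface SubmetricLog {
  metric: string;
  value: number;
  surface: string;
  experiments: Record<string, string>; // experiment name -> treatment (e.g. "enabled")
}

function logSubmetric(
  metric: string,
  value: number,
  surface: string,
  activeExperiments: Record<string, string>,
): void {
  const payload: SubmetricLog = { metric, value, surface, experiments: activeExperiments };
  navigator.sendBeacon("/perf-log", JSON.stringify(payload)); // hypothetical endpoint
}
```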

When the Performance team was created, we relied on daily aggregations of our performance metrics to detect web regressions. Investigating these regressions was difficult, as we did not have many submetrics and often could not pinpoint the root cause when hundreds of internal changes were made daily. Keeping our eye on Pinner Wait Times (PWTs) and Core Web Vitals (CWVs) as top-level metrics while adding supplementary, actionable metrics, such as HTML streaming timings, helped make investigations more efficient and successful. Additionally, shifting our alerting and investigation process to real time graphs and continually homing in on the most useful submetrics has drastically increased the success rate of root-causing and resolving regressions. These real time, real user monitoring graphs have been instrumental in catching regressions released in production. In the next article, we will dive into how we catch regressions before they are fully released in production, which decreases investigation time, further increases the likelihood of resolution, and prevents user impact.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.
