Behind Viewer Retention Analytics at Scale

By [Zeev Feldbeine](https://medium.com/@ZeevFeldbeine)

Making videos is hard! Analyzing viewer retention and engagement for your videos is even harder. At Vimeo, we handle well over a billion videos of varying lengths and diverse viewing patterns, constantly growing in scale and complexity. Providing viewer retention analytics and AI-driven insights instantly, at second-level granularity, for any timeframe (historical, monthly, daily) and any video length is crucial.

In this post I outline Vimeo’s architecture and approach for delivering viewer retention analytics with millisecond-level latency, discuss how we leverage AI for actionable insights, and highlight key challenges faced at scale.

What is viewer retention and why does it matter?

Think of viewer retention as the measurement of a video’s ability to hold attention. This metric reveals the extent to which the audience engages with your video, helping to understand exactly which moments keep them hooked, or even cause them to leave.

At Vimeo, we visualize viewer retention as a simple XY graph (see Figure 1). The x-axis represents the video’s progression in seconds, while the y-axis indicates the number of views per second. You can even break down viewer retention data by specific users, dimensions (such as region, device, or domain), or particular time periods.

Figure 1. Viewer retention sample with retention going down as the video progresses. Note the AI insights box in the upper right.

To complicate this even more, if a viewer replays the same second multiple times within one session, each replay is counted; the same logic captures behaviors like loops, scrubs, and skips.

Understanding these patterns is crucial for optimizing video content, interpreting audience engagement, and identifying key moments.

What problem are we trying to solve?

Viewer retention analytics in the big data world is challenging — especially when dealing with granular, continuously evolving session data for long-form videos. Precisely capturing detailed viewer actions (skips, scrubs, repeats, drops) in an append-only OLAP storage like ClickHouse becomes increasingly difficult as data volumes grow. The key question is: how can we achieve sub-second query latency while tracking real-time, fine-grained events across historical and current data?

This breaks down into three core challenges:

  • Capturing granular user actions (skips, repeats, scrubs, drops) within video sessions.
  • Delivering sub-second query latencies at any scale, timeframe, or granularity in real time.
  • Deriving and recommending actionable viewer retention insights with AI. Summarizing or fully understanding a complex viewer retention graph is a challenging task in itself.

From raw events

Before I jump into all the cool stuff, it’s important to understand what data we are actually collecting and using for viewer retention. I will be brief.

Video analytics are structured around sessions. Each session represents a single user’s video viewing activity characterized by dimensions such as region, device, source domain, and so on. Each session comprises micro-events like video load, play, periodic heartbeats, and user actions.

Heartbeats are sent every 10 seconds to track continuous viewing; interruptions like skips, pauses, and scrubs trigger partial heartbeats and resume from the new position in the video. This is all it takes to build viewer retention.

But does it tell the whole retention story?

Not quite. To accurately record user behavior, each heartbeat includes two essential indices: the previous and end positions in the video, in seconds. Consider the following example of a 30-second video with a heartbeat frequency of 10 seconds.

Here’s the breakdown for continuous viewing:

Heartbeat1: previous_index=0, end_index=10
Heartbeat2: previous_index=10, end_index=20
Heartbeat3: previous_index=20, end_index=30

But for interrupted viewing, where the user watches until second 7 and then scrubs to second 25:

Heartbeat1: previous_index=0, end_index=7
Heartbeat2: previous_index=25, end_index=30

This user viewed only 12 seconds of actual content, skipping the majority of the video. Capturing data at this granularity ensures accurate, real-time aggregation of metrics across sessions, sub-sessions, and any dimension — without waiting for sessions to conclude or risking data loss.

Storage and data modeling with ClickHouse

As I explained in my previous post, we store analytical events in an open-source, append-only OLAP database — ClickHouse. However, traditional approaches using updates or overwrites to track views per second per index aren’t efficient or scalable. Instead, we store changes in views per index rather than absolute counts. This granular approach ensures better accuracy, consistency, and flexibility without the overhead of updates or overwrites.

But how exactly do we capture real-time viewer changes per index? Leveraging the heartbeat structure outlined earlier, we set a +1 for the previous_index and a -1 for the end_index. Summing these changes at each second accurately captures real-time viewing behavior across all sessions.

For example, with continuous viewing:

H1: [previous_index=0, +1], [end_index=10, -1]
H2: [previous_index=10, +1], [end_index=20, -1]
H3: [previous_index=20, +1], [end_index=30, -1]

Summing these changes per index yields the correct viewing activity: one view from second 0 through 30. In greater detail, the per-index calculation looks like the following:

Index = 0, views = H1(+1) = +1
Index = 10, views = H1(-1) + H2(+1) = 0
Index = 20, views = H2(-1) + H3(+1) = 0
Index = 30, views = H3(-1) = -1

Which tells you that a user started viewing from the beginning (the 0s mark), watched all the way through, and left at the end (the 30s mark) of the video.

But with interrupted viewing, where the user gets to the 7s mark and then scrubs to 25s:

H1: [previous_index=0, +1], [end_index=7, -1]
H2: [previous_index=25, +1], [end_index=30, -1]

This accurately reflects the viewer skipping the 7–25s segment. Try this aggregation exercise yourself to see; a quick sketch follows.
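Here is a minimal Python sketch that does the exercise (illustrative only, not Vimeo’s production code): it rebuilds the views-per-second curve from heartbeat pairs using the same +1/-1 deltas described above.

from collections import defaultdict
from itertools import accumulate

def retention_from_heartbeats(heartbeats, video_length):
    # heartbeats: list of (previous_index, end_index) pairs, in seconds
    deltas = defaultdict(int)
    for previous_index, end_index in heartbeats:
        deltas[previous_index] += 1   # viewer enters at this second
        deltas[end_index] -= 1        # viewer leaves at this second
    # A running (prefix) sum over every second yields the views-per-second curve.
    return list(accumulate(deltas[i] for i in range(video_length + 1)))

# Interrupted viewing: watch 0-7s, scrub to 25s, watch to the end (30s).
views = retention_from_heartbeats([(0, 7), (25, 30)], video_length=30)
print(views[5], views[15], views[27])  # 1 0 1: watching, skipped segment, watching

The skipped 7–25s segment shows up as zero views, which is exactly what the ClickHouse aggregation reports.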

Keep in mind that this logging structure treats each heartbeat or action independently, minimizing coupling and maximizing flexibility when aggregating across the billions of micro-events that arrive daily.

We store these events in an AggregatingMergeTree (in our case replicated since we have a distributed cluster; see my earlier post), ordered by dimensions and granular time indices, aggregating changes efficiently. You can further enhance the model by adding additional dimensions or materialized views for more granular analytics.

The following is a null table for ingestion, used by materialized views. Remember, null tables don’t store any data:

CREATE TABLE vimeo.viewer_retention_null_table ON CLUSTER analytics
(
    video_owner_id UInt32,
    video_id UInt32,
    session_id UInt32,
    session_ts UInt64,
    end_index Nullable(UInt32),
    previous_index Nullable(UInt32),
    country Nullable(String),
    device_type Nullable(String),
    referer Nullable(String),
    .....
)
ENGINE = Null;

Here’s the definition of the aggregated and distributed table by the owner, video, time, and second indices in the video:

CREATE TABLE vimeo.viewer_retention_local ON CLUSTER analytics
(
    owner_id UInt32 CODEC(ZSTD(1)),
    video_id UInt32 CODEC(ZSTD(1)),
    session_ts DateTime CODEC(Delta, ZSTD(1)),
    timeIndex UInt32 CODEC(ZSTD(1)),
    changeInViewers SimpleAggregateFunction(sum, Int64) CODEC(ZSTD(1))
)
ENGINE = ReplicatedAggregatingMergeTree
PARTITION BY toYYYYMM(session_ts)
PRIMARY KEY (toStartOfDay(session_ts), owner_id, video_id)
ORDER BY (toStartOfDay(session_ts), owner_id, video_id, session_ts, timeIndex)
TTL session_ts + toIntervalYear(1) TO VOLUME 'slow'
SETTINGS index_granularity = 8192;

CREATE TABLE vimeo.viewer_retention ON CLUSTER analytics AS vimeo.viewer_retention_local
ENGINE = Distributed(analytics, vimeo, viewer_retention_local, cityHash64(owner_id, toStartOfWeek(session_ts)));

Notice that we aggregate only the change in viewers at each second of the video: the net positive or negative change in views at that second.

The following materialized view is where the real magic of viewer retention ingestion happens:

CREATE MATERIALIZED VIEW vimeo.viewer_retention_mv ON CLUSTER analytics TO vimeo.viewer_retention
AS
SELECT
    owner_id,
    video_id,
    session_ts,
    timeIndex_pair.1 AS timeIndex,
    SUM(timeIndex_pair.2) AS changeInViewers
FROM
(
    SELECT
        video_owner_id AS owner_id,
        video_id AS video_id,
        toStartOfHour(fromUnixTimestamp(toUInt64(session_ts))) AS session_ts,
        arrayJoin([(coalesce(previous_index, 0), 1), (coalesce(end_index, 0), -1)]) AS timeIndex_pair
    FROM vimeo.viewer_retention_null_table
)
GROUP BY owner_id, video_id, session_ts, timeIndex
HAVING changeInViewers != 0;

As in the previous example, notice we attach +1 to the previous_index, and -1 to the end_index for the heartbeat log within the arrayJoin function. This arrayJoin function expands each heartbeat row into two rows for start and end indices.

The query transforms viewer heartbeat data from a wide to a tall format, enabling us to aggregate and calculate the net change in viewers at each time index, and additionally to filter out events with no net index impact (keeping only rows where the change in viewers isn’t 0) to reduce the table size.

But how do you query and get viewer retention?

Leveraging ClickHouse, computing viewer retention becomes straightforward: first, aggregate the changes in views at each time index, and then apply a window aggregation to sum all preceding changes. This accurately computes total views at every second by accumulating previous increments and decrements.

Just as past experiences shape our present, each time index reflects all prior viewing events, resulting in precise, real-time viewer retention insights.

The following is the SQL to derive the above:

SELECT
    timeIndex,
    sum(change_sum) OVER (ORDER BY timeIndex ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS views -- window function over the preceding view changes
FROM
(
    SELECT
        timeIndex,
        sum(changeInViewers) AS change_sum
    FROM vimeo.viewer_retention
    WHERE (owner_id = 1234) AND (video_id = 64543)
      AND (session_ts >= toDateTime('2024-04-01 00:00:00'))
      AND (session_ts <= toDateTime('2025-04-01 23:59:59'))
    GROUP BY timeIndex
    HAVING change_sum != 0
    ORDER BY timeIndex ASC
)

As you might have noticed, this involves a small subquery. While you can simplify it using ClickHouse’s FINAL, it’s best practice to avoid FINAL unless necessary due to performance overhead.

The final query provides accurate viewer retention — the total views per second. (Feel free to verify the earlier examples against this query to ensure they align correctly.)

All of this is very cool, but where is the AI?!

The issue with complex analytics and graphs such as viewer retention is that they’re useful mostly for n̶e̶r̶d̶s̶ insiders. But what about fast, actionable insights of the kind the best data analysts, armed with advanced knowledge of statistical patterns, would derive, especially for long videos or complex viewing patterns? This is exactly where LLM-derived insights shine. See Figure 2 for the architecture.

Figure 2. Architecture of the flow from getting the data to calling the LLM API with data processing.

But is this it?

Feeding viewer retention data directly into an LLM API at the query level, meaning each analytics request calls the LLM separately, can quickly become expensive, inconsistent, and slow. Here’s how we optimized our solution to achieve reliable performance at scale.

Data pre-processing

To minimize input tokens and improve efficiency, especially for long videos, we avoid feeding granular second-by-second retention data directly into the LLM. Instead, we apply two key techniques:

  • Window averaging. For longer videos, we group time indices and use their average values, smoothing out noise and fluctuations.
  • Run-Length Encoding (RLE). If there’s no change in views between consecutive index groups, why include redundant data points? Applying RLE collapses these flat segments, keeping only the meaningful changes in the graph.

These approaches significantly reduce data size and lower token usage, latency, and cost without sacrificing accuracy.
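As a concrete illustration, here’s a minimal Python sketch of both steps, assuming the retention curve arrives as a plain list of views per second (illustrative only, not our production pipeline):

def window_average(views, window=10):
    # Group time indices into fixed-size windows and keep the average of each.
    return [
        round(sum(views[i:i + window]) / len(views[i:i + window]), 1)
        for i in range(0, len(views), window)
    ]

def run_length_encode(points):
    # Keep only the (window_start, value) pairs where the value actually changes.
    encoded, previous = [], None
    for i, value in enumerate(points):
        if value != previous:
            encoded.append((i, value))
            previous = value
    return encoded

views = [100] * 30 + [60] * 60 + [20] * 30            # toy retention curve, one value per second
compact = run_length_encode(window_average(views))    # [(0, 100.0), (3, 60.0), (9, 20.0)]

Twelve windowed points collapse to three, and only those three (plus the window size) need to be serialized into the prompt.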

Prompt engineering: where the magic happens

We all know the real magic of using commoditized LLMs lies in prompt engineering combined with quality data. That’s all it takes to build a powerful AI-driven product.

Our simplified role-playing technique with a structured input prompt looks like this:

Role: Data analyst specializing in video retention analytics at Vimeo.
Task: Pinpoint drop-off timestamps, high retention sections, unusual patterns, and suggest useful optimizations
Data Schema:
....
Metadata:
* Viewer sessions track events like consistent viewing, scrubs, pauses, interruptions, and completions.
* A view is counted as '1' ...
* Leaving or scrubbing to another point marks the timeIndex as '-1' ...
* Views data represents total views for each timeIndex across all sessions, including repeats ...
Data: {}
Considerations:
* Avoid common patterns such as high retention at the beginning and lower retention at the end.
* Focus on specific, non-intuitive patterns or trends for improvement.
* Non-intuitive patterns are those that are not immediately obvious or expected, such as a sudden increase or drop in views at a specific timeIndex.
...
Output Guidelines:
* Be concise and informative.
* Limit response to 50 words or less.

....

As you can see, we explicitly instruct the LLM to skip the obvious and generate a concise summary, 50 words or less, about viewer retention. Each part of the template narrows the scope of the insights and contributes to better results. This focused approach caps the output token size and, with it, the total latency and overall LLM cost.

By applying targeted preprocessing in the form of window averaging and RLE, we significantly narrow and clarify the data context, resulting in a smaller input token count and faster, cheaper, more consistent outputs.
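To make the wiring concrete, here’s a hypothetical Python sketch of how the compacted points could be slotted into the Data section of a prompt template like the one above; the template is abbreviated and all names are illustrative, not our actual code.

PROMPT_TEMPLATE = (
    "Role: Data analyst specializing in video retention analytics at Vimeo.\n"
    "Task: Pinpoint drop-off timestamps, high retention sections, unusual patterns, "
    "and suggest useful optimizations\n"
    "Data: {data}\n"
    "Output Guidelines: Be concise and informative. Limit response to 50 words or less."
)

def build_prompt(compact_points):
    # compact_points: [(window_start, avg_views), ...] produced by the RLE step above
    data = ", ".join(f"{index}:{views}" for index, views in compact_points)
    return PROMPT_TEMPLATE.format(data=data)

print(build_prompt([(0, 100.0), (3, 60.0), (9, 20.0)]))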

Model selection and cost

While prompt engineering is at the heart of any AI-driven solution, choosing the right model is crucial. Smaller models are faster and cheaper but can struggle with complexity, while larger models offer deeper insights at higher costs.

Since we primarily use Google Cloud Platform, we currently leverage Gemini 2.0 Flash for complex viewer retention scenarios and rely on the efficient Gemini 2.0 Flash-Lite for most standard use cases, achieving the best balance of cost, performance, and accuracy.
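As a rough Python sketch, routing between the two models with the Vertex AI SDK might look like the following; the complexity heuristic and function names here are illustrative assumptions, not our actual rules or code.

# Assumes vertexai.init(project=..., location=...) has been called at startup.
from vertexai.generative_models import GenerativeModel

def summarize_retention(prompt: str, num_points: int) -> str:
    # Illustrative heuristic: longer or denser retention curves go to the larger model.
    model_id = "gemini-2.0-flash" if num_points > 200 else "gemini-2.0-flash-lite"
    response = GenerativeModel(model_id).generate_content(prompt)
    return response.text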

Figure 3 demonstrates AI-generated insights based on viewer retention graph processing:

Figure 3. Viewer retention graph with a clear spike in the middle of the video and notable drops at the beginning and the end

The Gemini Lite output looks like this:

After a sharp initial drop, views stabilize from second 20-60. A significant, unexpected views spike occurs around second 390 (6.5 min), suggesting a particularly engaging segment worth analyzing. Retention plummets after second 900 (15 min), indicating a need to re-evaluate later content.

With gratitude

None of this would have been possible without the analytics team, so thank you, everyone, for making it happen.

I could go into even more detail about any aspect of what I’ve covered. Ask me a question in the comments and watch what happens.

Speaking of retention, our team’s retention is extremely high, and you could be part of it. See our open positions.
