Optimizing object storage

A deep dive into Vimeo’s storage strategy for videos.

Tanushree Nori
Vimeo Engineering Blog

--

In today’s data-centric world, smart storage optimization is key, especially for large-scale applications. If left unchecked, storage can quickly become a massive portion of overall operational costs, and given the exponential growth of data libraries, a robust, data-driven strategy is imperative for managing storage at enterprise scale. Here’s our technical deep dive into optimizing Vimeo’s video library on Google Cloud.

Hosting optimization: The three pillars

Hosting optimization largely revolves around three significant components:

  • Optimizing compute. This involves the judicious use of computing resources, selecting appropriate cloud instance types, implementing scaling policies, and optimizing application performance.
  • Optimizing content delivery. This is the strategic positioning of content in edge locations, often combined with intelligent routing to accelerate content delivery while managing costs.
  • Storing objects smartly. Choosing a storage tier based on how frequently data is accessed can have a significant impact on costs. Hot storage, which provides quick data access, is more expensive, while cold storage is cost-effective but offers slower data access.

We zeroed in on the third pillar: storing objects smartly. Total storage costs accounted for a significant chunk of our hosting expenses. It was evident that this should be our initial focus in our quest to cut costs.

Our primary aim was to establish a universally effective lifecycle policy for our source storage buckets. These buckets house source content, which typically isn’t accessed frequently, so retrieval speed and associated costs aren’t significant concerns. To clarify, when we mention source content, we’re referring to the original, raw video data that Vimeo users upload to us.

The predominant costs in this scenario arise from the duration that the data is stored, the volume of data, and specific operations like data writing or transfer. Since the demand for source objects is predictably low, we formulated the challenge as a constrained optimization problem.

But first, what are lifecycle policies?

Simply put, Google Cloud Storage’s lifecycle policies enable automatic management of objects in a bucket based on specified conditions like age, version, and so on. Meeting these conditions triggers actions such as deletion or transitioning objects from warmer, more expensive storage classes to colder, less expensive ones. These age-based lifecycle conditions are what we wanted to control through analysis.

For context, Google Cloud Storage offers four primary storage classes:

  • Standard. This is the warmest and most costly, for frequent access.
  • Nearline. This is for infrequent access, say at an interval of roughly once per month.
  • Coldline. Getting chilly! This is for access at roughly once per quarter.
  • Archive. This is the coldest and least costly, for long-term archiving.

With that, consider the following examples of age-based policies:

  • Videos unedited for 30 days and neither viewed nor opened in the past 15 days transition from standard to nearline storage.
  • Videos over a year old get deleted.

Such rules help us to govern when videos should progressively be moved to cheaper storage classes as they age. Despite their straightforward nature, these rules, derived from optimizing the cost function across a range of constraints, managed to slash our source storage costs by about 60 percent.
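In Google Cloud Storage, rules like these are expressed as a lifecycle configuration attached to the bucket. Here’s a minimal sketch of what such a configuration can look like, written as a Python dict that mirrors the JSON format; the thresholds are illustrative rather than our production values, and the access-recency condition assumes the custom-time metadata pattern described later in this article.

```python
# A lifecycle configuration sketched as a Python dict (illustrative thresholds,
# not our production policy). Dumped to JSON, this is the shape that
# `gsutil lifecycle set policy.json gs://BUCKET_NAME` expects.
import json

lifecycle_policy = {
    "rule": [
        {
            # Month-old objects that haven't been touched recently (tracked via
            # custom-time metadata) move from standard to nearline.
            "action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
            "condition": {
                "age": 30,
                "daysSinceCustomTime": 15,
                "matchesStorageClass": ["STANDARD"],
            },
        },
        {
            # Objects older than a year are deleted.
            "action": {"type": "Delete"},
            "condition": {"age": 365},
        },
    ]
}

with open("policy.json", "w") as f:
    json.dump(lifecycle_policy, f, indent=2)
```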

How did we determine these rules? Read on!

Phase 1: The storage calculator for source storage buckets

Before diving into our optimization challenge, here are the constraints and nuances that informed the mathematical framework:

  • Cost. Colder storage tiers are more cost-effective per gigabyte of storage.
  • Access latency. Colder tiers can have higher retrieval times.
  • Access frequency. Colder tiers typically come with higher retrieval costs, making them unsuitable for frequently accessed data.
  • Availability. Colder tiers provide lower service level agreements, or SLAs.
  • Minimum applicable charges. Colder, cheaper tiers have higher minimum requirements: standard, 0 days; nearline, 30 days; coldline, 90 days; archive, 365 days. This means that, for example, archived content must remain in archive storage for a minimum of 365 days. Moving it out before then incurs costs; more on that a little later in this article.

Given these trade-offs, selecting an optimal storage class for each data object is a complex task. The challenge is to minimize storage costs without compromising data accessibility and performance requirements. Availability concerns are out of scope for this exercise, since we use content delivery networks — CDNs — to optimally cache and deliver videos to our users.

We arrived at the following cost optimization equation (see Figure 1).

Figure 1. The equation calculates cost by summing over four storage classes, factoring in individual storage and retrieval costs, size of the data, its lifespan, and adjustments for minimum required storage days.
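Since the equation itself lives in the image, here’s an illustrative reconstruction of its general shape based on the caption; the notation below is ours and not necessarily the exact formulation in the figure.

$$
\text{Cost} \;\approx\; \sum_{c \,\in\, \{\text{standard},\,\text{nearline},\,\text{coldline},\,\text{archive}\}} \Big[\, S_c \cdot p^{\text{store}}_c \cdot \max(D_c,\, M_c) \;+\; R_c \cdot p^{\text{retrieve}}_c \,\Big]
$$

where, for each storage class c, S_c is the volume of data stored there, p^store_c is its per-gigabyte-per-day storage price, D_c is the number of days the data actually spends in that class, M_c is the class’s minimum storage duration (0, 30, 90, or 365 days), R_c is the volume retrieved from it, and p^retrieve_c is its per-gigabyte retrieval price.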

The primary challenge was: can we pinpoint when objects should shift between storage tiers, based on size, retrieval frequency, and associated bucket costs?

To develop this age-based calculator, we:

  • Collected data on video age, access frequencies, and current storage costs.
  • Outlined the constraints as stated above. For example, cheaper storage tiers have higher retrieval costs and deletion penalties.
  • Solved the optimization equation by linear programming to search through thousands of policy combinations, aiming to identify the globally optimal storage policy that minimizes cost. (A simplified sketch of this search follows the list.)
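To make the search concrete, here’s a heavily simplified sketch of the idea: enumerate candidate transition ages, price out each combination against a toy fleet of objects, and keep the cheapest. Every price, size, and access rate below is made up, and the real solution used linear programming over far more combinations than this brute-force toy.

```python
"""Toy brute-force version of the storage-policy search.

The real system solved a linear program over many more policy combinations;
this sketch only illustrates the shape of the problem.
"""
from itertools import combinations

# Per-GB-per-day storage price and per-GB retrieval price (illustrative).
PRICES = {
    "standard": {"store": 0.00067, "retrieve": 0.00},
    "nearline": {"store": 0.00033, "retrieve": 0.01},
    "coldline": {"store": 0.00013, "retrieve": 0.02},
    "archive": {"store": 0.00004, "retrieve": 0.05},
}
MIN_DAYS = {"standard": 0, "nearline": 30, "coldline": 90, "archive": 365}
CLASSES = ["standard", "nearline", "coldline", "archive"]


def policy_cost(thresholds, size_gb, lifetime_days, reads_gb_per_day):
    """Cost of one object under a policy (t1, t2, t3): the ages at which it
    moves standard -> nearline -> coldline -> archive."""
    cuts = [min(b, lifetime_days) for b in (0, *thresholds, lifetime_days)]
    cost = 0.0
    for cls, start, end in zip(CLASSES, cuts, cuts[1:]):
        if end <= start:
            continue
        billed_days = max(end - start, MIN_DAYS[cls])  # minimum-duration charges
        cost += size_gb * billed_days * PRICES[cls]["store"]
        cost += reads_gb_per_day * (end - start) * PRICES[cls]["retrieve"]
    return cost


# A toy fleet of objects: (size in GB, observed lifetime in days, GB read per day).
objects = [(5, 400, 0.01), (20, 700, 0.001), (1, 200, 0.2)]

candidate_ages = [15, 30, 60, 90, 180, 365]
best_cost, best_thresholds = min(
    (sum(policy_cost(t, *obj) for obj in objects), t)
    for t in combinations(candidate_ages, 3)  # t1 < t2 < t3 by construction
)
print("cheapest thresholds (std->near, near->cold, cold->arch):", best_thresholds)
```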

Phase 2: Machine learning to tackle compute storage

Building on our successful optimization of source storage, our next significant challenge was devising an intelligent strategy for managing Vimeo’s extensive video library, a mix of historical assets and continuously uploaded content. For this, we relied on machine learning, or ML.

In this context, we classify storage into two categories: hot and cold. For videos in the non-source library that are frequently accessed, we utilize Google’s nearline class as our hot storage. Less frequently accessed videos are moved to Google’s coldline class, serving as our cold storage. Implementing an ML-based solution to tackle compute storage resulted in a reduction of our storage costs by about 20 percent.

Let’s dive into the ML techniques behind this win.

1. The ML approach broadly

Supervised learning is a machine learning paradigm where a model is trained on labeled data. The model makes predictions based on this training. Unsupervised learning, on the other hand, deals with unlabeled data, finding patterns and structures from the data itself.

With that in mind, here are the models we chose:

  • For supervised learning, we employed LightGBM, a gradient boosting framework known for its speed and accuracy.
  • For unsupervised learning, we leveraged K-means clustering for feature engineering and dimensionality reduction.

Our decision to use these specific models was primarily driven by their robustness and speed in handling vast datasets.
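To sketch how the two pieces fit together, here’s a minimal example with placeholder data and parameters rather than our production setup: a K-means cluster assignment is appended as an extra feature, and a LightGBM classifier is trained on the augmented matrix.

```python
import lightgbm as lgb
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Placeholder feature matrix: rows are videos, columns are engineered
# access/retrieval features; labels are 1 = hot, 0 = cold.
rng = np.random.default_rng(0)
X = rng.random((1000, 12))
y = (rng.random(1000) > 0.5).astype(int)

# Unsupervised step: summarize the feature space with a K-means cluster
# assignment and feed it to the supervised model as one extra column.
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, clusters])

X_train, X_val, y_train, y_val = train_test_split(
    X_aug, y, test_size=0.2, random_state=0
)

# Supervised step: gradient-boosted trees for the hot/cold decision.
model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```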

2. Diving into the data

Logs, logs, everywhere! From Google Cloud Operations Suite storage logs to logs from various CDNs, we tapped into multiple sources, giving us comprehensive insights about video interactions and performance. The dual approach of understanding when objects are fetched from caches (through CDN logs) and when directly accessed from storage (via storage logs) gives us a holistic view of video access patterns.

3. Feature engineering: Augmenting domain insight with automated machine learning, or AutoML

Feature engineering, especially manual feature engineering, is essential as it enables domain context to be instilled into model development. To understand CDN delivery, storage access, and viewing patterns of video assets, we employed both manual and AutoML-generated features. AutoML helped derive hundreds of potential features from our dataset very quickly. Manual feature engineering then enabled us to fine-tune, discard the noise, and emphasize genuinely valuable insights. By merging the strengths of both manual and automated techniques, we achieved a more tailored, effective representation of our data. Figure 2 shows Shapley plots used to understand the feature importance of our generated features.

Figure 2. This SHAP (Shapley additive explanations) feature importance plot displays the contribution of each feature towards the model’s prediction, ranked by average impact on the output. The SHAP values help to interpret the model’s decisions by highlighting the most influential features for storage classification.
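Plots like Figure 2 can be generated with the shap package. A minimal sketch, reusing the placeholder model and validation features from the previous snippet:

```python
import shap

# `model` and `X_val` are the placeholder classifier and validation features
# from the previous snippet; swap in your own trained model and data.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_val)

# Bar-style summary: mean absolute SHAP value per feature, i.e. the
# "average impact on model output" ranking shown in Figure 2.
shap.summary_plot(shap_values, X_val, plot_type="bar")
```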

Here are some key feature engineering insights; a short sketch of a few of these features follows the list:

  • Decay features. These are designed to give more weight to recent interactions. For each interaction with a video object, the assigned weight decreases exponentially as the interaction gets older. For example, the frequency of access within the past week is weighted more heavily than the frequency of access within the past 30 days.
  • Retrieval metrics. Normalizing retrieval as percentages allows for a consistent comparison across video assets of varying sizes, ensuring that the metric’s scale does not skew interpretations. A more mathematically precise way to understand normalized retrieval metrics would be: for a given time frame, what fraction of a video asset’s size is fetched? For instance, if CDN logs indicate that 500 MB of a 1 GB video asset has been retrieved, the CDN retrieval ratio equals 50 percent for that specified period.
  • Moving averages. These distinguish activity trends between CDN retrievals and direct Google Cloud Storage fetches. For instance, a sharp increase in the moving average might indicate a spike in direct storage retrievals, signifying potential cache misses in the CDN.
  • avg_perct_diff_cdn_retrieval. This custom metric captures the mean percentage difference between CDN retrievals over the time period considered for inference, with each retrieval expressed as a percentage of the video asset’s size.
  • first_access_since_creation. This custom metric denotes time to first access since video asset creation.
  • days_between_first_last_access. This custom metric indicates the total active life of an object in terms of its interactions.
  • count_of_peaks and count_of_valleys. These custom metrics encapsulate the volatility in object access frequencies.
  • AutoML for time series. With the tsfresh Python package, our feature engineering arsenal expanded. The package applies predefined mathematical techniques, transforming time series data into static variables suitable for classification. It automates the extraction of basic statistics to complex characteristics.
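To ground a few of these, here’s a rough sketch of how the decay weighting, the normalized retrieval ratio, and a tsfresh extraction might be computed. The column names, half-life, and toy data are illustrative, not our actual schema.

```python
import pandas as pd
from tsfresh import extract_features

# Illustrative access log: one row per (video, day) with bytes served.
events = pd.DataFrame({
    "video_id": ["a", "a", "a", "b", "b"],
    "days_ago": [1, 9, 40, 3, 70],
    "bytes_served": [2e8, 1e8, 5e8, 4e7, 9e8],
    "asset_size_bytes": [1e9, 1e9, 1e9, 2e9, 2e9],
})

# Decay feature: recent interactions count more, with an exponential
# half-life (14 days here, purely illustrative).
half_life = 14
events["decay_weight"] = 0.5 ** (events["days_ago"] / half_life)
decayed_access = events.groupby("video_id")["decay_weight"].sum()

# Normalized retrieval: what fraction of the asset's size was fetched
# in the window, so large and small videos are comparable.
retrieval_ratio = (
    events.groupby("video_id")["bytes_served"].sum()
    / events.groupby("video_id")["asset_size_bytes"].first()
)

# AutoML-style features: tsfresh turns each video's time series of bytes
# served into a wide table of static statistics.
ts_features = extract_features(
    events[["video_id", "days_ago", "bytes_served"]],
    column_id="video_id",
    column_sort="days_ago",
)

print(decayed_access, retrieval_ratio, ts_features.shape, sep="\n")
```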

4. Ground truth label assignment

In supervised learning, ground truth labels represent the true values or outcomes for specific instances, guiding the training process to generalize from these labeled examples. To assign these ground truth labels, we banked on existing video content with a rich history — those with at least 12 months of logs. We sampled these logs uniformly across a year. Videos were labeled as cold when the net cost of storing them, factoring in their retrieval and read costs, was positive. Conversely, they were labeled hot when the net costs turned negative.
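In code, one plausible reading of that labeling rule reduces to a simple comparison. The per-gigabyte rates and the summary table below are made up; the real numbers come from the current price list and twelve months of logs.

```python
import pandas as pd

# Illustrative per-GB rates, not actual Google Cloud pricing: monthly storage
# for our hot (nearline) and cold (coldline) tiers, plus cold retrieval.
HOT_STORE_PER_GB_MONTH = 0.010
COLD_STORE_PER_GB_MONTH = 0.004
COLD_RETRIEVAL_PER_GB = 0.02

# Placeholder summary of twelve months of logs per video.
history = pd.DataFrame({
    "video_id": ["a", "b"],
    "size_gb": [12.0, 3.5],
    "gb_retrieved_12m": [1.0, 400.0],
})

# Net result of keeping the video cold for a year instead of hot:
# storage savings minus the retrieval/read charges cold storage would add.
savings = 12 * history["size_gb"] * (HOT_STORE_PER_GB_MONTH - COLD_STORE_PER_GB_MONTH)
penalty = history["gb_retrieved_12m"] * COLD_RETRIEVAL_PER_GB
history["label"] = (savings - penalty > 0).map({True: "cold", False: "hot"})

print(history)
```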

5. Inference and implementation

We use Kubeflow Pipelines for recurring storage classification inference tasks, ensuring timely decisions with the latest data. Figure 3 shows our inference flow.

Figure 3. Flowchart illustrating the model inference process using Kubeflow. The sequence begins with video uploads, followed by an 8-week data collection period. Subsequent steps involve data processing, feature engineering, and model inference. The results are then outputted to a CSV file and finalized by setting Days-Since-Custom_time headers in Google Cloud Storage.

After inference, predictions are organized into CSV files, with each row specifying a video’s recommended storage tier. We use the Days-Since-Custom_time headers in Google Cloud Storage to automate video transitions to appropriate storage tiers based on the model’s recommendations, eliminating manual oversight.
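The final step of the flow, setting the custom-time metadata that an age-based lifecycle rule can key on, looks roughly like the following; the bucket name, CSV layout, and offset convention are placeholders rather than our exact implementation.

```python
import csv
from datetime import datetime, timedelta, timezone

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-video-bucket")  # placeholder bucket name

# predictions.csv: one row per video, e.g. "videos/123/source.mp4,cold"
with open("predictions.csv") as f:
    for object_name, tier in csv.reader(f):
        blob = bucket.blob(object_name)
        # Back-date the custom-time metadata for objects the model wants in
        # cold storage so that a daysSinceCustomTime lifecycle rule catches
        # them on its next run; leave "hot" objects looking freshly accessed.
        offset_days = 90 if tier == "cold" else 0
        blob.custom_time = datetime.now(timezone.utc) - timedelta(days=offset_days)
        blob.patch()
```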

6. What happens to content in hot storage that cools down in demand over time?

A fair question — hot storage is meant for data that needs to be frequently accessed. Leaving files in hot storage once their access frequency drops off over time is economically inefficient.

The solution? Set a uniform age-based policy that will, over time, transition less frequently accessed (or colder) objects to cold storage, optimizing costs. The uniform policy would be based on the retrieval distribution for content that remains in hot storage over extended periods.

7. What happens to content in cold storage that suddenly becomes hot again?

Transitioning content from cold storage back to hot storage poses challenges. Cold storage is designed for infrequent data retrieval, so premature retrieval or deletion can incur costs.

One deterrent to transitioning from cold to hot is the deletion penalty incurred when data is moved before its stipulated minimum time in cold storage. However, if the content has suddenly surged in popularity or relevance, such that it’s more economical to incur the deletion penalty than to keep it in cold storage, then a move back to hot storage might be justifiable. Thankfully, these instances are rare.
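That break-even judgment can be sketched as a small calculation: weigh the early-deletion penalty plus hot-storage costs against the retrieval charges expected if the object stays cold. The rates, horizon, and inputs below are illustrative only.

```python
# Illustrative per-GB rates; real numbers come from the current price list.
COLD_STORE_PER_GB_MONTH = 0.004
COLD_RETRIEVAL_PER_GB = 0.02
HOT_STORE_PER_GB_MONTH = 0.010


def should_promote(size_gb, days_left_of_minimum, expected_gb_read_per_month,
                   horizon_months=3):
    """Rough break-even test for moving a newly popular video back to hot storage."""
    # Early-deletion penalty: cold storage is still billed for the days left
    # in the minimum-duration window even after the object moves out.
    early_deletion = size_gb * COLD_STORE_PER_GB_MONTH * (days_left_of_minimum / 30)
    stay_cold = horizon_months * (
        size_gb * COLD_STORE_PER_GB_MONTH
        + expected_gb_read_per_month * COLD_RETRIEVAL_PER_GB
    )
    go_hot = early_deletion + horizon_months * size_gb * HOT_STORE_PER_GB_MONTH
    return go_hot < stay_cold


print(should_promote(size_gb=10, days_left_of_minimum=45,
                     expected_gb_read_per_month=500))
```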

The robustness of our classification model is evidenced by a high F1 score of roughly 0.81, with precision and recall at 78 percent and 84 percent, respectively (Figure 4 shows the normalized confusion matrix).

Figure 4. Confusion matrix for hot and cold video storage classification, displaying true positives, false positives, true negatives, and false negatives.

Figure 5 shows our precision-recall curve.

Figure 5. Precision-recall curve for the video classification model. This curve illustrates the tradeoff between precision (the accuracy of the positive predictions) and recall (the fraction of positives that were correctly identified) at different thresholds. The area under the curve, or AUC, of 0.83 signifies a strong capability of the model to balance and differentiate between videos classified as hot and cold effectively.
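The numbers and figures above come from standard evaluation tooling; here’s a minimal sketch with scikit-learn, using placeholder labels and scores in place of our held-out set.

```python
import numpy as np
from sklearn.metrics import (auc, confusion_matrix, precision_recall_curve,
                             precision_score, recall_score)

# Placeholder labels and scores standing in for the held-out evaluation set.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
y_score = np.clip(0.35 * y_true + 0.7 * rng.random(500), 0, 1)
y_pred = (y_score > 0.5).astype(int)

print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))

# Normalized confusion matrix (each true class sums to 1), as in Figure 4.
print(confusion_matrix(y_true, y_pred, normalize="true"))

# Precision-recall curve and its area under the curve, as in Figure 5.
precision, recall, _ = precision_recall_curve(y_true, y_score)
print("PR AUC:", auc(recall, precision))
```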

Furthermore, error analysis is a vital step in our model evaluation. By pinpointing which items were incorrectly earmarked for cold storage, we gain valuable insights. Error analysis has been instrumental in refining our feature engineering approach, ensuring our model remains agile and precise in its subsequent predictions.

8. Training and validation details

Training and validating our model wasn’t a straightforward task. We delved deep using time-series cross-validation, ensuring our data splits made sense chronologically. Bayesian hyperparameter optimization helped us tweak our model to its best form. But the work didn’t stop there. By analyzing where our model went wrong, we made tweaks and improvements, iterating on our feature set. If you’re keen on the nitty-gritty of these processes, keep an eye out for an upcoming article on this blog where I’ll dive deeper into these technical aspects.
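For the curious, here’s the rough shape of that loop: scikit-learn’s TimeSeriesSplit keeps the folds chronological, and a Bayesian optimizer (Optuna here, as one common choice, not necessarily our exact production tooling) proposes LightGBM hyperparameters. Everything below is a placeholder sketch rather than our training code.

```python
import lightgbm as lgb
import numpy as np
import optuna
from sklearn.metrics import f1_score
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data, ordered by time (row 0 is the oldest observation).
rng = np.random.default_rng(0)
X = rng.random((2000, 12))
y = (rng.random(2000) > 0.5).astype(int)


def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 400),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 127),
    }
    scores = []
    # Chronological folds: always train on the past, validate on the future.
    for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
        model = lgb.LGBMClassifier(**params)
        model.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
print("best hyperparameters:", study.best_params)
```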

--


Tanushree Nori has been with Vimeo since 2021, focusing on hosting optimization and lately, LLM-driven product development.