Building Pinterest’s new wide column database using RocksDB

Rajath Prasad, Senior Engineering Manager

Pinterest serves more than 480M monthly users and has grown to be a global destination for visual inspiration. As Pinterest has grown, so have our storage requirements. In 2020, anticipating the growing needs of the business and to simplify our storage offerings, we decided to consolidate the different key-value systems in the company into a single unified service called KVStore. While KVStore was the client-facing abstraction, we also built a storage service called Rockstorewidecolumn: a wide column, schemaless NoSQL database built using RocksDB. This blog post goes into the details of how we built this massively scalable, highly available wide column database using RocksDB, and provides information about the data model, APIs, and key features. Additionally, the last section explains how this new database supports a key platform in the product.

Wide Column Databases

Before we dive into the details, let’s first understand what a wide column database is. While a simple key-value database can be viewed as a persistent hash map, a wide column database can be interpreted as a two-dimensional key-value store with a flexible columnar structure. The key difference compared to a relational database is that the columns can vary from row to row, without a fixed schema. The diagram below shows the difference between a simple key-value structure and a wide column structure.

Figure 1: Difference between a simple key-value database and a wide column database
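
In rough C++ terms, ignoring persistence and versioning, the two models can be sketched as nested maps (the type names here are purely illustrative):

    #include <map>
    #include <string>

    // Simple key-value store: one opaque value per key.
    using SimpleKeyValueStore = std::map<std::string, std::string>;

    // Wide column store: each row key owns its own flexible set of named
    // columns; different rows may have entirely different columns.
    using WideColumnStore =
        std::map<std::string, std::map<std::string, std::string>>;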

RocksDB

RocksDB is an open-source embedded key-value store developed and maintained by Meta. It is built on the concept of Log-Structured Merge (LSM) tree, which is an alternative approach to the traditional B-tree data structures used in many databases. RocksDB is designed to provide high performance, scalability, and efficient storage for various workloads. It is written in C++ and offers bindings for several programming languages, making it accessible for developers in different environments.

RocksDB is a single-node key-value store. In order to build a distributed and replicated service on top of RocksDB, we built a real-time replicator library: Rocksplicator. Rocksplicator handles the real-time replication and the cluster management operations needed for a highly available, fault-tolerant, distributed storage service. Numerous services have been built using Rocksplicator and have been serving Pinterest well for the past five years.

Motivation

As explained in this blog post, in 2019, Pinterest had four different key-value services with different storage engines, including RocksDB, HBase, and HDFS. Maintaining these disparate systems and building common functionality across them added a huge overhead for the teams. A working group within the company did a deep dive on requirements and technologies, analyzed tradeoffs and benefits, and came up with a final proposal to unify all the existing key-value use cases into a single service with RocksDB as the backend storage engine. While the company was already operating Rockstore, a key-value system based on RocksDB, that service only supported a simple key-value model. UserMetaStore, a service based on HBase, supported the wide column key-value model. In order to migrate the use cases from UserMetaStore, we had to build a wide column key-value database based on RocksDB.

This blog post focuses on how we designed and implemented a wide column database service using the simple key-value interface provided by RocksDB.

Data Model

The wide column data model is defined as:

Dataset: all the data belonging to a particular use case (analogous to a table in a relational database)

Row: the smallest unit that stores related data.

  • Individual rows constitute a dataset.
  • A row key uniquely identifies a row in a dataset.
  • A row consists of pairs of column names and values, called items.

Item: consists of a column name and a list of cells

Cell: consists of a timestamp and column value (data) pair

  • The column name and a timestamp uniquely identify a column value in a row

Figure 2: Logical view of a wide column database. The key point to observe here is that the database is schemaless, and each row need not contain all columns. All names, addresses, and phone numbers are illustrative, not real.
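
Written out as plain C++ types, the hierarchy above might look like the following. This is a sketch of the logical model only; the type names are ours, and the physical layout is described in the next section.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    // Cell: a (timestamp, column value) pair; the timestamp identifies one
    // version of the value.
    struct Cell {
      uint64_t timestamp_ms;
      std::string data;
    };

    // Item: a column name together with its list of cells (versions).
    using Item = std::vector<Cell>;

    // Row: items keyed by column name; the row key identifies the row.
    using Row = std::map<std::string, Item>;

    // Dataset: all rows for one use case, keyed by row key.
    using Dataset = std::map<std::string, Row>;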

Storage View

The underlying storage engine is RocksDB, which uses a simple key-value structure (with no concept of columns or timestamps). To build the logical view described above on top of a simple key-value structure, we designed the following storage model:

  1. Every RocksDB key is a concatenation of row key, column name, and timestamp of the update operation.
  2. The value of the column is the RocksDB value.
  3. While RocksDB sorts keys in ascending byte order by default, in our case, we use a custom comparator to sort the keys in descending timestamp order so that the latest versions of a value are accessed first.

Figure 3: Storage view of the example in Figure 2. All names, addresses, and phone numbers are illustrative, not real.
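
The exact key encoding is internal to the service, but a minimal sketch of the idea, assuming a '\0' separator between row key and column name and an 8-byte big-endian timestamp suffix, could look like this (EncodeKey and the comparator class are our illustrative names, not the production code):

    #include <cstdint>
    #include <string>

    #include "rocksdb/comparator.h"
    #include "rocksdb/slice.h"

    // Assumed key layout: <row_key> '\0' <column_name> <8-byte big-endian
    // timestamp>. Big-endian encoding makes byte order match numeric order.
    std::string EncodeKey(const std::string& row_key,
                          const std::string& column,
                          uint64_t timestamp_ms) {
      std::string key = row_key + '\0' + column;
      for (int shift = 56; shift >= 0; shift -= 8) {
        key.push_back(static_cast<char>((timestamp_ms >> shift) & 0xff));
      }
      return key;
    }

    // Custom comparator: (row key, column name) ascending, timestamp
    // descending, so the newest version of a column is encountered first.
    class DescendingTimestampComparator : public rocksdb::Comparator {
     public:
      const char* Name() const override {
        return "widecolumn.DescendingTimestampComparator";
      }

      int Compare(const rocksdb::Slice& a,
                  const rocksdb::Slice& b) const override {
        // Compare the (row key, column name) portion in ascending order.
        rocksdb::Slice a_prefix(a.data(), a.size() - 8);
        rocksdb::Slice b_prefix(b.data(), b.size() - 8);
        int cmp = a_prefix.compare(b_prefix);
        if (cmp != 0) return cmp;
        // Same row and column: newer (larger) timestamps sort first.
        rocksdb::Slice a_ts(a.data() + a.size() - 8, 8);
        rocksdb::Slice b_ts(b.data() + b.size() - 8, 8);
        return b_ts.compare(a_ts);
      }

      // No-op implementations are valid; they only disable a key-shortening
      // optimization.
      void FindShortestSeparator(std::string*,
                                 const rocksdb::Slice&) const override {}
      void FindShortSuccessor(std::string*) const override {}
    };

An equivalent trick is to store the bitwise-inverted timestamp and keep RocksDB’s default ascending comparator; either way, a forward scan yields newest-first ordering within a column.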

API

The database supports the following APIs:

Get Row

The API takes the following parameters as input:

  1. Dataset name
  2. Row key
  3. Column names (optional)
  4. Number of versions (optional)

The API returns the column values of the row key for the specified columns and number of versions (latest version first). If no columns are provided as input, all the columns for the specified row are returned.

To implement this API, we make use of the prefix scan and seek feature in RocksDB by providing the row key and the column name as the prefix. The iterator advances until the end of the prefix is reached or until the required number of versions has been read. If no column name is provided, all the entries starting with the row key as the prefix are scanned, discarding any versions beyond the number requested.
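
A minimal sketch of the single-column read path, reusing the hypothetical EncodeKey and key layout from the storage view section. Seeking at the maximum timestamp positions the iterator on the newest stored version under the descending comparator:

    #include <cstdint>
    #include <memory>
    #include <string>
    #include <vector>

    #include "rocksdb/db.h"

    // Return up to num_versions values for one (row, column), newest first.
    std::vector<std::string> GetColumnVersions(rocksdb::DB* db,
                                               const std::string& row_key,
                                               const std::string& column,
                                               size_t num_versions) {
      const std::string prefix = row_key + '\0' + column;
      std::vector<std::string> values;
      std::unique_ptr<rocksdb::Iterator> it(
          db->NewIterator(rocksdb::ReadOptions()));
      // UINT64_MAX sorts first among this column's versions, so the seek
      // lands on the newest existing entry with this prefix.
      for (it->Seek(EncodeKey(row_key, column, UINT64_MAX));
           it->Valid() && it->key().starts_with(prefix) &&
               values.size() < num_versions;
           it->Next()) {
        values.emplace_back(it->value().ToString());
      }
      return values;
    }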

Put Row

The API takes the following parameters as input:

  1. Dataset name
  2. Row key
  3. List of items (tuples of column name, column value, and timestamp)

Every single ‘put’ operation is an append operation in RocksDB (i.e., no in-place modifications are done). The API caller has the option of specifying the timestamp of the operation. If no timestamp is specified, the server timestamp is used. This timestamp is also used to support the Time to Live (TTL) feature (explained later).

To ensure atomicity of writes at the row key level, all writes to a row key in an API call are wrapped in a RocksDB write batch.
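
A sketch of the write path under the same assumed key layout; the Item struct here simply mirrors the tuples in the parameter list above:

    #include <cstdint>
    #include <string>
    #include <vector>

    #include "rocksdb/db.h"
    #include "rocksdb/write_batch.h"

    struct Item {
      std::string column;
      std::string value;
      uint64_t timestamp_ms;  // caller-supplied, or server time if absent
    };

    // Each item becomes a fresh (row, column, timestamp) entry: an append,
    // never an in-place modification.
    rocksdb::Status PutRow(rocksdb::DB* db, const std::string& row_key,
                           const std::vector<Item>& items) {
      rocksdb::WriteBatch batch;
      for (const auto& item : items) {
        batch.Put(EncodeKey(row_key, item.column, item.timestamp_ms),
                  item.value);
      }
      // The batch commits atomically: all entries for the row, or none.
      return db->Write(rocksdb::WriteOptions(), &batch);
    }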

Delete Row

The API takes the following parameters as input:

  1. Dataset name
  2. Row key
  3. List of column names (optional)

If the column names are provided, this API deletes the entries for the column names specified. If no column names are provided, all the entries for the row key are deleted.
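
A sketch of the full-row variant: collect every key under the row prefix and delete them in one atomic batch (the per-column variant would seek to each (row, column) prefix instead):

    #include <cstdint>
    #include <memory>
    #include <string>

    #include "rocksdb/db.h"
    #include "rocksdb/write_batch.h"

    rocksdb::Status DeleteRow(rocksdb::DB* db, const std::string& row_key) {
      const std::string prefix = row_key + '\0';
      rocksdb::WriteBatch batch;
      std::unique_ptr<rocksdb::Iterator> it(
          db->NewIterator(rocksdb::ReadOptions()));
      // An empty column name sorts before every real column of this row
      // under the sketch comparator, so the seek lands on the row's first
      // entry.
      for (it->Seek(EncodeKey(row_key, "", UINT64_MAX));
           it->Valid() && it->key().starts_with(prefix); it->Next()) {
        batch.Delete(it->key());
      }
      return db->Write(rocksdb::WriteOptions(), &batch);
    }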

For all the above APIs, we support a batch version so that clients can operate on batches of keys with a single API call.

Features Supported

Versioned Values

Versioning of values is a feature to store multiple versions of data for the same key and column. This feature can be useful for use cases that want to store a history of values for a particular key. It can also be used to support an append-only data model, which we will explore later in this post.

To support versioned values in the database, every new update (including an overwrite) is modeled as a new entry. Because the timestamp is included in the RocksDB key, each new write carries a unique timestamp (in milliseconds) and creates a new entry in RocksDB. This converts the two-dimensional wide column structure shown in Figure 1 into a three-dimensional structure with time as the third dimension.

Figure 4: Logical view of a wide column database with versioned values

For each dataset, clients can configure the number of versions to store for each value. A full compaction of the database is run daily to discard the older versions that are no longer needed. At query time, clients can choose the number of versions of the data to be returned.

TTL

TTL is a mechanism to set the expiration of data in a dataset. Once a value has expired, it can no longer be retrieved from the dataset, and it will eventually be physically deleted from the database. Many use cases use this feature to automatically clean up older data that is no longer valuable or required. This also helps to keep the size of the dataset in check and reduces storage costs.

Clients can configure the TTL at the dataset level. The timestamp in the key is used to determine if the value has expired. The TTL is enforced in two scenarios:

  1. Real-time reads: When a key is read, its timestamp is checked to see if the value has expired (a value is considered expired if current timestamp > timestamp in key + configured TTL). If the entry has expired, the value is not returned as part of the response. No deletes are issued at this point to physically remove the data.
  2. Daily compaction: During the daily compaction process, if a value is determined to have expired, it is removed from the database (see the sketch below, which also folds in the version trimming described earlier).
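
Both cleanup rules can be expressed with a RocksDB compaction filter. The sketch below uses our own names and the key layout assumed earlier: it drops a cell once it has expired, or once max_versions newer cells exist for the same (row, column). Because the comparator orders each column newest-first, a simple counter suffices.

    #include <cstdint>
    #include <string>

    #include "rocksdb/compaction_filter.h"
    #include "rocksdb/slice.h"

    class WideColumnCleanupFilter : public rocksdb::CompactionFilter {
     public:
      WideColumnCleanupFilter(uint64_t ttl_ms, uint64_t now_ms,
                              int max_versions)
          : ttl_ms_(ttl_ms), now_ms_(now_ms), max_versions_(max_versions) {}

      // Returning true tells RocksDB to drop the entry during compaction.
      bool Filter(int /*level*/, const rocksdb::Slice& key,
                  const rocksdb::Slice& /*value*/, std::string* /*new_value*/,
                  bool* /*value_changed*/) const override {
        // Rule 1: TTL. Expired if now > timestamp in key + configured TTL.
        if (ttl_ms_ > 0 && now_ms_ > DecodeTimestamp(key) + ttl_ms_) {
          return true;
        }
        // Rule 2: version trimming. Keys arrive newest-first per column.
        std::string prefix(key.data(), key.size() - 8);
        if (prefix != current_prefix_) {
          current_prefix_ = prefix;
          versions_seen_ = 0;
        }
        return ++versions_seen_ > max_versions_;
      }

      const char* Name() const override { return "WideColumnCleanupFilter"; }

     private:
      static uint64_t DecodeTimestamp(const rocksdb::Slice& key) {
        // The timestamp is the 8-byte big-endian suffix of the key.
        uint64_t ts = 0;
        const char* p = key.data() + key.size() - 8;
        for (int i = 0; i < 8; ++i) {
          ts = (ts << 8) | static_cast<uint8_t>(p[i]);
        }
        return ts;
      }

      const uint64_t ttl_ms_;
      const uint64_t now_ms_;
      const int max_versions_;
      // A filter instance serves one compaction (handed out by a
      // CompactionFilterFactory in practice), so per-compaction state is
      // safe even though Filter() is const.
      mutable std::string current_prefix_;
      mutable int versions_seen_ = 0;
    };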

Pagination of Responses

Rockstorewidecolumn does not enforce any limits on the number of columns stored per key, and we have use cases in production storing thousands of columns for a single row. Since the API supports getting all the columns for a key, a single call requesting thousands of columns can overwhelm the database. To avoid this, the Get Rows API supports pagination by returning a marker to the caller when the response reaches a specified limit (default 100 columns). The caller then provides the same marker in the subsequent call to get the next batch of values. This continues until all the columns have been returned. The marker returned in intermediate calls is used to avoid rescanning the same set of columns and to jump directly to the next column to be returned.
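
A sketch of one page of such a scan, under the earlier key-layout assumptions (GetRowPage, ExtractColumn, and the marker-as-column-name convention are ours):

    #include <cstdint>
    #include <memory>
    #include <string>
    #include <utility>
    #include <vector>

    #include "rocksdb/db.h"

    // Column name sits between the row separator and the timestamp suffix.
    static std::string ExtractColumn(const rocksdb::Slice& key) {
      std::string without_ts(key.data(), key.size() - 8);
      return without_ts.substr(without_ts.find('\0') + 1);
    }

    // Scan up to `limit` columns starting at `marker` (empty on the first
    // call), returning the newest version of each. The returned string is
    // the marker for the next page, or empty once the row is exhausted.
    std::string GetRowPage(
        rocksdb::DB* db, const std::string& row_key, const std::string& marker,
        size_t limit,  // e.g. the 100-column default
        std::vector<std::pair<std::string, std::string>>* out) {
      const std::string row_prefix = row_key + '\0';
      std::unique_ptr<rocksdb::Iterator> it(
          db->NewIterator(rocksdb::ReadOptions()));
      // Resume directly at the marker column; earlier columns are never
      // rescanned.
      it->Seek(EncodeKey(row_key, marker, UINT64_MAX));
      std::string last_column;
      for (; it->Valid() && it->key().starts_with(row_prefix); it->Next()) {
        std::string column = ExtractColumn(it->key());
        if (column == last_column) continue;     // an older version; skip
        if (out->size() == limit) return column; // marker for the next call
        last_column = column;
        out->emplace_back(std::move(column), it->value().ToString());
      }
      return "";  // all columns returned
    }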

Time Range Query

This feature supports returning the versions of values within a specified time range. By default, the Get Rows API returns the latest N versions of a particular value. Clients can optionally choose a time range to query the data from. This is supported by taking in start_timestamp and end_timestamp as input from the caller and using the timestamps to create a range of keys to scan.
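
Under the same assumptions, the range query only changes the seek target and the stop condition (DecodeTimestamp is the helper from the compaction filter sketch):

    #include <cstdint>
    #include <memory>
    #include <string>
    #include <vector>

    #include "rocksdb/db.h"

    // Return the versions of one column whose timestamps fall in
    // [start_ts, end_ts], newest first.
    std::vector<std::string> GetVersionsInRange(rocksdb::DB* db,
                                                const std::string& row_key,
                                                const std::string& column,
                                                uint64_t start_ts,
                                                uint64_t end_ts) {
      const std::string prefix = row_key + '\0' + column;
      std::vector<std::string> values;
      std::unique_ptr<rocksdb::Iterator> it(
          db->NewIterator(rocksdb::ReadOptions()));
      // Seeking at end_ts lands on the newest version inside the range.
      for (it->Seek(EncodeKey(row_key, column, end_ts));
           it->Valid() && it->key().starts_with(prefix); it->Next()) {
        if (DecodeTimestamp(it->key()) < start_ts) break;  // past the range
        values.emplace_back(it->value().ToString());
      }
      return values;
    }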

Out of Order Updates

By providing a timestamp for an update, a caller can choose to insert an ‘older’ value into the database. This can be useful in backfill scenarios, or when the client wants to use the timestamp associated with the value (for example, event time).

Data Export

Databases are exported for two purposes:

  • Backups for disaster recovery
  • Exports for offline analysis

Backups

To back up a RocksDB instance, we make use of the checkpoint feature to create a snapshot of the data in a separate location on disk, and then upload it to AWS S3. In case of disasters, or to restore data to an older version, the backups stored in S3 are copied back to local disks and restored.
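
The checkpoint step itself is small; here is a minimal sketch using the stock RocksDB checkpoint utility (the S3 upload is a separate process and is omitted):

    #include <string>

    #include "rocksdb/db.h"
    #include "rocksdb/utilities/checkpoint.h"

    // Create an on-disk snapshot of a live database. RocksDB hard-links the
    // immutable SST files into checkpoint_dir, so this is cheap and does
    // not block writes.
    rocksdb::Status CreateLocalCheckpoint(rocksdb::DB* db,
                                          const std::string& checkpoint_dir) {
      rocksdb::Checkpoint* checkpoint = nullptr;
      rocksdb::Status s = rocksdb::Checkpoint::Create(db, &checkpoint);
      if (!s.ok()) return s;
      s = checkpoint->CreateCheckpoint(checkpoint_dir);
      delete checkpoint;
      return s;
    }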

Exports

The backups in S3 are in the SST format used by RocksDB and are not directly consumable by client workflows, for two main reasons:

  • The SST data format is not optimized for access by offline query engines (like Presto and Spark).
  • The SST files are not compacted: they contain duplicate keys (across different SST files of the same database) and tombstones of deleted data, and they require specialized logic to parse.

To make this data easier to consume, the files are compacted using a standalone RocksDB cluster and then converted to sequence/Parquet files for easy consumption by offline query engines.

Supporting Large Scale User Sequences

In this section, we will describe a major platform built using Rockstorewidecolumn. The main requirement for the datastore supporting large-scale user sequences at Pinterest was to provide a persistent, append-only log for storing user events. While this does not look like a traditional key-value use case, the feature set of Rockstorewidecolumn made it a compelling choice for the client team.

Problem Statement

A Pinterest user performs many kinds of actions: repins, pin clicks, searches, comments, and many more. These real-time user actions play a critical role in various ML applications, especially large-scale sequential modeling applications. Knowing a user’s recent actions in real time can provide a huge boost in improving their recommendations. The problem statement for the storage team was to provide a real-time database to store and serve the last few relevant actions of a user.

Data Model

The following data model was used for this use case:

  • Row key: user_id
  • Column name: event type (repins, searches, etc.)
  • Column value: event_id

To store multiple events for each event type, we used the versioned values feature. Each new incoming event is stored as a new version in the column, and the client team can configure how many older events (versions) they would like to retain. Older versions that exceed this limit are cleaned up by the daily compaction job.

During query time, the clients provide the user_id, event types, and the number of events to query. The latest N events are returned for each event type in reverse chronological order (latest first).

Figure 5: Data model of user sequences
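
In other words, a sequence read is just a versioned-column read. Reusing the hypothetical GetColumnVersions sketch from the API section (the identifiers and counts here are illustrative):

    // Fetch the latest 50 repin events for a user, newest first.
    std::vector<std::string> recent_repins = GetColumnVersions(
        db, /*row_key=*/"user:12345", /*column=*/"repin",
        /*num_versions=*/50);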

As mentioned in the blog post, the large-scale user sequence project has gained widespread adoption in the company, driving significant user engagement gains.

In Production

Rockstorewidecolumn has been in production for more than two years at this point. What was initially built as a service to take over use cases from UserMetaStore has seen tremendous growth in the number of new use cases onboarded. In the past two years, we have onboarded more than 300 production use cases, and the service handles millions of requests per second while storing petabytes of data. These use cases span critical real-time workloads requiring single- to low-double-digit millisecond latency.

While this post focused on the data model and RocksDB, a future post will describe the distributed aspects of the system (including sharding and replication).

Acknowledgements

I would like to thank the following people for their significant contributions to the work presented above: Guodong Han, Jessica Chan, Indy Prentice, Kangnan Li, Neil Enriquez, Jia Zhan, Mahmoud Eariby, Bo Liu, Madeline Nguyen, Harold Cabalic, Vikramjit Singh, Rakesh Kalidindi, Gopal Rajpurohit and our partners in the SRE and infrastructure organizations.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.
