Scaling Nextdoor’s Datastores: Part 3

In this part of the Scaling Nextdoor’s Datastores blog series, we’ll explore how the Core-Services team at Nextdoor serializes database data for caching while ensuring forward and backward compatibility between the cache and application code.

In Part 1 of this series we discussed how object-relational mapping (ORM) frameworks help abstract database-specific schemas and queries away from application code. Developers simply work with objects in their application’s language to access database data.

Here’s a simple example of using Python’s Django ORM to define a model:

from django.db import models

class User(models.Model):
    first_name = models.CharField(max_length=30)
    last_name = models.CharField(max_length=30)

The associated SQL CREATE TABLE statement would look like:

CREATE TABLE users (
"id" bigint NOT NULL PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
"first_name" varchar(30) NOT NULL,
"last_name" varchar(30) NOT NULL
);

Developers would then access database data like this:

user_id = 123
user = User.objects.get(id=user_id)
print(user.first_name)

Object Byte Serialization for Caching

An issue arises when adding a look-aside cache such as Redis/Valkey to an application: How do you store what you got from the database in the cache?

A common solution to caching complex objects, such as those from ORMs, is object byte serialization. This process converts language objects into bytes before storing them in the cache. When reading from the cache, the process is reversed: the byte data is deserialized back into language objects. In Python, this is often done with the pickle module.

The interaction between the application, database, and the cache looks like this:

[Figure: look-aside cache interaction between the application, the cache, and the database]

import pickle

user_bytes = cache.get("user_123")

if user_bytes is not None:
    user = pickle.loads(user_bytes)
else:
    user = User.objects.get(id=123)
    user_bytes = pickle.dumps(user)
    cache.set("user_123", user_bytes)

While this method enhances performance by reducing database queries and leveraging cached data, it also presents challenges, particularly when serialized data is tightly coupled to specific runtime environments, package versions, and schema definitions.

Issues with Byte Serialization

Bound to Runtime Version and Package Version: Serialized objects often embed information about the object’s structure as defined by the code at serialization time. When the code or its dependencies are updated (e.g., in a new release), serialized objects may fail to deserialize, making the cached data incompatible with the new code version.

Thundering Herd Problem during Migrations: When a schema migration occurs (e.g. adding a new field/column), the cache might suddenly contain a mix of old and new serialized data. As a result it can force the application to treat many cache entries as misses due to deserialization failures. If the cached items are large or in high demand (“hot”), a simultaneous cache miss across many clients can result in a “thundering herd” effect. In this scenario, numerous processes will concurrently query the database to refill the cache, placing excessive load on the database.

Author’s Note: The implied solution to deserialization errors is to query the database and re-fill the cache.
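That fallback can be sketched as follows. This is a minimal, self-contained sketch: `DictCache` and `fetch_user_from_db` are stand-ins for Redis/Valkey and the Django ORM, not Nextdoor’s actual code.

```python
import pickle

class DictCache:
    """Minimal in-memory stand-in for Redis/Valkey (illustrative only)."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value):
        self._store[key] = value

def fetch_user_from_db(user_id):
    """Stand-in for User.objects.get(id=user_id)."""
    return {"id": user_id, "first_name": "Ada", "last_name": "Lovelace"}

def get_user(cache, user_id):
    key = f"user_{user_id}"
    cached = cache.get(key)
    if cached is not None:
        try:
            return pickle.loads(cached)
        except Exception:
            # Incompatible bytes (e.g. after a deploy): treat as a cache miss.
            pass
    user = fetch_user_from_db(user_id)
    cache.set(key, pickle.dumps(user))
    return user

cache = DictCache()
cache.set("user_123", b"not-a-valid-pickle")  # simulate an incompatible entry
user = get_user(cache, 123)
print(user["first_name"])  # falls back to the database and refreshes the cache
```

Note that every process hitting the same broken entry takes this database fallback at once, which is exactly the thundering-herd risk described above.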

Forward and Backward Compatibility

To address the challenges discussed above, it is important to design a cache serialization strategy that ensures both forward and backward compatibility.

  • Forward Compatibility: Older versions of the application should be able to read cache entries that were written by newer application versions.
  • Backward Compatibility: New versions of the application should be capable of reading cache entries that were written by previous application versions.

Serialization Format of Choice

When evaluating our options, we compared various serialization formats based on versioning compatibility, the performance of serialization and deserialization, and the resulting serialized byte size. Ultimately, we chose MessagePack to serialize Django Model objects.

Forward and Backward Compatibility

For forward compatibility, the MessagePack Python library can ignore new fields during deserialization if they don’t exist in the current version of the code. This ensures that older versions of the code can still deserialize cache entries from an updated data model.

Author’s Note: Since we rarely remove fields from Django Models, older code typically doesn’t encounter missing fields in the cache. When field removal is necessary, we use Django’s SeparateDatabaseAndState to decouple database schema changes from model updates.

For backward compatibility, when the new code version deserializes old cache entries, it populates any missing fields from the cache with their default values. We require developers to provide default values when adding new fields.
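Both directions can be sketched with a plain dict round-trip. Here `json` stands in for MessagePack purely to keep the sketch dependency-free; `KNOWN_FIELDS` and the field names are hypothetical, not Nextdoor’s actual schema.

```python
import json

# Fields the *current* code version knows about, with required defaults.
KNOWN_FIELDS = {"first_name": "", "last_name": "", "is_active": True}

def serialize(fields):
    return json.dumps(fields).encode()

def deserialize(data):
    raw = json.loads(data.decode())
    # Forward compatibility: ignore fields this code version doesn't know.
    # Backward compatibility: fill missing fields with their defaults.
    return {name: raw.get(name, default) for name, default in KNOWN_FIELDS.items()}

# Entry written by *newer* code that added an "email" field:
newer = serialize({"first_name": "Ada", "last_name": "Lovelace",
                   "is_active": True, "email": "ada@example.com"})
print(deserialize(newer))  # "email" is silently dropped

# Entry written by *older* code that predates "is_active":
older = serialize({"first_name": "Ada", "last_name": "Lovelace"})
print(deserialize(older))  # "is_active" is filled with its default
```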

Performance

To evaluate MessagePack’s performance, we conducted extensive tests comparing its standalone serialization/deserialization performance and its performance when integrated with our application. Our findings showed that MessagePack is slower than some alternatives (e.g., pickle). However, in our application tests, this difference did not noticeably affect overall latency when handling web requests.

Serialization Byte Size

MessagePack produces a smaller byte stream compared to other formats. Additionally, when combined with compression methods, we can further reduce the size of the serialized data.
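Layering compression on top of a byte serializer can be sketched like this, using zlib from the standard library (with `json` again standing in for MessagePack; the compression method Nextdoor uses isn’t specified here):

```python
import json
import zlib

payload = {"first_name": "Ada", "last_name": "Lovelace",
           "bio": "x" * 500}  # repetitive data compresses well

serialized = json.dumps(payload).encode()
compressed = zlib.compress(serialized)
assert len(compressed) < len(serialized)

# Reading back: decompress first, then deserialize.
restored = json.loads(zlib.decompress(compressed).decode())
assert restored == payload
```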

Implementation Details

To support additional use cases and potential future revisions to cache storage, such as compression, we prepend additional information to the serialized value prior to writing it to the cache.

The prepended header is composed of two parts: a 2-byte metadata field and an 8-byte (64-bit) version field. For example, a serialized value might look like this:

\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01

  • The first 2 bytes (\x00\x00) represent the metadata where we store serialization format information.
  • The following 8 bytes (\x00\x00\x00\x00\x00\x00\x00\x01) encode object version information.
  • The remaining bytes are the MessagePack-serialized data.
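Packing and unpacking such a header can be sketched with Python’s struct module. Big-endian byte order is assumed here based on the example bytes above; the body is a hypothetical placeholder for the MessagePack payload.

```python
import struct

# 2-byte metadata + 8-byte version, big-endian (assumed from the example above).
HEADER = struct.Struct(">HQ")

def wrap(metadata, version, body):
    """Prepend the 10-byte header to the serialized body."""
    return HEADER.pack(metadata, version) + body

def unwrap(blob):
    """Split a cached value back into (metadata, version, body)."""
    metadata, version = HEADER.unpack_from(blob)
    return metadata, version, blob[HEADER.size:]

blob = wrap(0, 1, b"example-msgpack-bytes")
assert blob[:10] == b"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01"

metadata, version, body = unwrap(blob)
print(metadata, version)  # 0 1
```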

We’ll explore the role and necessity of the version information bytes, and how we use them in Part 4: Keeping the Cache Consistent.

Conclusion

Ensuring forward and backward compatibility in caching alone wasn’t enough to deliver accurate data to users — we also needed a way to maintain cache consistency with our database. Maintaining cache consistency is crucial in a distributed environment, where we are handling concurrent web requests while delivering up-to-date data to users. To learn more about the importance of cache consistency and how we achieve it, check out Part 4: Keeping the Cache Consistent in the Scaling Nextdoor’s Datastores blog series.
