Why We Built a Write-Back Cache for Our Asset Library with Google Cloud Spanner

When we were designing Alexandria, we wanted storage that was fast and reliable but also cost-effective. From the beginning, we focused on read performance by combining an in-memory cache with GCS. Since only a very small portion of our libraries is actively used at any given moment, for example by somebody uploading or picking media while editing a website, we decided to keep the libraries in GCS and load them into an in-memory cache only when they are needed. GCS is reliable and relatively cheap as long-term storage, while the in-memory cache is fast but more expensive. We kept that cost as low as possible by only loading libraries into the in-memory cache while they were in use.
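To make that read path concrete, here is a minimal sketch in Go, assuming a hypothetical LibraryCache type and object layout (nothing below is Alexandria's actual code): libraries stay in GCS, and a library is pulled into the in-memory cache only when it is first accessed.

```go
// A minimal sketch of the read path, assuming a hypothetical object layout:
// libraries live in GCS and are loaded into an in-memory cache on first use.
package librarystore

import (
	"context"
	"io"
	"sync"

	"cloud.google.com/go/storage"
)

type Library struct {
	ID   string
	Data []byte // header, segments, etc. in the real system
}

type LibraryCache struct {
	mu     sync.Mutex
	gcs    *storage.Client
	bucket string
	loaded map[string]*Library // libraries currently held in memory
}

func NewLibraryCache(gcs *storage.Client, bucket string) *LibraryCache {
	return &LibraryCache{gcs: gcs, bucket: bucket, loaded: make(map[string]*Library)}
}

// Get returns a library from the in-memory cache, loading it from GCS on a miss.
// Real code would avoid holding the lock across the GCS read and would evict
// libraries that are no longer in use.
func (c *LibraryCache) Get(ctx context.Context, libraryID string) (*Library, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if lib, ok := c.loaded[libraryID]; ok {
		return lib, nil // fast path: the library is already in memory
	}
	r, err := c.gcs.Bucket(c.bucket).Object(libraryID).NewReader(ctx)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	data, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	lib := &Library{ID: libraryID, Data: data}
	c.loaded[libraryID] = lib // keep it in memory while it is in use
	return lib, nil
}
```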

So we were happy with the read performance. Write performance, however, had room for improvement. GCS is object storage, and it is not designed for low write latency. In the original design, every write operation in Alexandria had to synchronously update one or more of the GCS objects (the header, segments, or trashcan). For example, when a user deleted an asset, Alexandria removed the asset record from its segment file and added it to the trashcan. Alexandria could only confirm that the delete succeeded after everything was persisted in GCS, so the latency of those GCS writes showed up directly in the response time the user saw for the delete.
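To show why this hurt, here is a minimal sketch of a delete under that original design, assuming a hypothetical object layout and record format (the deleteAsset, readObject, and writeObject helpers are illustrative, not Alexandria's actual code). The user cannot get a response until both GCS writes have completed.

```go
// A minimal sketch of the original synchronous delete path, assuming
// hypothetical object names and record formats.
package librarystore

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"time"

	"cloud.google.com/go/storage"
)

// deleteAsset removes an asset record from its segment object and appends it
// to the trashcan object. The caller only gets a response after both GCS
// writes have been persisted, so GCS write latency is user-visible.
func deleteAsset(ctx context.Context, gcs *storage.Client, bucket, libraryID, assetID string) error {
	segmentObj := fmt.Sprintf("%s/segment", libraryID)   // hypothetical layout
	trashcanObj := fmt.Sprintf("%s/trashcan", libraryID) // hypothetical layout

	// Read-modify-write the segment: drop the asset record.
	segment, err := readObject(ctx, gcs, bucket, segmentObj)
	if err != nil {
		return err
	}
	updated := bytes.ReplaceAll(segment, []byte(assetID+"\n"), nil)
	if err := writeObject(ctx, gcs, bucket, segmentObj, updated); err != nil {
		return err
	}

	// Append the asset to the trashcan so it can be restored later.
	trash, err := readObject(ctx, gcs, bucket, trashcanObj)
	if err != nil {
		return err
	}
	trash = append(trash, []byte(assetID+" deleted "+time.Now().UTC().Format(time.RFC3339)+"\n")...)
	return writeObject(ctx, gcs, bucket, trashcanObj, trash)
}

func readObject(ctx context.Context, gcs *storage.Client, bucket, name string) ([]byte, error) {
	r, err := gcs.Bucket(bucket).Object(name).NewReader(ctx)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func writeObject(ctx context.Context, gcs *storage.Client, bucket, name string, data []byte) error {
	w := gcs.Bucket(bucket).Object(name).NewWriter(ctx)
	if _, err := w.Write(data); err != nil {
		w.Close()
		return err
	}
	return w.Close() // the write is only durable once Close returns without error
}
```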

The first issue we ran into was GCS's one-write-per-second rate limit for a given object. This rate limit is a problem for rapid updates to a single library, for example when a user is doing a bulk upload or when we're running a migration to import data into Alexandria. To work around the limitation, we implemented logic to combine multiple changes into a single write, a process sometimes called coalescing. The coalescing logic solved the rate-limiting problem, but it was difficult to read and test. Bugs were usually subtle and hard to reproduce outside of high-volume load tests, which made changes to that part of the code riskier and more time-consuming than we would have liked.
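A stripped-down, hypothetical sketch of the coalescing idea looks something like the following (the Coalescer type and flush callback are illustrative, not the production code): changes are buffered and flushed as one combined write at most once per second, and each caller is only acknowledged once the batch containing its change has been persisted.

```go
// A minimal, hypothetical sketch of write coalescing: changes to one library
// are buffered and flushed as a single combined GCS write at most once per
// second, which keeps us under the per-object rate limit.
package librarystore

import (
	"context"
	"sync"
	"time"
)

type Change struct {
	AssetID string
	Op      string // e.g. "add", "delete"
}

type pendingChange struct {
	change Change
	done   chan error // receives the result of the flush that persisted it
}

type Coalescer struct {
	mu      sync.Mutex
	pending []pendingChange
	flush   func(ctx context.Context, changes []Change) error // one combined GCS object write
}

func NewCoalescer(ctx context.Context, flush func(context.Context, []Change) error) *Coalescer {
	c := &Coalescer{flush: flush}
	go c.loop(ctx)
	return c
}

// Enqueue buffers a change and returns a channel that reports whether the
// batch containing this change was persisted.
func (c *Coalescer) Enqueue(ch Change) <-chan error {
	done := make(chan error, 1)
	c.mu.Lock()
	c.pending = append(c.pending, pendingChange{change: ch, done: done})
	c.mu.Unlock()
	return done
}

// loop flushes everything buffered so far as a single write, at most once per second.
func (c *Coalescer) loop(ctx context.Context) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			c.mu.Lock()
			batch := c.pending
			c.pending = nil
			c.mu.Unlock()
			if len(batch) == 0 {
				continue
			}
			changes := make([]Change, len(batch))
			for i, p := range batch {
				changes[i] = p.change
			}
			err := c.flush(ctx, changes)
			for _, p := range batch {
				p.done <- err // every caller in the batch sees the same outcome
			}
		}
	}
}
```

Even in this simplified form, batching, per-caller acknowledgement, and error handling all interact, which hints at why the production version was hard to read, test, and debug.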

The next issue was the long tail of write latencies. Most GCS writes are fast enough, but a small percentage are not, so users occasionally had to wait a long time after clicking upload, and some uploads even timed out.

So, at this point, we had both user experience and developer experience issues we wanted to improve.

To solve these two issues effectively, we decided to introduce a write-back cache. Essentially, we solved the write-performance problem the same way we had guaranteed fast read performance: with a fast, expensive, but small cache on top of cheap and reliable long-term storage.

Write-back caching is a storage technique in which updates are written to the cache as they happen and flushed to long-term storage later. Writes to the write-back cache are usually much faster than writes to long-term storage, so a write-back layer can improve both latency and throughput, because users are no longer waiting on the slower writes.
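In pseudocode-like Go, the pattern boils down to something like the following, with hypothetical Cache and LongTermStore interfaces standing in for Spanner and GCS (illustrative names, not Alexandria's actual types):

```go
// A minimal sketch of write-back caching with hypothetical interfaces:
// acknowledge the caller once the change is durable in the fast cache, and
// move it to slower long-term storage off the request path.
package librarystore

import "context"

type Cache interface {
	AppendChange(ctx context.Context, libraryID string, change []byte) error // fast write, e.g. Spanner
	PendingChanges(ctx context.Context, libraryID string) ([][]byte, error)  // changes not yet flushed
}

type LongTermStore interface {
	PersistChanges(ctx context.Context, libraryID string, changes [][]byte) error // slower write, e.g. GCS
}

type WriteBackStore struct {
	cache   Cache
	store   LongTermStore
	flushes chan string // library IDs waiting for a background flush
}

// Write returns as soon as the change is durable in the cache, so the user is
// not waiting on the long-term storage write.
func (s *WriteBackStore) Write(ctx context.Context, libraryID string, change []byte) error {
	if err := s.cache.AppendChange(ctx, libraryID, change); err != nil {
		return err
	}
	select {
	case s.flushes <- libraryID: // ask the background worker to flush this library later
	default: // a flush for this library is already queued; it will pick up this change too
	}
	return nil
}

// flushLoop runs in the background and moves cached changes into long-term
// storage. Real code would also mark changes as flushed and retry on failure.
func (s *WriteBackStore) flushLoop(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case libraryID := <-s.flushes:
			changes, err := s.cache.PendingChanges(ctx, libraryID)
			if err != nil || len(changes) == 0 {
				continue
			}
			_ = s.store.PersistChanges(ctx, libraryID, changes)
		}
	}
}
```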

We chose Google Cloud Spanner for our write-back cache for the following reasons:

  • Spanner is fast and, unlike GCS, does not impose a per-object write rate limit.

  • Spanner is highly available, with an availability SLA of up to 99.999%.

  • Spanner provides external consistency. With this guarantee, we can keep user and client experiences consistent: once an operation succeeds, the result is reflected in their asset library. (A sketch of what a write to the cache could look like follows this list.)
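To make that last point a bit more concrete, here is a minimal sketch of recording a pending change in Spanner with the Go client. The PendingWrites table, its columns, and the recordPendingWrite helper are hypothetical, not Alexandria's actual schema.

```go
// A minimal sketch of writing a change into the Spanner-backed cache, using a
// hypothetical PendingWrites table. Because Spanner is externally consistent,
// any read that starts after Apply returns will observe this change.
package librarystore

import (
	"context"
	"time"

	"cloud.google.com/go/spanner"
)

func recordPendingWrite(ctx context.Context, client *spanner.Client, libraryID, assetID, op string) error {
	m := spanner.Insert("PendingWrites",
		[]string{"LibraryId", "AssetId", "Op", "CreatedAt"},
		[]interface{}{libraryID, assetID, op, time.Now()})
	_, err := client.Apply(ctx, []*spanner.Mutation{m})
	return err
}
```

The write returns once Spanner commits, which is typically much faster than the tail of a GCS object write, and the flush to GCS can happen later, off the request path.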

To illustrate, here is the original request lifecycle for write operations in Alexandria:
