Why We Built a Write Back Cache for Our Asset Library with Google Cloud Spanner

At Squarespace, we aim to help our customers catch their customers’ eyes. Media assets, like images and videos of products or services, are important to show what our customers have to offer. So, the process of adding and iterating on media assets should be easy and fast. For example, when people need to reuse the same image, they shouldn’t have to upload it again.

We provide an Asset Library for our customers to manage their media assets. Users can search for and sort assets, organize them into folders, delete them, and restore them from the trash.

View of the Asset Library, with preview images.

As engineers, we are always thinking ahead, but we don’t always build the big thing upfront. We prefer to move fast, deliver an MVP, and then improve on it. We are open to change, and we want our changes to make a real difference. In this blog post, we’ll walk through one such journey.

What is Alexandria?

The Asset Library UI is backed by a service called Alexandria. We named the service after the historically significant Egyptian library. Alexandria creates media asset records during the upload process and organizes those assets into libraries where users can manage and select from their uploaded content.

An asset library is backed by three types of Google Cloud Storage (GCS) objects:

  • Segment file(s) holding all the asset records. These asset records contain important metadata about the assets in the library, such as color information for images or format for videos.

  • A header file with a manifest of the library's segments, and other library level information, such as folder structures.

  • A trash can for assets and folders that have been deleted but can still be restored to the library.

How the asset libraries are stored in GCS.
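To make that layout concrete, here is a minimal sketch of the three object types as plain data types. All names and fields (`LibraryLayout`, `LibraryHeader`, `Segment`, `TrashCan`, and so on) are illustrative assumptions, not Alexandria’s actual schema.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of the three GCS object types backing a library.
// All names and fields are assumptions, not the real schema.
public class LibraryLayout {

    // Metadata for a single asset, stored inside a segment file.
    record AssetRecord(String assetId, String fileName,
                       Map<String, String> metadata /* e.g. colors, video format */) {}

    // Segment file: holds a batch of asset records.
    record Segment(String segmentId, List<AssetRecord> assets) {}

    // Header file: manifest of the library's segments plus library-level
    // information such as the folder structure.
    record LibraryHeader(String libraryId,
                         List<String> segmentIds,
                         Map<String, List<String>> folders /* folder -> asset ids */) {}

    // Trash can: deleted assets and folders that can still be restored.
    record TrashCan(List<AssetRecord> deletedAssets, List<String> deletedFolders) {}
}
```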

The Problems We Had

When we were designing Alexandria, we wanted storage that was fast and reliable, but also cost-effective. From the beginning, we focused on read performance by combining an in-memory cache with GCS. Since only a very small portion of our libraries are actively used at any given moment, by somebody uploading or picking media while editing a website, for example, we decided to keep the libraries in GCS and load them into the in-memory cache only when they are needed. GCS is reliable and relatively cheap as long term storage, while the in-memory cache is fast but more expensive. We kept that cost as low as possible by only loading libraries into the in-memory cache while they were in use.

So we were happy with the read performance. Write performance, however, had room for improvement. GCS is object storage, and it’s not designed for low write latency. In its original state, every write operation in Alexandria needed to update one or more of the GCS objects (header, segments, or trash can) synchronously. For example, when a user deleted an asset, Alexandria deleted the asset record from its segment file and added it to the trash can. Alexandria could only confirm that the delete was successful after everything was persisted in GCS, so the performance of these writes to GCS was reflected in the response time the user saw for the delete operation.

The first issue we ran into was GCS’s one-write-per-second rate limit for a given object. This rate limit is a problem for rapid updates to a single library, for example, when a user is doing a bulk upload or when we’re running a migration to import data into Alexandria. To work around this limitation, we implemented logic to combine multiple changes into a single write, a process sometimes called coalescing. This coalescing logic solved the rate limiting problem, but it was difficult to read and test. Bugs were usually subtle and hard to replicate outside of high-volume load tests. This made changes to that part of the code riskier and more time consuming than we would have liked.
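To give a sense of what that coalescing involves, here is a minimal sketch of the idea: buffer changes and persist them with a single GCS write at most once per second, while callers wait on the batch. This is an illustrative reconstruction, not Alexandria’s actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of write coalescing: buffer changes to one library and
// persist them with a single GCS write at most once per second, staying under
// GCS's one-write-per-second-per-object limit. Callers still wait for the
// coalesced write to complete, as in the original synchronous design.
public class CoalescingWriter {

    private List<String> buffer = new ArrayList<>();
    private CompletableFuture<Void> currentBatch = new CompletableFuture<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public CoalescingWriter() {
        scheduler.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
    }

    // Buffer a change; the returned future completes once the change has been
    // persisted to GCS as part of a coalesced batch.
    public synchronized CompletableFuture<Void> submit(String change) {
        buffer.add(change);
        return currentBatch;
    }

    private void flush() {
        List<String> toWrite;
        CompletableFuture<Void> toComplete;
        synchronized (this) {
            if (buffer.isEmpty()) {
                return;
            }
            toWrite = buffer;
            toComplete = currentBatch;
            buffer = new ArrayList<>();
            currentBatch = new CompletableFuture<>();
        }
        writeToGcs(toWrite);       // one object write for the whole batch
        toComplete.complete(null); // unblock every caller in this batch
    }

    private void writeToGcs(List<String> changes) {
        System.out.println("Writing " + changes.size() + " coalesced changes to GCS");
    }
}
```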

The next issue was the long tail of write latencies. Most GCS writes are fast enough, but a small percentage are not. So users occasionally had to wait a long time after clicking upload, and some uploads even timed out.

So, at this point, we had both user experience and developer experience issues we wanted to improve.

How We Solved It

To solve these two issues effectively, we decided to introduce a write back cache. Essentially, we solved the write performance issue in the same way we guaranteed fast read performance: a fast, expensive, but small cache on top of cheap and reliable long term storage.

Write back caching is a storage technique: updates are written to the write back cache as they happen and flushed to long term storage at a later time. Writes to the write back cache are usually much faster than writes to long term storage, so a write back cache layer can improve both latency and throughput, because users aren’t waiting on the slower writes.
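In code, the pattern looks roughly like the sketch below: writes land in a fast in-memory map and are acknowledged immediately, while a background task flushes dirty entries to long term storage. This is a minimal illustration of the general technique, not our implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of write back caching: writes hit a fast cache and are
// acknowledged immediately; a background task flushes them to slow, durable
// storage later.
public class WriteBackCache {

    private final Map<String, String> dirty = new ConcurrentHashMap<>();
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    public WriteBackCache() {
        flusher.scheduleAtFixedRate(this::flush, 5, 5, TimeUnit.SECONDS);
    }

    // Fast path: write to the cache and return immediately.
    public void put(String key, String value) {
        dirty.put(key, value);
    }

    // Slow path, off the user's critical path: persist and clear dirty entries.
    private void flush() {
        dirty.forEach((key, value) -> {
            writeToLongTermStorage(key, value);
            dirty.remove(key, value); // only remove if not updated in the meantime
        });
    }

    private void writeToLongTermStorage(String key, String value) {
        System.out.println("Flushing " + key + " to long term storage");
    }
}
```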

We chose to use Google Cloud Spanner for our write back cache for the following reasons:

  • Spanner is fast and has no write rate limit.

  • Spanner is highly available, offering up to 99.999% availability.

  • Spanner provides external consistency. With this guarantee, user and client experiences stay consistent: once an operation succeeds, its result is reflected in the asset library.
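As a concrete illustration of what writing a pending change to Spanner can look like with the Java client, here is a short sketch. The table and column names (`PendingChanges`, `LibraryId`, `AssetId`, `Payload`, `UpdatedAt`) are assumptions for this example, not our actual schema.

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import com.google.cloud.spanner.Value;
import java.util.Collections;

// Sketch: persisting one pending library change to Spanner with the Java
// client. Table and column names are illustrative assumptions.
public class SpannerWriteExample {
    public static void main(String[] args) {
        Spanner spanner = SpannerOptions.newBuilder().build().getService();
        DatabaseClient client = spanner.getDatabaseClient(
                DatabaseId.of("my-project", "my-instance", "my-database"));

        Mutation change = Mutation.newInsertOrUpdateBuilder("PendingChanges")
                .set("LibraryId").to("lib-123")
                .set("AssetId").to("asset-456")
                .set("Payload").to("{\"op\": \"delete\"}")
                .set("UpdatedAt").to(Value.COMMIT_TIMESTAMP)
                .build();

        // insert-or-update keeps only the latest state per (library, asset),
        // which matches write back cache semantics.
        client.write(Collections.singletonList(change));
        spanner.close();
    }
}
```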

To illustrate, here is the original request lifecycle for write operations in Alexandria:

Figure 1 -- Original request lifecycle in Alexandria.

With write back caching, we introduced four changes to this workflow:

  • When loading a library from GCS into Alexandria’s in-memory cache, we apply pending changes from the write back cache so the cached library is up-to-date (see the sketch after this list).

  • If the target library is already in the cache, we make sure it’s up-to-date.

  • We write updates to the write back cache and respond to users immediately, instead of adding library updates to a batch to be written out to GCS.

  • We flush changes in the write back cache out to long-term storage in GCS asynchronously and clean them up from the cache.
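Here is a minimal sketch of the first of those changes: loading the GCS snapshot and overlaying any unflushed updates from the write back cache. The schema, query, and helper names are illustrative assumptions, not our actual code.

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;
import java.util.HashMap;
import java.util.Map;

// Sketch of the updated load path: read the library snapshot from GCS, then
// overlay any unflushed changes from the Spanner write back cache so the
// in-memory copy is up to date.
public class LibraryLoader {

    private final DatabaseClient spannerClient;

    public LibraryLoader(DatabaseClient spannerClient) {
        this.spannerClient = spannerClient;
    }

    public Map<String, String> loadLibrary(String libraryId) {
        // 1. Load the long term snapshot from GCS (stubbed here).
        Map<String, String> assets = loadSnapshotFromGcs(libraryId);

        // 2. Overlay pending changes from the write back cache.
        Statement query = Statement
                .newBuilder("SELECT AssetId, Payload FROM PendingChanges "
                        + "WHERE LibraryId = @libraryId")
                .bind("libraryId").to(libraryId)
                .build();
        try (ResultSet rows = spannerClient.singleUse().executeQuery(query)) {
            while (rows.next()) {
                assets.put(rows.getString("AssetId"), rows.getString("Payload"));
            }
        }
        return assets;
    }

    private Map<String, String> loadSnapshotFromGcs(String libraryId) {
        return new HashMap<>(); // placeholder for the real GCS read
    }
}
```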

With these changes, the updated request lifecycle looks like this:

Figure 2 -- Updated request lifecycle in Alexandria. Note the dashed arrow for the asynchronous flush.

With this change, we were able to remove GCS writes from the critical path of user requests and get rid of the write latency long tail. We were also able to remove the write-coalescing logic introduced to avoid hitting GCS’s write rate limit, improving the readability and maintainability of our code.

Why Not a Write Ahead Log?

A write ahead log (WAL) is a technique similar to a write back cache. A WAL is an append-only file of records: a history of all updates, while a write back cache only holds the latest state. For example, consider a user updating one asset twice. With a WAL, there will be one record per update, two in total. With a write back cache, the second update overwrites the first, leaving only the end state of that asset record after both updates. We chose a write back cache because, for both loading from the cache and flushing updates to long term storage, we only need the end state.
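The difference is easy to see in terms of data structures: a WAL is an append-only list with one record per update, while a write back cache is a keyed map that keeps only the latest state. A toy illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy contrast between a write ahead log and a write back cache for two
// updates to the same asset.
public class WalVsWriteBack {
    public static void main(String[] args) {
        // WAL: append-only history, one record per update.
        List<String> wal = new ArrayList<>();
        wal.add("asset-1 -> rename to 'beach.jpg'");
        wal.add("asset-1 -> move to folder 'summer'");
        System.out.println("WAL records: " + wal.size()); // 2

        // Write back cache: keyed by asset, the second update overwrites
        // the first, so only the end state remains.
        Map<String, String> cache = new HashMap<>();
        cache.put("asset-1", "name='beach.jpg'");
        cache.put("asset-1", "name='beach.jpg', folder='summer'");
        System.out.println("Cached entries: " + cache.size()); // 1
    }
}
```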

Why Not a Queue?

You may also be wondering: why not use a queue? Asynchronously committing changes might sound like a classic queue problem, but in our use case each library is logically its own database. Transactions are scoped per library; we always read and write updates for a single library at a time. When we load a library from GCS into the in-memory cache, we want to apply any unflushed updates for that library, and it’s not ideal to scan a queue that also includes updates for every other library. We’d need a queue for each library, millions of queues in total, which didn’t seem practical. A queue would also effectively give us a write ahead log, which, as discussed earlier, is more than we need.

The Result

To release this change safely, we created a canary fleet of Alexandria deployed with the write back cache branch. We load tested it to verify everything worked, then gradually moved production traffic to it. The following is a graph of the write endpoint’s request latency while we were ramping up to 100%. The top blue line is the old fleet and the bottom green line is the canary fleet. The average latency is about 30% of what it used to be, while the p99 is about 10% of what it used to be and much more stable.

Looking Ahead

Alexandria is always evolving. With all the new development, such as AI generated images and videos, media assets will be even more important for our customers’ online presence. And we are committed to making that process easier. If you are interested, come join us!
