Keeping it 100(x) with real-time data at scale

At Figma, multiplayer collaboration is central to everything we do—from file canvas editing to features like comments and FigJam voting sessions. Keeping data up-to-date across product surfaces is critical for effective team collaboration and core to what makes Figma feel magical. LiveGraph, Figma’s real-time data-fetching service, is the foundation that makes this possible.

LiveGraph provides a web API for subscribing to GraphQL-like queries and returns the result as a JSON tree. Like other GraphQL backends, we have a schema describing the entities and relations that make up our object graph, along with views that allow querying a subset of that graph. Through our custom React Hook, the front-end automatically re-renders on update with no additional work from engineers.

A flowchart on a yellow background featuring geometric shapes and arrows indicating connections and flow. A prominent blue circle with a recycling arrow symbolizes a cyclical process, and directional arrows guide the flow between elements.

Software Engineers Rudi Chen and Slava Kim share an inside look at how we empower engineers to build real-time data views, while abstracting the complexity of pushing data back and forth.

At the end of our earlier exploration of LiveGraph, we predicted that our next bottleneck would be ingesting all database updates on each LiveGraph server. While this was certainly a factor, it was just one of the many scaling challenges we’ve faced with LiveGraph. Figma’s expanding user base and increasing LiveGraph usage mean more client sessions, each of which is increasingly expensive. The number of sessions has tripled since 2021, with view requests growing 5x in just the last year. Conversely, this growth is leading to large changes in the underlying infrastructure: From a single Postgres instance to many vertical and horizontal shards, the database below LiveGraph is shifting and we have to keep up.

Building for the future

We needed a solution that enables exponential scale. Any iteration of LiveGraph’s design had to adhere to some requirements:

  • Keep Figma fast: Uphold service-level objectives (SLOs) about initial load times and updates—any new design must maintain or improve upon our current numbers
  • Enable database scale: Natively support more vertical and horizontal database shards without compromising reliability or performance
  • Use multiple scaling levers: Scale different parts of the system depending on whether client reads or query updates are growing
  • Migrate safely and transparently: Make incremental improvements without impacting LiveGraph users

We launched an initiative called “LiveGraph 100x,” a long-term plan to efficiently scale LiveGraph’s current read and database update load by 100x. This initiative forced us to go back to the drawing board and reimagine how we would design LiveGraph’s architecture from the ground up.

LiveGraph’s growing pains

The PostgreSQL replication stream writes all primary database row mutations to a write-ahead log (WAL) to keep replica databases up-to-date. The stream includes row-level changes such as the pre- and post-image of a row, along with a monotonically increasing sequence number. LiveGraph re-uses this stream to deliver live updates. As a result, all LiveGraph queries hit the primary database.

Just two years ago, LiveGraph was made up of one server with an in-memory query cache that received updates by tailing the replication stream from a single Postgres instance. Our cache implementation was mutation-based: For every row mutation sent down by Postgres, the corresponding query result was directly modified in the cache.

LiveGraph’s original architecture

At this scale, a lot of things just work. For example, LiveGraph’s implementation relied on a globally ordered stream of updates, a reasonable assumption to make when there’s one primary database without replicas. But Figma’s user base continued to grow and began to strain the single Postgres instance. As our lone database splintered into a collection of vertical shards, global ordering was no longer guaranteed—many database shards produced updates at the same time, in no particular order.

When the Postgres instance reached its capacity, time was our main concern. We had to quickly make a strategic change that enabled the database to scale vertically. We put together a stopgap solution to maintain the global ordering assumption that was deeply baked into LiveGraph: artificially combining all replication streams into one.

Vertical sharding woes

LiveGraph’s vertical sharding architecture

Since there was still one ordered updates stream and one query pathway, LiveGraph’s mechanics continued to function as if vertical sharding never happened. However, we soon realized a few major limitations with this design:

  1. Excessive fan-out: As the number of sessions grew, we needed to provision more servers to match the load. But since mutations were sent to every server, the required bandwidth would soon consume too many resources, threatening to prevent LiveGraph from scaling.

  2. Excessive fan-in: As more database shards were added and the number of updates grew, so did the number of events processed at each server. At some point, this stream would be too big to efficiently process in one place, putting LiveGraph in the blocking path of database scale.

  3. Tight coupling of reads and updates: We only had one scaling lever—sizing up the fleet—which further worsened both fan-in and fan-out. There was no way to independently handle shifts in read traffic versus shifts in database updates traffic.

  4. Fragmented caches that reduce performance: Our caching strategy was wasteful. Different clients requesting the same views could connect to different servers, so the cache hit rate decreased as we scaled up the fleet. Since the cache was built into the server, it was blown away on every deploy, creating a thundering herd that was getting bigger by the day.

  5. Large product blast radius from transient failures: LiveGraph provides optimistic updates to power UX surfaces requiring immediate feedback from user actions. This interface provides a seamless experience by showing “shadow state” from user changes on the front-end and carefully removing it once the change has been persisted on the back-end.

    In our vertical sharding architecture, every database shard participated in every user optimistic update. The global stream could move forward only if each shard was producing updates. This meant that transient blips had outsized effects on users—if one shard was unavailable, all optimistic updates stalled (no comments could be made!), though only a fraction could have been affected by the particular failure.

In distributed systems, a thundering herd occurs when a resource is overwhelmed by too many simultaneous requests. The name is derived from the vision of a huge herd of bulls coming at you; there’s nowhere to run. On every LiveGraph deploy, all clients reconnected at the same time and hit an empty cache, thundering the database.

LiveGraph optimistic updates power comments in Figma

These limitations were so fundamental that incremental fixes and optimizations would be insufficient. Fortunately, our stopgap solution bought us enough time to re-architect LiveGraph at a more foundational level. As Figma continued to grow and the database team began to gear up for horizontal sharding, this project became existential. We needed a new scaling plan—and soon.

Data-driven engineering: a new design

One initial scaling strategy we discussed was to move LiveGraph’s query cache into its own service and shard it the same way as the database. On the surface, this seems like a natural progression. However, this meant that LiveGraph would have to know about database topology to correctly route queries—exactly what another service at Figma, dbproxy, does. More importantly, LiveGraph would be implicated in every database re-sharding operation, too much complexity for us to take on. This left us in an open-ended position with many architecture variations floating around. Our final design was motivated by a couple of key insights into the system.

First, we realized that LiveGraph traffic is driven by initial reads. As we dug into traffic patterns, we observed that the larger part of client data is delivered due to initial reads rather than live updates. This might seem counterintuitive considering that LiveGraph was created to transmit real-time updates, but in fact most query results never change after initial load. This meant that we could optimize for high-read traffic and afford to be slightly suboptimal on the update path without impacting user experience. Specifically, our cache could be invalidation-based without overloading the database, because invalidations on active queries are infrequent. To verify this, we instrumented our mutation-based cache to obtain a conservative estimate on how an invalidation-based cache would perform. What we found further confirmed our theory: Most updates only cause a few queries to be invalidated, so the fan-out from invalidations is limited.

The historical motive for a mutation-based cache was load: When Figma ran on a single primary instance of Postgres, the database was extremely sensitive to any spikes. Issuing too many queries at once could topple the database, so we used a mutation-based cache to deliver new results without re-queries. This was helpful at the time, but with database scaling, capacity is not nearly as precious.

An invalidation-based cache simplified several parts of the system. Caches need to be notified only about the possibility of changes to query results, rather than receive each exact change itself. Re-querying the database on invalidation always yields the newest result, so ordering of updates is insignificant and our singular stream could be broken up. Optimistic updates are also streamlined; rather than waiting for a change to flow through the system, clients simply invalidate and re-query the cache to receive new data.

Fan-in before and after LiveGraph 100x

At this point, we had decided on a global invalidation-based cache sharded by query hash, but our invalidation strategy was unclear. Since LiveGraph sessions are long-lived, cache nodes need to be aware of all active queries and watch for invalidations on them. One could imagine that caches subscribe to these queries in an invalidator service. But eventually, this approach would fall into an unscalable pattern because the number of active queries would be too large for a single invalidator to hold. Luckily, with one fundamental insight we were able to make the invalidator service completely stateless.

After thoroughly analyzing the LiveGraph schema, our second key discovery was that most LiveGraph queries are easy to invalidate. To do this, we built a schema inspection tool that analyzed different query structures and their frequencies. We found that in almost all cases, given a database row mutation, it is straightforward to figure out which queries should be re-fetched. This makes it possible to correctly generate invalidations for all queries without actually needing to keep track of which ones are actively subscribed to, as long as invalidators are aware of the shapes that queries can take (more on this later).

Systems where it is non-trivial to calculate the affected queries must be designed quite differently, so this finding was significant. For us, it meant that stateless invalidators could be aware of both database topology and the cache sharding strategy to deliver invalidations only to relevant caches, removing the excessive fan-in and fan-out of all database updates.

This process taught us the value of taking a deep look at an existing system and letting its usage drive the design. Realizing invalidations are infrequent and most queries are easy to invalidate was pivotal, and we had to invest directly in observability tools to come to this conclusion. At the end of this journey and many, many design docs later (seriously, I saw docs named Livegraph 100x, livegraph-100, and Live-Graph 100x—is the G capitalized? Is there a dash?), materialized a new multi-tier architecture.

LiveGraph 100x: A new architecture

LiveGraph 100x is written in Go and made up of three services:

  • The edge handles client view requests, expands these into multiple queries, and reconstructs the results into loaded views to return to the client. It subscribes to queries in the cache and re-fetches queries on invalidation to deliver newer results to the client.
  • The read-through cache stores database query results and is sharded by query hash. It is agnostic to database topology and only consumes invalidations within its hash range. On invalidation, it evicts entries from its in-memory cache before forwarding the invalidation to the edge using a probabilistic filter. By keeping hot replicas on standby and deploying the cache separately from the edge, thundering herd is no longer a recurring issue.
  • The invalidator is sharded the same way as the physical databases and tails a single replication stream. It is the only service with knowledge about database topology. Upon a database mutation, it generates invalidations to send to relevant cache shards.

Put all together, it looks something like this:

New architecture for LiveGraph 100x

This new architecture addresses all of our previous concerns:

  • No more excessive fan-in and fan-out: With clever sharding and probabilistic filters, fan-in and fan-out at each node is limited. Moving forward, we can increase capacity by adding more caches and edges without overwhelming node-to-node bandwidth.
  • Natively supports database sharding: LiveGraph can support many vertical and horizontal shards, without introducing additional complexity to re-shard operations.
  • Caches are not fragmented: A global invalidation-based cache reduces memory usage and code complexity. By deploying the cache and edge separately, we remove the possibility of a thundering herd event associated with deploys.
  • Product is resilient to transient failures: Optimistic updates use straightforward logic and are robust against temporary disruptions.

Over the last year and half, we have been putting this new architecture into production. Since we wanted to ship LiveGraph 100x incrementally, we first targeted the least scalable component of our old stack: the cache and its downstream dependencies. There were two central challenges we faced in this migration.

Invalidations: easy or not?

We previously mentioned that most invalidations are easy. But how does this actually work? Queries in LiveGraph are derived from our schema’s object graph. This schema evolves on a human scale, changing on a day-to-day basis with code updates, unlike the sub-second frequency required for generating invalidations. This means we can distribute query information to LiveGraph 100x services prior to users’ requests for them, guaranteeing that invalidations will be generated by the time they are needed.

For example, let’s say we wanted to query comments on a given file. The schema for this would look like:

TypeScript

type File {
 id: String!
 name: String!
 updatedAt: Date!
 comments: [Comment] @filter("Comment.fileId=id AND Comment.deletedAt=null")
}

Our cache essentially serves SQL queries—we convert edges between objects into SQL filters to query against Postgres. So the above edge from File → comments would give the database query:

SQL

SELECT * FROM comments WHERE file_id = $1 AND deleted_at = NULL

Every un-parameterized query can be uniquely identified; we call these “query shapes.” If we assign unique IDs to query shapes, then a cache key for a query is uniquely identified by its shape ID and its arguments. For example, if we call the above shape file_comments, then the query described by file_comments("live-graph-love") refers to the parameterized query:

SQL

SELECT * FROM comments WHERE file_id = "live-graph-love" AND deleted_at = NULL

Crucially, we can apply this substitution in the reverse direction as well, when receiving a mutation from the database. Given the update:

We inspect all query shapes and invalidate any queries that result from substituting values in the pre- and post- image into the shape. In this example, we would invalidate the query file_comments("live-graph-love") by substituting the column value for file_id. Note that this substitution requires knowing only the query shapes in the schema; not the entire set of active queries.

JSON

{
 "table": "comments",
 "preImage": {
 "id": "123",
 "created_at": "January 1, 2000",
 "deleted_at": null,
 "file_id": "live-graph-love",
 },
 "postImage": {
 "id": "123",
 "created_at": "January 1, 2000",
 "deleted_at": "October 3, 2004",
 "file_id": "live-graph-love",
  }
}

Hard queries

Not all cases are so simple, however! Imagine instead that we wanted to view comments in a time range:

TypeScript

type File {
 Id: String!
 name: String!
 updatedAt: Date!
 comments: [Comment] @filter("Comment.fileId=id AND Comment.createdAt > File.updatedAt")
}

This would yield the SQL query:

SQL

SELECT * FROM comments WHERE file_id = $1 AND created_at > $2

We’ll call this shape new_comments. Given the same database update as before, what queries should we invalidate? Conceptually, any query with a created_at argument before or equal to January 1, 2000 could have changed. But this would naively invalidate infinitely many queries—as many as the granularity of our time system. It would be infeasible both to generate these and propagate them throughout our system. The space of queries with possibly infinite fan-out on an update is what we define as “hard.”

Let’s take a closer look. We can further break this query down into expressions and define these as “easy” or “hard”—in our example, it’s the created_at > ? expression that makes it “hard” to generate invalidations. So, our query has the easy expression new_comments_easy:file_key = ? and the hard expression new_comments_range:created_at > ?. Hard expressions are not limited to range-based queries; we have the following easy-hard breakdown for common schema filters:

Easy expressions vs. hard expressions

Although hard queries only make up a tiny proportion of our schema—currently, only 11 out of ~700 queries—they are fundamental query patterns and we had to find a way to invalidate them. Luckily, we found that all queries in our schema are normalizable into (easy-expr) AND (hard-expr), and we require this to stay true going forwards. Queries without a hard expression simply ignore the latter part of this normalization. This is significant because it allows us to only invalidate easy expressions, ignoring hard expressions altogether.

We do this by using a clever caching trick! First, we shard caches by hash(easy-expr) rather than hash(query). By co-locating all hard queries with the same easy expression on the same cache instance, invalidations for the easy expressions can be delivered to one cache rather than fanned out to all of them. Second, we cache queries with a hard expression in two layers:

  • A top-level {easy-expr} key, which stores a nonce
  • The actual key, {easy-expr}-{nonce}-{hard-expr}, storing database results

Now, given an easy expression invalidation, we can delete the top-level key, which effectively evicts all hard queries sharing that easy expression. In the example mutation, this would evict all hard queries that share new_comments_easy("live-graph-love"). Note that this means hard query lookups have a layer of indirection—first we need to lookup the nonce, then use that to look up the query results.

Cache key structure for hard queries

With this trick, invalidations flowing through the system only specify easy expressions, until reaching the edge. At that point, matching hard queries are re-queried in the cache. Since the edge contains user sessions, only the active queries are re-queried, removing the infinite fan-out from our naive invalidation approach. A TTL is used to clean up old cache entries.

Like all architecture decisions, there’s a tradeoff: We get a stateless invalidator and fast invalidations at the expense of over-invalidating hard queries and a more restrictive schema. The former isn’t too concerning given our insight that active queries rarely get invalidated, but if the latter becomes an issue, we can loosen this restriction and build upon our invalidation strategy, thanks to our modular design.

Keeping it consistent

As you know by now, one of LiveGraph’s fundamental API contracts is to keep queries up to date. Because LiveGraph learns about updates by tailing the replication stream as opposed to methods like polling for new data, LiveGraph cannot skip invalidations at any point in the system. This leads to an interesting problem in the cache of juggling simultaneous reads and invalidations, or what we like to call the “read-invalidation rendezvous.” If an invalidation arrives during an in-progress read, there is no way to tell whether the result is from before or after the invalidation. To ensure eventual consistency, LiveGraph must re-fetch in either case.

Accomplishing this takes several steps. First, edges start listening for invalidations on a given query before asking the cache for it. This ensures that when invalidations are propagated upstream, edges are notified even if they have not yet received a result for the query. The actual read-invalidation rendezvous is conducted by a layer just above the in-memory cache. This layer ensures all operations are synchronized (aware of each other) and prevents stale entries by upholding this important behavior:

  1. Operations of the same type can coalesce. For example, two concurrent reads on the same query key will join a single-cache read. This constraint is mainly for capacity, to prevent a frequently updated query from causing many database re-queries and redundant cache delete operations. However, reads should never coalesce to already-invalidated readers (see next two points).
  2. During an inflight read, incoming invalidations mark the read as invalidated and wait for ongoing cache set operations to finish. This ensures that new readers kicked off due to the invalidation will not coalesce to readers with stale results, which would result in a skipped invalidation for the upstream.
  3. During an inflight invalidation, incoming reads should be marked as invalidated and not set the cache, since they could race with the invalidation to get from the cache and setting the cache in this case would yield stale results for future reads.

Two simultaneous readers coalesce to receive the same result

A read-invalidation rendezvous

It was tricky to write code that satisfied these requirements. To test edge cases and ensure that the constraints are upheld under all possible operation sequences, we use multiple validation methods. First, we have a chaos test. Many threads simultaneously run reads and invalidations against a small set of keys to increase concurrency. This gives us high confidence before shipping changes to production.

Second, we use online cache verification to randomly sample queries and ensure they are kept up to date. This checker simultaneously queries a key in both the cache and the primary database, and if the results differ, reports whether an invalidation was seen or skipped.

As a final catch-all, we check for convergence of query results between our old query engine and the new LiveGraph 100x engine. This convergence checker helped us to migrate safely and incrementally, by first verifying the new engine’s correctness for each case before actually switching over to it. Because our old engine is much less efficient, this checker actually required a lot of fine-grained tuning—oftentimes the old system is seconds slower than the new one!

Lessons and future work

We’ve come a long way from the single Node.js server that we started with. Now, LiveGraph is made up of several services that can be scaled independently to match our growth. Notably, the invalidator service can scale as the database team adds new vertical and horizontal shards, while the edge and cache services can scale to handle increasing user traffic and the number of active queries. And scaling any single service doesn’t disproportionately increase the fan-in or fan-out of messages over the network! This is a huge improvement over our old one-piece toolkit of scaling up the entire fleet whenever load became an issue.

There were so many other details—like coming up with a safe migration plan to the new service, how we delivered incremental value, and details about our new sharding strategy—that Braden Walker, a fellow software engineer on our team, covered in his recent Systems@Scale talk, so check it out to learn more. LiveGraph 100x is an ongoing effort and just one of the many steps in our LiveGraph journey. We’re thinking about several future projects: automatic re-sharding of invalidators, resolving queries from non-Postgres sources in the cache, first-class support for server-side computation like permissions. If any of these problems interest you, we’re hiring!

The design and implementation of LiveGraph 100x has been a huge team effort, involving everyone on Figma’s Web Platform team: Bereket Abraham, Braden Walker, Cynthia Vu, Deepan Saravanan, Elliot Lynde, Jon Emerson, Julian Li, Leslie Tu, Lin Xu, Matthew Chiang, Paul Langton, and Tahmid Haque.

ホーム - Wiki
Copyright © 2011-2024 iteam. Current version is 2.139.0. UTC+08:00, 2024-12-28 09:13
浙ICP备14020137号-1 $お客様$