How Figma's database team tackled the challenges of scale
Vertical partitioning was a relatively easy and very impactful scaling lever that bought us significant runway quickly. It was also a stepping stone on the path to horizontal sharding.
Figma’s database stack has grown almost 100x since 2020. This is a good problem to have because it means our business is expanding, but it also poses some tricky technical challenges. Over the past four years, we’ve made a significant effort to stay ahead of the curve and avoid potential growing pains. In 2020, we were running a single Postgres database hosted on AWS’s largest physical instance, and by the end of 2022, we had built out a distributed architecture with caching, read replicas, and a dozen vertically partitioned databases. We split groups of related tables—like “Figma files” or “Organizations”—into their own vertical partitions, which allowed us to make incremental scaling gains and maintain enough runway to stay ahead of our growth.
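To make the idea of vertical partitioning concrete, here is a minimal sketch of how an application might route a query to the database that owns a given table. The table names, partition names, and helper functions are all hypothetical and chosen for illustration; the post does not describe Figma's actual routing code.

```python
# Hypothetical mapping from table name to the vertical partition that owns it.
# Related tables are grouped so that joins between them stay on one database.
TABLE_TO_PARTITION = {
    "files": "figma_files",
    "file_comments": "figma_files",
    "organizations": "organizations",
    "org_members": "organizations",
}

# Tables that have not been moved yet stay on the original database (assumed name).
DEFAULT_PARTITION = "main"


def partition_for(table: str) -> str:
    """Return the name of the partition that holds `table`."""
    return TABLE_TO_PARTITION.get(table, DEFAULT_PARTITION)


def can_join_locally(*tables: str) -> bool:
    """True if every table lives on the same partition, so a SQL join is possible."""
    return len({partition_for(t) for t in tables}) == 1
```

The trade-off this illustrates is that tables in the same group can still be joined in a single query, while a query spanning two groups has to be split up and combined in the application.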
Despite our incremental scaling progress, we always knew that vertical partitioning could only get us so far. Our initial scaling efforts had focused on reducing Postgres CPU utilization. As our fleet grew larger and more heterogeneous, we started to monitor a range of bottlenecks. We used a combination of historical data and load-testing to quantify database scaling limits from CPU and IO to table size and rows written. Identifying these limits was crucial to predicting how much runway we had per shard. We could then prioritize scaling problems before they ballooned into major reliability risks.
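A runway estimate of this kind can be approximated with simple arithmetic. The sketch below assumes compound monthly growth and a fixed limit per metric; both the growth model and the function names are assumptions for illustration, not the method the team actually used.

```python
import math


def runway_months(current: float, limit: float, monthly_growth: float) -> float:
    """Months until `current` reaches `limit`, assuming compound monthly growth.

    Solves current * (1 + monthly_growth) ** n = limit for n.
    """
    if current >= limit:
        return 0.0  # already at or past the limit
    if monthly_growth <= 0:
        return math.inf  # no growth means the limit is never reached
    return math.log(limit / current) / math.log(1.0 + monthly_growth)


def tightest_bottleneck(metrics: dict) -> tuple:
    """Given {name: (current, limit, monthly_growth)}, return the metric
    with the least runway and that runway in months."""
    runways = {
        name: runway_months(cur, lim, growth)
        for name, (cur, lim, growth) in metrics.items()
    }
    name = min(runways, key=runways.get)
    return name, runways[name]
```

Calling `tightest_bottleneck` with one entry per measured limit (CPU, IO, table size, rows written) for a shard would surface whichever resource runs out first, which is the one to prioritize.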