Jellyfish: Cost-Effective Data Tiering for Uber’ s Largest Storage System
摘要
Uber deploys a few storage technologies to store business data based on their application model. One such technology is called Schemaless, which enables the modeling of related entries in one single row of multiple columns, as well as versioning per column.
Schemaless has been around for a couple of years, amassing Uber’s data. While Uber is consolidating all the use cases on Docstore, Schemaless is still the source of truth for different pre-existing customer pipelines. As such, Schemaless uses fast (but expensive) underlying storage technology to enable millisecond-order latency at high QPS. Furthermore, Schemaless deploys a number of replicas per region to ensure data durability and availability in the face of different failure models.
Accumulating more data while using expensive storage, Schemaless has increasingly become a key concern for cost and thus required attention. To this end, we carried out measurements for understanding data access patterns. We found that data is frequently accessed for a period of time, after which it is accessed less frequently. The exact period varies from one use case to another, however, old data must still be readily available upon request.
欢迎在评论区写下你对这篇文章的看法。