TiDB Adoption at Pinterest

Authors: Alberto Ordonez Pereira, Senior Staff Software Engineer | Lianghong Xu, Senior Manager, Engineering

This is the second part of a three-part series. It focuses on how we selected the storage technology that ended up replacing HBase.

Motivation

HBase has been a foundational storage system at Pinterest since its inception in 2013, when it was deployed at massive scale and supported numerous use cases. However, for the reasons described in the previous blog post, it increasingly struggled to keep up with evolving business needs. As a result, two years ago we started searching for a next-generation storage technology that could replace HBase for many years to come and enable business-critical use cases to scale beyond the existing storage limitations.

Methodology for Selecting a Datastore

While many factors need to be considered in the decision making process, our experiences with the pain points of HBase offered invaluable insights that guided our criteria for selecting the next datastore.

  • Reliability. This is crucial, as the datastore would serve Pinterest online traffic. An outage could significantly impact Pinterest’s overall site availability. Key aspects to consider:
    - Resilience to failure of one node or an entire Availability Zone (AZ).
    - Maturity of maintenance operations such as capacity changes, cluster upgrade, etc.
    - Support for multi-region deployment with an active-active architecture.
    - Disaster recovery, meaning support for the operations needed to recover the system from disaster scenarios. This includes full backup and restore at cluster, database, or table granularity, and Point-in-Time Recovery (PITR) or similar, to provide different Recovery Point Objectives (RPOs) to our clients.
  • Performance. The system should achieve consistent and predictable performance at scale.
    - Reasonably good performance with Pinterest’s workloads. While public benchmark results provide useful data points on paper, we need to battle test the system with production shadow traffic.
    - Sustained performance in terms of tail latency and throughput is key. Transient failures, offline jobs, or maintenance operations should not have a major impact on the performance.
  • Functionality. The system should provide essential built-in functionalities to facilitate application development. For us, these include:
    - Global secondary indexing. Many applications require secondary indexes for efficient data access. However, existing solutions either do not scale (e.g., MySQL), require heavy work from developers to maintain customized indexes (e.g., HBase, KVStore), or do not guarantee index consistency (e.g., Ixia).
    - Distributed transactions. ACID semantics in distributed transactions make it easy for developers to reason about their applications’ behaviors. It is important to achieve this without compromising much performance.
    - Online schema changes. Schema changes are a constant need and should be conducted in a reliable, fast, and consistent way.
    - Tunable consistency. Not all use cases require strong consistency, and it would be desirable to have certain flexibility on performance vs consistency tradeoffs.
    - Multi-tenancy support. A single deployment should be able to serve multiple use cases to keep costs at a reasonable level. Being able to avoid noisy neighbors is fundamental to good quality of service.
    - Snapshot and logical dump. Offline analytics require the capability to export a full snapshot of a database/table to an object storage such as S3.
    - Change Data Capture (CDC). CDC is an essential requirement for many near-real-time use cases to stream database changes. It is also needed to support incremental dumps or to keep replicated clusters in sync (for disaster recovery or multi region purposes).
    - Data compression. Fast and effective data compression is a must to keep the overall space usage under control.
    - Row-level Time To Live (TTL). Helps with keeping data growth at bay for use cases that just need ephemeral storage.
    - Security. Access controls, encryption at rest and in transit, and other similar features are required for security compliance.
  • Open source and community support. At Pinterest, we advocate for embracing and contributing to open source technologies.
    - An active and thriving community generally indicates continued product improvement, increasing industry usage, and ease of attracting talent, all of which are key to successful long-term adoption.
    - Documentation quality is of critical importance for self learning and is also an indicator of the product maturity.
    - A permissive license (e.g., Apache License 2.0) provides more flexibility in the software usage.
  • Migration efforts. The migration efforts from HBase to the new system should be manageable with justifiable return on investment.
    - Migration tooling support (e.g., bulk data ingestion) would be a requirement for initial proof-of-concept development and large-scale production migrations.

With all these considerations in mind, here is what we did:

  1. We identified datastore technologies that were relevant to the workloads we planned to support.
  2. We ran a matrix analysis on paper considering the criteria above based on publicly available information, which helped us exclude a few options in the early phase.
  3. We selected the three most promising technologies that passed the initial screening in step 2 and evaluated them using public benchmarks with synthetic workloads.
  4. We tested the datastores internally at Pinterest using shadow traffic, mirroring production workload traffic to the new datastore candidates. This process provided valuable insights into the systems’ reliability and performance characteristics, helping us differentiate and make a final decision among the top candidates. Ultimately, we selected TiDB.
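
As a rough illustration of the paper screening in step 2, the matrix analysis can be pictured as a hard-requirement filter followed by a weighted score. The sketch below is only an assumption about how such a matrix might be encoded: the candidate names, feature sets, scores, and weights are hypothetical placeholders, not the data from the actual evaluation.

```python
from dataclasses import dataclass

# Hypothetical weights for the five criteria groups listed above.
WEIGHTS = {
    "reliability": 0.30,
    "performance": 0.25,
    "functionality": 0.25,
    "open_source": 0.10,
    "migration": 0.10,
}


@dataclass(frozen=True)
class Candidate:
    name: str
    scores: dict         # criterion -> score in [0, 5], from public information
    features: frozenset  # built-in capabilities the datastore documents


def screen(candidates, hard_requirements, weights=WEIGHTS):
    """Drop candidates missing any hard requirement, then rank the rest."""
    survivors = [c for c in candidates if hard_requirements <= c.features]

    def total(c):
        return sum(weights[k] * c.scores.get(k, 0) for k in weights)

    ranked = [(c.name, round(total(c), 2)) for c in survivors]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Placeholder candidates; "store-b" lacks global secondary indexing.
candidates = [
    Candidate("store-a",
              {"reliability": 4, "performance": 4, "functionality": 5,
               "open_source": 5, "migration": 3},
              frozenset({"secondary_index", "distributed_txn"})),
    Candidate("store-b",
              {"reliability": 5, "performance": 5, "functionality": 2,
               "open_source": 4, "migration": 4},
              frozenset({"distributed_txn"})),
    Candidate("store-c",
              {"reliability": 3, "performance": 4, "functionality": 4,
               "open_source": 3, "migration": 3},
              frozenset({"secondary_index", "distributed_txn"})),
]

ranking = screen(candidates, frozenset({"secondary_index", "distributed_txn"}))
print(ranking)
```

Treating must-have features as a filter rather than as a heavily weighted score keeps a candidate with strong numbers elsewhere from masking a missing capability.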

The TiDB Adoption Journey

In 2022, we started the evaluation by picking 10+ datastore technologies that were potential candidates for our workloads: Rockstore (in-house KV datastore), ShardDB (in-house sharded MySQL), Vitess, VoltDB, Phoenix, Spanner, CosmosDB, Aurora, TiDB, YugabyteDB, and DB-X (a pseudonym).

We then ran the matrix analysis using the methodology described above, which ended up excluding the majority of the candidates. Below is the list of datastores we excluded and the reasoning behind it:

  • Rockstore: no built-in support for secondary indexing and distributed transactions.
  • ShardDB: no built-in support for global secondary indexing or distributed transactions (only at shard level). Horizontal scaling is challenging and requires manual resharding.
  • Vitess: cross-shard queries may not be optimized, and it seems to incur relatively high maintenance costs.
  • VoltDB: does not provide data persistence required to store source-of-truth data.
  • Phoenix: built on top of HBase and shares a lot of pain points with HBase.
  • Spanner: appealing in many dimensions but is not open source. The migration cost from AWS to GCP could be prohibitively high at our scale.
  • CosmosDB: (similar to above).
  • Aurora: while offering great read scalability, its limitation in write scalability can be a showstopper for some of our business critical use cases.

The three remaining options were TiDB, YugabyteDB, and DB-X. They all belong to the NewSQL database category, which combines scalability from NoSQL datastores and ACID guarantees from traditional RDBMSs. On paper, all three datastores seemed promising with similar functionalities. We then conducted preliminary performance evaluation using some of the YCSB benchmarks with a minimal setup, for which all datastores provided acceptable performance.

While results with synthetic workloads provided useful data points, we realized that the deciding factor would be stability and sustained performance under Pinterest’s production workloads.

To break the tie, we built a POC for each of the three systems within Pinterest infrastructure and conducted shadow traffic evaluation with a focus on performance and reliability.

We used shadow traffic from Ixia, our near-real-time indexing service built on top of HBase and an in-house search engine (Manas). Ixia supported a diverse array of use cases, encompassing various query patterns, data sizes, and QPS volumes, which we believed were a good representation of our production workloads. Specifically, we selected a few use cases with large data sizes (on the order of TBs), 100k+ QPS, and a decent number of indexes to gain insights into how these systems perform at a larger scale.

We evaluated the three final candidates using the same shadow production workloads. To be as fair as possible, we worked directly with the support teams of these systems on tunings/optimizations to the best of our capabilities. From our observations, YugabyteDB and DB-X struggled to provide sustained performance under our workloads. Example issues include sporadic high CPU usage on individual nodes that led to latency increases and cluster unavailability, significant write performance degradation as the number of indexes increased, and the query optimizer not picking the optimal indexes during query analysis. On the other hand, TiDB was able to sustain the load after several rounds of tuning while providing generally good performance. As a result, after a few months of evaluation, TiDB stood out as the most promising candidate.
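
To make the shadow traffic idea concrete, here is a minimal sketch, assuming each datastore is reachable through a plain callable. The `ShadowMirror` class, the `flaky` candidate, and the request shape are all hypothetical; a production mirror would replay requests asynchronously and off the serving path, which this synchronous version does not attempt.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


class ShadowMirror:
    """Serve from the primary store and replay each request on candidates."""

    def __init__(self, primary, candidates):
        self.primary = primary        # callable(request) -> response
        self.candidates = candidates  # name -> callable(request)
        self.latencies = {name: [] for name in candidates}
        self.errors = {name: 0 for name in candidates}

    def handle(self, request):
        response = self.primary(request)  # only this result reaches the client
        for name, call in self.candidates.items():
            start = time.perf_counter()
            try:
                call(request)             # the shadow result is discarded
            except Exception:
                self.errors[name] += 1    # candidate failures never affect serving
            else:
                self.latencies[name].append(time.perf_counter() - start)
        return response


def flaky(request):
    """Hypothetical candidate that times out on every tenth request."""
    if request % 10 == 0:
        raise TimeoutError("simulated candidate timeout")
    return request


mirror = ShadowMirror(
    primary=lambda r: r * 2,
    candidates={"candidate-a": lambda r: r, "candidate-b": flaky},
)
responses = [mirror.handle(i) for i in range(1, 101)]

print(mirror.errors)
for name, samples in mirror.latencies.items():
    print(name, "p99 seconds:", percentile(samples, 99))
```

Comparing candidates on tail percentiles and error counts collected under identical mirrored requests is what lets sustained performance, rather than average performance, drive the decision.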

To close the loop, we ran a number of reliability tests against TiDB, including node restart, cluster scale-out, graceful/forceful node termination, AZ shutdown, online DMLs, cluster redeploy, cluster rotation, etc. While we experienced some issues (such as slow data transfer during TiKV node decommissioning), we did not spot any fundamental flaws.

With that, we made the final decision to move on with TiDB. It is important to note that the decision making process was based on our best understanding at the time of evaluation (2022) with Pinterest’s specific workloads. It is entirely possible that others may come up with a different conclusion depending on their own requirements.

TiDB in Production

In this blog we won’t be covering TiDB-specific architecture details, as those can be found in many places, especially in the official documentation. We recommend becoming familiar with TiDB’s basic concepts before reading on.

Deployment

We use Teletraan, an in-house deployment system, to run TiDB, while the vast majority of TiDB customers deploy TiDB using Kubernetes and use TiDB’s built-in tool (TiDB Operator) for cluster operation. This is mainly due to the lack of support for Kubernetes at Pinterest back when we started the adoption. It meant that we had to replicate functionalities in the TiDB Operator and develop our own cluster management scripts, which was not ideal and created friction during initial integration. While we have achieved significant stability improvements with the current setup, as the infra support for Kubernetes/EKS matures at Pinterest, we are planning to migrate TiDB onto EKS.

By default, we use three-way replication for all of our TiDB deployments. When needed, we use an additional read-only replica for data snapshots to minimize the impact on online serving. Compared to our HBase deployment with six replicas (two clusters, each with three replicas), TiDB allows us to reduce the storage infra cost by almost half in many cases.
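
The replica arithmetic behind the "almost half" figure can be worked through in a few lines. This is a simplified sketch: the 100 TB logical size is a hypothetical number, and the model ignores differences in compression, instance pricing, and compute, which is one reason real savings vary by use case.

```python
def storage_footprint_tb(logical_tb, replicas):
    """Raw storage needed to hold `replicas` full copies of the data."""
    return logical_tb * replicas


def saving(before, after):
    """Fractional reduction when going from `before` to `after`."""
    return 1 - after / before


logical_tb = 100  # hypothetical dataset size

hbase_tb = storage_footprint_tb(logical_tb, 2 * 3)       # two clusters x three replicas
tidb_tb = storage_footprint_tb(logical_tb, 3)            # three-way replication
tidb_with_learner_tb = storage_footprint_tb(logical_tb, 3 + 1)  # plus a read-only replica

print(f"TiDB vs HBase: {saving(hbase_tb, tidb_tb):.0%} less storage")
print(f"TiDB with read-only replica vs HBase: "
      f"{saving(hbase_tb, tidb_with_learner_tb):.0%} less storage")
```

Under these assumptions the saving is 50% with three replicas and about 33% when a read-only replica is added for snapshots.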

Currently, TiDB is deployed in a single AWS region, with three replicas each deployed in a different availability zone (AZ). We use placement groups to distribute a Raft group among the three AZs to be able to survive an AZ failure. All communications between the three key TiDB components (PD, SQL, and TiKV) are protected using mutual-TLS plus CNAME validation. The only layer that is exposed to external systems is the SQL layer, which, as of today, is fronted by Envoy. At the time of writing, we are exploring different multi-region setups, as well as removing Envoy as the proxy to the SQL layer (since we need better control over how we manage and balance connections).
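
Why one replica per AZ matters can be checked with a small quorum calculation. The function below is an illustrative sketch, not part of any TiDB tooling: it takes the AZ of each voting replica in a Raft group (the AZ names are placeholders) and reports whether a majority survives the loss of any single AZ. Read-only learner replicas do not vote and are deliberately left out.

```python
from collections import Counter


def survives_single_az_loss(voter_azs):
    """True if the Raft group keeps a majority after losing any one AZ."""
    quorum = len(voter_azs) // 2 + 1
    replicas_per_az = Counter(voter_azs)
    return all(len(voter_azs) - lost >= quorum
               for lost in replicas_per_az.values())


# One voter per AZ: losing any AZ leaves 2 of 3 voters, which is a majority.
print(survives_single_az_loss(["az-a", "az-b", "az-c"]))

# Two voters in the same AZ: losing that AZ leaves 1 of 3, so quorum is lost.
print(survives_single_az_loss(["az-a", "az-a", "az-b"]))
```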

Compute Infrastructure

We currently deploy TiDB on instance types with Intel processors and local SSDs. Nevertheless, we are exploring migrating to Graviton instance types for better price-performance and to EBS for faster data movement (and, in turn, shorter MTTR on node failures) in the future. The following describes how we run each of the core components on AWS.

The PD layer runs on c7i, with the number of vCPUs varying by load level. The two main factors that drive PD node sizing are the cluster size (the more regions, the more workload) and offline jobs making region location requests to PD.

The SQL layer mostly runs on m7a.2xlarge. Given that the SQL layer is stateless, it is relatively easy to scale the cluster out to add more computation power.

The TiKV layer (stateful) runs on two instance types, based on the workload characteristics. Disk bound workloads are supported with i4i.4xlarge instances (3.7TB NVMe SSD), whereas compute bound workloads are on c6id.4xlarge instances. We also use i4i instances for TiKV read-only nodes (Raft-learners) since they are generally storage bound.

Online Data Access

TiDB exposes a SQL compatible interface to access data and to administer the cluster. However, at Pinterest, we generally do not expose datastore technologies directly to the clients. Instead, we proxy them with an abstract data access layer (typically a Thrift service) that satisfies the needs of most clients. The service we use to proxy TiDB (and other technologies) is called Structured DataStore (SDS), which will be covered in detail in the third part of this blog series.

Offline Analytics

While TiDB offers TiFlash for analytical purposes, we are not currently using it, as it appears better suited to aggregate queries than to ad hoc analytical queries. Instead, we use TiSpark to take full table snapshots and export them to S3. These snapshots are exported as partitions of a Hive table and then used for offline analytics by our clients.

During productionization of the TiDB offline pipeline, we identified some challenges and their mitigations:

  • Clients requesting too frequent full snapshots. For such cases, we would generally recommend the use of incremental dumps via CDC plus Iceberg.
  • TiSpark overloading the cluster, for two reasons:
    - PD overloading: When taking a snapshot, TiSpark needs to ask the PD layer for the location of each region. In large clusters this can be a lot of data, which causes the PD node CPU to spike. This can affect online workloads, since TiSpark only communicates with the PD leader, which is also responsible for serving TSO for online queries.
    - TiKV overloading: The data ultimately comes from the storage layer, which is already busy with online processing. To limit the impact on online query processing for use cases that demand it, we spin up Raft learners (read-only nodes) for TiSpark to use, which nearly isolates online traffic from offline processing (the network is still shared).

Change Data Capture

CDC is a fundamental component in nearly all storage services at Pinterest, enabling important functionalities including:

  • Streaming database changes for applications that require observations of all the deltas produced to a particular table or database.
  • Cluster replication, which may be used for high availability, to achieve multi-region deployments, etc.
  • Incremental dumps, which can be achieved by means of full snapshots and CDC deltas with Iceberg tables. This removes a lot of the pressure that offline jobs put on the cluster, especially if clients need very frequent data dumps.
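
The incremental dump idea in the last bullet amounts to folding ordered CDC deltas into the most recent full snapshot. The sketch below shows that merge on in-memory dictionaries. The event fields (`commit_ts`, `op`, `key`, `row`) are a hypothetical shape chosen for illustration, not the TiCDC message format, and in practice a table format such as Iceberg would perform the merge.

```python
def apply_deltas(snapshot, deltas):
    """Fold CDC events, in commit order, into a snapshot keyed by primary key."""
    table = dict(snapshot)  # leave the original snapshot untouched
    for event in sorted(deltas, key=lambda e: e["commit_ts"]):
        if event["op"] == "delete":
            table.pop(event["key"], None)
        else:  # insert or update
            table[event["key"]] = event["row"]
    return table


snapshot = {1: {"title": "a"}, 2: {"title": "b"}}
deltas = [
    {"commit_ts": 30, "op": "delete", "key": 2},
    {"commit_ts": 10, "op": "upsert", "key": 1, "row": {"title": "a2"}},
    {"commit_ts": 20, "op": "upsert", "key": 3, "row": {"title": "c"}},
]

merged = apply_deltas(snapshot, deltas)
print(merged)
```

Because only the deltas are read between snapshots, frequent dumps no longer require repeated full table scans against the cluster.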

In our experience (and this is something we have been working on with PingCAP for a while now), TiCDC, which is TiDB’s CDC framework, has some throughput limitations that have complicated the onboarding of large use cases that require CDC support. It currently caps at ~700 MB/s throughput, but a complete re-architecture of TiCDC to remove this limitation may be on the horizon. For use cases that ran into this limit, we were able to work around it by using flag tables, which are essentially tables consisting of a timestamp and a foreign key to the main table. The flag table is always updated when the main table is, and CDC is defined on the flag table, so it outputs only the id of the row that was modified.
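
The flag table pattern can be sketched with Python's built-in `sqlite3` module standing in for TiDB. Everything named here is hypothetical: the `pins` and `pins_flag` tables, their columns, and the `flag_changes_since` helper, which only imitates what a changefeed defined on the flag table would emit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite is only a stand-in for TiDB here
conn.executescript("""
    CREATE TABLE pins (id INTEGER PRIMARY KEY, payload TEXT NOT NULL);
    CREATE TABLE pins_flag (
        pin_id     INTEGER PRIMARY KEY REFERENCES pins(id),
        updated_at REAL NOT NULL
    );
""")


def write_pin(conn, pin_id, payload, ts):
    """Update the main row and its flag row in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pins (id, payload) VALUES (?, ?)",
            (pin_id, payload),
        )
        conn.execute(
            "INSERT OR REPLACE INTO pins_flag (pin_id, updated_at) VALUES (?, ?)",
            (pin_id, ts),
        )


def flag_changes_since(conn, ts):
    """Imitates a changefeed on the flag table: ids touched after `ts`."""
    rows = conn.execute(
        "SELECT pin_id FROM pins_flag WHERE updated_at > ? ORDER BY updated_at",
        (ts,),
    )
    return [pin_id for (pin_id,) in rows]


write_pin(conn, 1, "x" * 1000, ts=1.0)
write_pin(conn, 2, "y" * 1000, ts=2.0)
write_pin(conn, 1, "z" * 1000, ts=3.0)

changed = flag_changes_since(conn, 1.5)
print(changed)
```

The change stream then carries a small id and timestamp per write instead of the full row, and consumers that need the payload read it from the main table.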

In terms of message format, while we currently use the open protocol due to its more efficient binary format and lower overhead in schema management, we are actively experimenting with a migration to Debezium (the de-facto standard at Pinterest) to simplify upstream applications development.

Disaster Recovery

We run daily (or hourly in some cases) full cluster backups that are exported to S3 by using the TiDB BR tools. We have also enabled the Point In Time Recovery (PITR) feature, which exports the delta logs to S3 to achieve RPOs (Recovery Point Objective) on the order of seconds/minutes.

Taking backups can impact overall cluster performance, as a significant amount of data is moved through the network. To avoid that, we limit the backup speed to values we feel comfortable with (by default, ratelimit=20 and concurrency=0). The backup speed varies depending on the number of instances and the amount of data in a cluster (from 2 GB/s to ~17 GB/s). We retain full daily backups for 7–10 days and hourly backups for 1–2 days. PITR is continuously enabled with a retention period of ~3 days.
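
A retention policy like the one above can be expressed as a small pruning rule. This is a sketch under assumptions: it uses the lower bounds of the stated ranges (7 days for daily backups, 1 day for hourly ones), and the backup records and paths are hypothetical rather than the layout BR writes to S3.

```python
from datetime import datetime, timedelta

# Lower bounds of the retention ranges described above.
RETENTION = {"daily": timedelta(days=7), "hourly": timedelta(days=1)}


def expired_backups(backups, now, retention=RETENTION):
    """Return the paths of backups older than the retention for their kind."""
    return [b["path"] for b in backups
            if now - b["taken_at"] > retention[b["kind"]]]


now = datetime(2024, 1, 10, 12, 0)
backups = [
    {"path": "daily/2024-01-01", "kind": "daily",
     "taken_at": datetime(2024, 1, 1)},
    {"path": "daily/2024-01-05", "kind": "daily",
     "taken_at": datetime(2024, 1, 5)},
    {"path": "hourly/2024-01-09T06", "kind": "hourly",
     "taken_at": datetime(2024, 1, 9, 6)},
    {"path": "hourly/2024-01-10T08", "kind": "hourly",
     "taken_at": datetime(2024, 1, 10, 8)},
]

to_delete = expired_backups(backups, now)
print(to_delete)
```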

To recover from a disaster (a total non-reversible cluster malfunction), we run a backup restore to an empty standby cluster. The cluster is pre-provisioned to minimize the MTTR during an outage. Even though it adds to the overall infra footprint, the cost is amortized as the number of clusters increases. A complication with having a recovery cluster is that not all of our production clusters run with the same configuration, so the standby cluster’s configuration may need to be updated before carrying out a restore.

As of today, we use the full cluster restore without PITR as the fastest and most immediate recovery mechanism. This is because, unfortunately, the current implementation of TiDB PITR is slow (hundreds of MB/s of restore speed) compared to the roughly 10–20 GB/s we are able to achieve with a full restore. While this approach may result in temporary data loss, it is a necessary trade-off: keeping a cluster unavailable for hours or even days to complete PITR for a large-scale restoration would be unacceptable. PITR would still be applied to recover the deltas after cluster availability is restored. It is worth mentioning that significant performance improvements are expected in a future TiDB release, so this may be subject to change.
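
A back-of-the-envelope calculation shows why the throughput gap dominates the decision. The figures below are assumptions for illustration only: a hypothetical 200 TB of data to move, decimal units (1 TB = 1000 GB), 10 GB/s as the lower end of the full-restore range, and 0.3 GB/s as one point within "hundreds of MB/s". It also simplifies by pushing the same volume through both paths.

```python
def restore_hours(data_tb, throughput_gb_per_s):
    """Hours needed to move `data_tb` terabytes at the given throughput."""
    seconds = data_tb * 1000 / throughput_gb_per_s
    return seconds / 3600


data_tb = 200  # hypothetical volume to restore

full_restore_h = restore_hours(data_tb, 10)   # lower end of 10-20 GB/s
slow_path_h = restore_hours(data_tb, 0.3)     # assumed 300 MB/s

print(f"full restore: {full_restore_h:.1f} hours")
print(f"at 300 MB/s:  {slow_path_h:.1f} hours ({slow_path_h / 24:.1f} days)")
```

Under these assumptions the full restore takes a few hours while the slower path takes about a week, which matches the "hours or even days" concern above.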

Wins and Learnings

It is a huge undertaking to adopt a major technology like TiDB at our scale and migrate numerous legacy HBase use cases to it. This marks a significant step towards a more modern online systems tech stack and, despite all the challenges along the way, we have seen great success in TiDB adoption at Pinterest. Today, TiDB hosts hundreds of production datasets, powers a broad array of business-critical use cases, and continues to gain popularity. Below we share some of the key wins and lessons learned in this journey.

Wins

  • Developer velocity increases: MySQL compatibility, horizontal scalability, and strong consistency form the core value proposition of TiDB at Pinterest. This powerful combination enables storage customers to develop applications more quickly without making painful trade-offs. It also simplifies reasoning about application behaviors, leading to improved overall satisfaction and increased developer velocity.
  • System complexity reduction: The built-in functionalities of TiDB allowed us to deprecate several in-house systems built on top of HBase (e.g., Sparrow for distributed transactions and Ixia for secondary indexing). This resulted in significant reduction in system complexity and overall maintenance overhead.
  • Performance improvements: We have generally seen very impressive (~ 2–10x) p99 latency improvements when migrating use cases from HBase to TiDB. More importantly, TiDB tends to provide much more predictable performance, with fewer and smaller spikes under the same workloads.
  • Cost reduction: In our experience, HBase to TiDB migrations generally lead to around 50% infra cost saving, mainly from reducing the number of replicas from six to three. In some cases, we also achieved greater savings (up to 80%) from additional compute efficiency improvements (e.g., via query pushdown).

Learnings

  • There are a large number of factors to consider when it comes to database selection. The phased approach with initial filtering helped us shorten the evaluation duration under resource constraints. While some systems may be appealing on paper or with synthetic benchmarks, the final decision boils down to how they perform under real-world workloads.

  • While we had concerns about TiDB’s availability because it could become unavailable when losing two replicas for the same region, this has not been a real issue for us in the past two years. The vast majority of the outages we had were due to human operational errors because the in-house deployment system was not originally designed to support stateful systems, and it was error prone to manage deployment configs. This further motivated us to migrate to EKS using Infrastructure as Code (IaC).

  • Running and operating TiDB at Pinterest’s scale has brought some unique challenges that were unseen by PingCAP (some were mentioned above). Examples include:
    - TiCDC is not truly horizontally scalable and hits throughput limitations.
    - Data movement is relatively slow during backup and host decommissioning, even though system resources are under-utilized.
    - Parallel Lightning data ingestion for large datasets can be tedious and error prone.
    - TiSpark jobs can overload PD and cause performance degradation.

    Fortunately, PingCAP is working with us to provide optimizations/mitigations to these issues.

  • Lock contention is one of the major contributors to our performance issues. We needed to pay special attention to the initial schema designs and work with client teams to minimize contention. This was not a problem with HBase, which was schemaless and did not support distributed transactions with locking mechanisms.

Up Next

Thus far in this blog series we have described the reasons behind our decision to deprecate HBase and the rationale for selecting TiDB as its replacement. As previously mentioned, at Pinterest, neither HBase nor TiDB is directly exposed to clients. Instead, they are accessed through data access services that safeguard both the datastores and the clients. In the next blog post, we will delve into how we replaced the multiple service layers on top of HBase with a unified framework called Structured DataStore (SDS). SDS powers various data models and enables seamless integration of different datastores. It is not merely an online query serving framework but a comprehensive solution for providing Storage as a Platform at Pinterest. Stay tuned for more details…

Acknowledgements

HBase deprecation, TiDB adoption and SDS productionization would not have been possible without the diligent and innovative work from the Storage and Caching team engineers including Alberto Ordonez Pereira, Ankita Girish Wagh, Gabriel Raphael Garcia Montoya, Ke Chen, Liqi Yi, Mark Liu, Sangeetha Pradeep and Vivian Huang. We would like to thank cross-team partners James Fraser, Aneesh Nelavelly, Pankaj Choudhary, Zhanyong Wan, and Wenjie Zhang for their close collaboration, and all our customer teams for their support on the migration. Special thanks to our leadership Bo Liu, Chunyan Wang and David Chaiken for their guidance and sponsorship on this initiative. Last but not least, thanks to PingCAP for helping us introduce TiDB into the Pinterest tech stack, from initial prototyping to productionization at scale.
