TiDB: A Globally Scalable NewSQL Database

1. TiDB: A Globally Scalable NewSQL Database. Xiaole Fang | PingCAP User Ecosystem Group Leader
2. The Choice of History ● Why NewSQL ● How to build TiDB ● How to build TiKV & PD ● How to build the SQL layer (TiDB) ● How to use TiDB
3. A brief history of databases: 1960s-1970s hierarchical and network data models; 1970 E.F. Codd's relational model, then SQL, System R@IBM, and Ingres@UCB; 1980s-1990s Oracle, DB2, Informix, Sybase, SQL Server, PostgreSQL, MySQL, Teradata; 2000s GFS, BigTable, MapReduce, MySQL sharding, Greenplum, Vertica, HBase, Cassandra, MongoDB, Redis; 2010s Spanner / F1, Aurora, Hana, VoltDB, OceanBase, CockroachDB, TiDB / TiKV.
4. MySQL - Read/Write Splitting. Characteristics: reads and writes are split by the application; writes remain the traffic bottleneck. Cost: the application has to implement the read/write routing itself, and sharding requires keeping the SQL simple.
5. MySQL - Sharding. Characteristics: sharding supports horizontal scalability for writes. Cost: a business dimension must be chosen as the shard key; queries on other dimensions require massive data redundancy; data synchronization is needed (double write, asynchronous or synchronous); the database is no longer transparent to the application; machine resources are poorly utilized. (Diagram: APP -> JDBC -> Proxy.)
6. Sharding + Proxy Cost ● Hard for the application ● Constraints ● Cost ● No HA ● …
7. How many copies of the data do we need? The online business system feeds a data channel, which in turn feeds the ODS / data middle platform, reports, BI, and analysis; every stage adds to the total number of copies.
8. It's time! We need a new database.
9. What is NewSQL? ● Scale-out! ● HA with strong consistency ● Transparent ACID transaction support ● Full-featured SQL ○ Compatible with MySQL would be nice ● Cloud infrastructure friendly ○ Easily works with Kubernetes and Docker ● High throughput and multi-tenant ○ One cluster fits all ● Try HTAP ○ ETL is boring
10. So what should we think about? ● Compute engine ● Storage engine ● Distributed transactions ● Replication ● Data format ● What to share ● Optimizer ● … Architecture sketch: MySQL drivers (e.g. JDBC) speak the MySQL protocol to TiDB, the stateless compute layer, which talks over RPC to TiKV, a distributed key-value storage engine with transaction support, coordinated by PD.
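Because the compute layer speaks the MySQL wire protocol, any MySQL client or driver can talk to it. Below is a minimal Go sketch of that, using the standard go-sql-driver/mysql package; the address, user, and database name are illustrative (4000 is tidb-server's default port).

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql" // ordinary MySQL driver; TiDB speaks the same protocol
)

func main() {
	// The DSN is illustrative: user, host, and port depend on your deployment.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var version string
	if err := db.QueryRow("SELECT tidb_version()").Scan(&version); err != nil {
		log.Fatal(err)
	}
	fmt.Println(version)
}
```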
11. We chose a key-value store (RocksDB) ● A good start: RocksDB is fast and stable ○ LSM-trees are better for inserts ○ Atomic batch writes ○ Snapshots ● However, it is a local, embedded KV store ○ It cannot tolerate machine failures ○ Scalability is limited by the capacity of a single disk
12. Replica ● Use Raft to replicate data ○ Strong leader ○ Leader election ○ Membership change ● Implementation: ported from etcd ● Replicas are distributed across machines / racks / data centers. Diagram: the client talks to the leader's Raft module; each of the three replicas appends a = 1, b = 2 to its log and applies the entries to its state machine.
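To make the diagram concrete, here is a toy Go sketch (not the etcd-ported implementation) of the replication idea: every replica applies the same committed log, in the same order, to its local state machine, so all three copies converge to a = 1, b = 2.

```go
package main

import "fmt"

// LogEntry is a toy committed Raft log entry: one key-value assignment.
type LogEntry struct {
	Key   string
	Value int
}

// StateMachine replays committed entries in log order.
type StateMachine struct {
	data map[string]int
}

func (sm *StateMachine) Apply(e LogEntry) { sm.data[e.Key] = e.Value }

func main() {
	// The leader appends entries; once committed, every replica applies the same log.
	log := []LogEntry{{"a", 1}, {"b", 2}}

	replicas := []*StateMachine{
		{data: map[string]int{}},
		{data: map[string]int{}},
		{data: map[string]int{}},
	}
	for _, r := range replicas {
		for _, e := range log {
			r.Apply(e)
		}
	}
	fmt.Println(replicas[0].data, replicas[1].data, replicas[2].data) // all converge to map[a:1 b:2]
}
```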
13. Let's fix fault tolerance: a Raft layer on top of RocksDB, replicated across Machine 1, Machine 2, and Machine 3.
14. Region is useful ● How can we scan data? ○ How to support the API scan(startKey, endKey, limit)? ● So we need a globally ordered map ○ We can't use hash partitioning ○ We use range partitioning ● Region: continuous key-value pairs, in byte order ○ Region 1 -> [a - d] ○ Region 2 -> [e - h] ○ … ○ Region n -> [w - z] ○ Fixed size threshold: 96 MB ○ Each Region covers a half-open range [start_key, end_key) of the whole sorted key space (-∞, +∞) ● Data are stored / accessed / replicated / scheduled in units of Regions
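A toy Go sketch of how a key is routed to a Region under range partitioning; the region boundaries here are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Region covers the half-open key range [StartKey, EndKey).
type Region struct {
	ID       int
	StartKey string
	EndKey   string // "" means +infinity
}

// locate returns the region whose [StartKey, EndKey) contains key.
// Regions must be sorted by StartKey and cover the whole key space.
func locate(regions []Region, key string) Region {
	i := sort.Search(len(regions), func(i int) bool {
		return regions[i].EndKey == "" || key < regions[i].EndKey
	})
	return regions[i]
}

func main() {
	// Illustrative boundaries, in the spirit of Region 1 -> [a, e), Region 2 -> [e, n), ...
	regions := []Region{
		{ID: 1, StartKey: "", EndKey: "e"},
		{ID: 2, StartKey: "e", EndKey: "n"},
		{ID: 3, StartKey: "n", EndKey: ""},
	}
	fmt.Println(locate(regions, "banana").ID) // 1
	fmt.Println(locate(regions, "tidb").ID)   // 3
}
```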
15. Data organization within TiDB: the local RocksDB instance holds ordered key-value pairs such as t1_r1 -> v1, t1_r2 -> v2, ..., t5_r1, ..., t5_r10, t1_i1_1_1, t1_i1_2_2, ..., t1_i6_1_3, ...; consecutive key ranges form Region 1 through Region 4, all of which belong to Store 1 on this TiKV node.
16. How scale-out works inside TiDB: Region 1 is replicated on Store 1 (leader), Store 2, and Store 3 of TiKV Nodes 1-3, with a three-member PD cluster managing them. Let's say the amount of data within Region 1 exceeds the threshold (default: 96 MB).
17. How scale-out works inside TiDB: the leader of Region 1 decides, "I think I should split up Region 1."
18. How scale-out works inside TiDB: Region 1 is split into two smaller regions, Region 1 and Region 2, on every store. The leader of Region 1 sends a Split command as a special log entry to its replicas via the Raft protocol; once the Split command is committed by Raft, the region has been successfully split.
19. How scale-out works inside TiDB: an empty TiKV Node 4 (Store 4) joins the cluster. PD: "Hey, Node 1, create a new replica of Region 2 on Node 4, and transfer your leadership of Region 2 to Node 2."
20. How scale-out works inside TiDB: the new replica of Region 2 has been created on Store 4, and the leadership of Region 2 has been transferred away from Node 1; Stores 1-3 still hold their replicas of Region 1 and Region 2.
21. How scale-out works inside TiDB: PD: "OK, Node 1, delete your local replica of Region 2." Node 1 now holds only Region 1 (leader); Region 2 is replicated on Stores 2, 3, and 4, with its leader on Store 4.
22. Region split & merge ● Split: as Region 1 [a, z) grows, it is split into Region 1 [a, n) and Region 2 [n, z) ● Merge: as the data shrinks, Region 1 [a, n) and Region 2 [n, z) are merged back into Region 1 [a, z) ● Region splitting and merging affect all replicas of a region; correctness and consistency are guaranteed by Raft.
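A toy Go sketch of what split and merge do to a Region's key range; the real operation is a Split/Merge command committed through Raft on every replica, which this sketch does not model.

```go
package main

import "fmt"

// Region covers the half-open key range [StartKey, EndKey).
type Region struct {
	ID               int
	StartKey, EndKey string
}

// split divides one region into two at splitKey: [start, splitKey) and [splitKey, end).
func split(r Region, splitKey string, newID int) (Region, Region) {
	left := Region{ID: r.ID, StartKey: r.StartKey, EndKey: splitKey}
	right := Region{ID: newID, StartKey: splitKey, EndKey: r.EndKey}
	return left, right
}

// merge glues two adjacent regions back into one covering [left.Start, right.End).
func merge(left, right Region) Region {
	return Region{ID: left.ID, StartKey: left.StartKey, EndKey: right.EndKey}
}

func main() {
	r1 := Region{ID: 1, StartKey: "a", EndKey: "z"}
	left, right := split(r1, "n", 2) // Region 1 [a, n), Region 2 [n, z)
	fmt.Println(left, right)
	fmt.Println(merge(left, right)) // back to [a, z)
}
```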
23. PD as the Cluster Manager: TiKV nodes (holding Regions A, B, C) send heartbeats to PD; PD maintains the cluster info, applies the configured scheduling strategy plus any admin scheduling, and sends scheduling commands, such as region movements, back to the nodes.
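A toy Go sketch of the kind of decision PD derives from heartbeat data; real PD has many scheduling strategies, while this only shows "move a replica to the least-loaded store".

```go
package main

import "fmt"

// StoreInfo is what PD learns from TiKV heartbeats (simplified to a region count).
type StoreInfo struct {
	ID          int
	RegionCount int
}

// pickTarget returns the store with the fewest regions, a crude stand-in
// for a balance-region scheduling strategy.
func pickTarget(stores []StoreInfo) StoreInfo {
	best := stores[0]
	for _, s := range stores[1:] {
		if s.RegionCount < best.RegionCount {
			best = s
		}
	}
	return best
}

func main() {
	stores := []StoreInfo{{1, 120}, {2, 118}, {3, 121}, {4, 0}} // Store 4 just joined
	fmt.Printf("move a replica to store %d\n", pickTarget(stores).ID)
}
```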
24. Now we have a complete key-value store: clients issue RPCs to TiKV Nodes 1-4 (Stores 1-4); each of Regions 1-5 is replicated as a Raft group across three stores; a three-member Placement Driver cluster (PD 1-3) manages placement.
25. TiKV: architecture overview (logical). We have a distributed key-value database with ○ geo-replication / auto-rebalancing ○ ACID transaction support ○ horizontal scalability. The layers, top to bottom: a transactional KV API and a raw KV API (both gRPC), Transaction, MVCC, Multi-Raft, RocksDB.
26. MVCC (Multi-Version Concurrency Control) ● Each transaction sees a snapshot of the database taken at the transaction's start time; changes made by the transaction are not visible to other transactions until it commits ● Data is tagged with versions ○ key_version: value ● Lock-free snapshot reads
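A toy Go sketch of the key_version: value idea: each commit appends a new version, and a snapshot read returns the newest version whose commit timestamp is not later than the read timestamp.

```go
package main

import "fmt"

// Version is one committed value of a key, tagged with its commit timestamp.
type Version struct {
	CommitTS uint64
	Value    string
}

// mvccGet returns the newest value committed at or before readTS,
// i.e. the value visible to a snapshot taken at readTS.
func mvccGet(versions []Version, readTS uint64) (string, bool) {
	for i := len(versions) - 1; i >= 0; i-- { // versions kept in commit order
		if versions[i].CommitTS <= readTS {
			return versions[i].Value, true
		}
	}
	return "", false
}

func main() {
	history := []Version{{CommitTS: 10, Value: "v1"}, {CommitTS: 20, Value: "v2"}}
	fmt.Println(mvccGet(history, 15)) // v1 true: a snapshot at ts=15 does not see the ts=20 write
	fmt.Println(mvccGet(history, 25)) // v2 true
}
```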
27. Transactions in TiKV • Nothing fancy: two-phase commit (2PC) • Almost decentralized; the only centralized component is the timestamp allocator • Based on Google's Percolator paper • Optimistic transaction model • Default isolation level: Snapshot Isolation • A Read Committed isolation level is also available
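A toy Go sketch of the two phases (prewrite, then commit), with timestamps coming from a central allocator as in TiDB's TSO; the real Percolator details (primary lock, lock resolution, write-column layout) are omitted.

```go
package main

import "fmt"

// tso is a stand-in for the centralized timestamp allocator.
var tso uint64

func nextTS() uint64 { tso++; return tso }

// Mutation is one key the transaction wants to write.
type Mutation struct{ Key, Value string }

// prewrite locks every key at startTS; it fails if any key is already locked.
func prewrite(locks map[string]uint64, muts []Mutation, startTS uint64) bool {
	for _, m := range muts {
		if _, locked := locks[m.Key]; locked {
			return false // write conflict: abort and let the client retry
		}
	}
	for _, m := range muts {
		locks[m.Key] = startTS
	}
	return true
}

// commit makes the writes visible at commitTS and releases the locks.
func commit(store map[string]string, locks map[string]uint64, muts []Mutation, commitTS uint64) {
	for _, m := range muts {
		store[m.Key] = fmt.Sprintf("%s@%d", m.Value, commitTS)
		delete(locks, m.Key)
	}
}

func main() {
	store, locks := map[string]string{}, map[string]uint64{}
	muts := []Mutation{{"a", "1"}, {"b", "2"}}

	startTS := nextTS()
	if prewrite(locks, muts, startTS) { // phase 1
		commit(store, locks, muts, nextTS()) // phase 2
	}
	fmt.Println(store) // map[a:1@2 b:2@2]
}
```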
28. Now let’s talk about SQL
29. What if we support SQL? A table (Pkey, Name, Age, Email) with rows (1, Edward, 29, h@pingcap.com) and (2, Tom, 38, tom@pingcap.com) is mapped by tidb-server into key-value pairs held by TiKV in dictionary (byte) order: t1_r1 -> {col1: v1, col2: v2, col3: v3, ...}, t1_r2 -> {col1: vv1, col2: vv2, col3: vv3, ...}, ..., t2_r1 -> {col1: v1, col2: v2, ...}, ...
30. Secondary index: the data table maps the primary key to the row, e.g. tbl/user/1 -> Edward,h@pingcap.com and tbl/user/2 -> Tom,tom@pingcap.com, while each secondary index maps an index value back to the primary key: idx/user/name/Edward -> 1 and idx/user/name/Tom -> 2 for the Name index; idx/user/mail/h@pingcap.com -> 1 and idx/user/mail/tom@pingcap.com -> 2 for the Email index.
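A toy Go sketch of this mapping, following the t{tableID}_r{rowID} and t{tableID}_i{indexID}_{value} layout from the earlier data-organization slide; the keys here are readable strings, not TiDB's real memcomparable binary encoding.

```go
package main

import "fmt"

// rowKey builds the key a table row is stored under, following the
// t{tableID}_r{rowID} layout from the slides.
func rowKey(tableID, rowID int64) string {
	return fmt.Sprintf("t%d_r%d", tableID, rowID)
}

// indexKey builds the key of a secondary-index entry, following the
// t{tableID}_i{indexID}_{indexValue} layout; its value is the row ID.
func indexKey(tableID, indexID int64, indexValue string) string {
	return fmt.Sprintf("t%d_i%d_%s", tableID, indexID, indexValue)
}

func main() {
	kv := map[string]string{}

	// A row of the user table (table ID 1) plus one entry of a secondary
	// index (index ID 1) on the name column.
	kv[rowKey(1, 1)] = "Edward,h@pingcap.com"
	kv[indexKey(1, 1, "Edward")] = "1"

	// Point query through the index: index entry -> row ID -> row.
	rowID := kv[indexKey(1, 1, "Edward")]
	fmt.Println(kv["t1_r"+rowID]) // Edward,h@pingcap.com
}
```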
31. Schema change in a distributed RDBMS? ● A must-have feature, especially for large tables (billions of rows)! ● But you don't want to lock the whole table while changing the schema ○ A distributed database usually stores tons of data spanning multiple machines ● We need a non-blocking schema change algorithm ● Thanks, F1, again ○ Similar to "Online, Asynchronous Schema Change in F1", VLDB 2013, Google
32. Predicate pushdown: for the predicate age > 20 and age < 30, the TiDB server knows that Regions 1, 2, and 5 store the data of the person table, so it pushes the predicate down to the coprocessor of each of those regions on TiKV Nodes 1-3, and each coprocessor evaluates it locally.
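A toy Go sketch of the pushdown idea: the predicate age > 20 and age < 30 is evaluated inside each region's coprocessor, and TiDB only merges the already-filtered partial results (the data and region layout are made up).

```go
package main

import "fmt"

// Row is a simplified row of the person table.
type Row struct {
	Name string
	Age  int
}

// coprocessorScan runs inside one TiKV region: it applies the pushed-down
// predicate locally and returns only matching rows.
func coprocessorScan(regionRows []Row, pred func(Row) bool) []Row {
	var out []Row
	for _, r := range regionRows {
		if pred(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// Rows of the person table spread across three regions (illustrative data).
	regions := [][]Row{
		{{"Ann", 25}, {"Bob", 41}},
		{{"Cid", 29}, {"Dan", 17}},
		{{"Eve", 22}},
	}
	pred := func(r Row) bool { return r.Age > 20 && r.Age < 30 } // age > 20 AND age < 30

	// TiDB merges the partial results returned by each region's coprocessor.
	var result []Row
	for _, region := range regions {
		result = append(result, coprocessorScan(region, pred)...)
	}
	fmt.Println(result) // [{Ann 25} {Cid 29} {Eve 22}]
}
```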
33. Now let's talk about the SQL layer.
34. TiDB (tidb-server) ● Stateless SQL layer ○ Clients can connect to any tidb-server instance ● Full-featured SQL layer ○ Speaks the MySQL wire protocol ○ Cost-based optimizer (CBO) ○ Secondary index support ○ Online DDL. Planning pipeline: SQL -> AST -> logical plan -> optimized logical plan -> (cost model + statistics) -> selected physical plan, executed against the TiKV cluster.
35. Parallel operators. SELECT t.c2, t1.c2 FROM t JOIN t1 ON t.c = t1.c WHERE t1.c1 > 10; The plan is a Projection over a Join: one data source is a table scan of t, the other an index scan of t1 (idx1) with the filter t1.c1 > 10; scan workers read both sides through the coprocessor workers of the TiKV cluster, and several join workers execute the join operator in parallel.
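A toy Go sketch of that plan shape: the filtered build side (t1.c1 > 10) is hashed once on the join key, then several join workers probe it concurrently with rows of t; the data is made up and the coprocessor layer is not modeled.

```go
package main

import (
	"fmt"
	"sync"
)

// row is a simplified row of t / t1: C is the join key, C1 the filter column, C2 the projected column.
type row struct {
	C  int
	C1 int
	C2 string
}

func main() {
	t := []row{{C: 1, C2: "t-a"}, {C: 2, C2: "t-b"}, {C: 3, C2: "t-c"}}
	t1 := []row{{C: 1, C1: 15, C2: "t1-x"}, {C: 2, C1: 5, C2: "t1-y"}, {C: 3, C1: 30, C2: "t1-z"}}

	// Build side: apply the filter t1.c1 > 10, then hash t1 on the join key.
	build := map[int]row{}
	for _, r := range t1 {
		if r.C1 > 10 {
			build[r.C] = r
		}
	}

	chunks := make(chan row)
	results := make(chan [2]string)

	// Several join workers probe the hash table concurrently, as in the slide.
	var wg sync.WaitGroup
	for w := 0; w < 3; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range chunks {
				if m, ok := build[r.C]; ok {
					results <- [2]string{r.C2, m.C2} // projection: t.c2, t1.c2
				}
			}
		}()
	}
	go func() { wg.Wait(); close(results) }()

	// Scan workers would normally feed rows of t here.
	go func() {
		for _, r := range t {
			chunks <- r
		}
		close(chunks)
	}()

	for pair := range results {
		fmt.Println(pair)
	}
}
```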
36. Architecture of TiDB
37. General Architecture: MySQL clients connect, optionally through a load balancer, via the MySQL protocol to stateless TiDB servers (the SQL layer), which expose KV and DistSQL APIs toward storage. TiDB obtains TSO timestamps and data locations from the PD cluster, which holds the metadata and schedules the storage layer. The pluggable storage engine (e.g. TiKV) provides the KV API and coprocessor on top of transaction, MVCC, RawKV, Raft KV, and RocksDB layers.
38. TiSpark and TiFlash: a Spark cluster with TiSpark workers, together with TiFlash Nodes 1-2, forms the TiFlash extension cluster alongside the TiDB servers; the TiKV cluster consists of TiKV Nodes 1-3 (Stores 1-3), each holding replicas of Regions 1-4.
39. HTAP: MySQL clients talk to the TiDB cluster (TiDB servers exposing the DistSQL API), while the Spark driver and workers form a Spark cluster running TiSpark; both read from the TiKV cluster (storage), with TiFlash columnar replicas alongside the TiKV workers. PD provides TSO, data locations, and metadata, and Syncer feeds data in.
40. So we can cover the whole spectrum from OLTP to OLAP: indexes, auto region split, and highly concurrent writes on the OLTP side; CBO and hash joins in the middle; and Spark on TiKV, MPP, and a column store plus Spark for complex queries, complex aggregation and analysis queries, and big-table joins on the OLAP side.
41. Ecosystem Tools ● Lightning ○ Fast offline data importing, MySQL => TiDB ● DM (Data Migration) ○ Data migration tool that makes TiDB a slave replica of a MySQL/MariaDB master: a DM master coordinates DM workers that pull binlogs from the upstream database into the TiDB/TiKV cluster ● TiDB Binlog ○ Change data capture (CDC) streaming: Pumps and a Drainer sync all changes within TiDB to downstream systems such as Kafka or MySQL ● All open-sourced!
42. TiDB-DM: TiDB Data Migration Tool. Several upstream MySQL masters are each tailed by a syncer, and all of the replicated changes converge into a single downstream slave cluster (TiDB).
43. Cloud TiDB: TiDB Operator runs on Kubernetes (API server, scheduler, controller manager) and contains the TiDB cluster controller plus the PD, TiKV, and TiDB controllers, a DaemonSet-based volume manager, and a GC controller; the TiDB Scheduler extends the Kube scheduler with a scheduler extender; the TiDB Cloud Manager exposes a RESTful interface with an external service manager and a load balancer manager, serving database service and database analysis business tenants (DBaaS / DBA) on the container cluster.
44. TiDB Application Scenarios ● Massive-data, high-concurrency OLTP systems ● Financial-grade multi-data-center deployments ● High-throughput, multi-source data convergence and real-time computing ● Real-time data warehouse ● High-performance HTAP ● DBaaS
45. Selected public TiDB user case studies (Oct 2019), across Internet, finance, manufacturing, and other large enterprises: ● 【美团点评】http://t.cn/EAFCqhl ● 【今日头条】http://t.cnEMf ● 【知乎】https://dwz.cn/er7iwbIz ● 【小红书】http://1t.click/M9v ● 【一面数据】http://t.cn/RT9r5di ● 【猿辅导】http://t.cn/RTKnKSX ● 【转转(一)】http://t.cn/R1MAXEq ● 【转转(二)】http://t.cn/EfuJ4vi ● 【爱奇艺】http://t.cn/EvErsc1 ● 【易果生鲜】http://t.cn/RTYVhzH ● 【凤凰网】http://t.cn/RHRQfNT ● 【Mobikok】http://t.cn/Rm1F6lg ● 【摩拜单车(一)】http://t.c/RnLfn/RT8FbP6 ● 【摩拜单车(二)】http://t.cn/EVjPYgj ● 【二维火】http://t.cn/R8bXM2f ● 【客如云】http://t.cn/R1wSEJH ● 【小米科技】http://t.cn/Ey2xCDK ● 【零氪科技】http://t.cn/REj7tSv ● 【同程旅游(一)】http://t.cn/RmXeNKR ● 【同程旅游(二)】http://t.cn/EAmsF08 ● 【去哪儿】http://t.cn/RTKnsL7 ● 【Shopee】http://t.cn/EbvJYl4 ● 【G7】http://t.cn/RQVePoX ● 【乐视云】http://t.cn/Rnv3IVs ● 【火星文化】http://t.cn/EAuvfcs ● 【北京银行(一)】http://t.cn/RnY8fGn ● 【北京银行(二)】http://t.cn/EXRSEmb ● 【微众银行(一)】http://1t.click/M8T ● 【微众银行(二)】http://1t.click/M94 ● 【360金融】http://t.cn/RTKnTev ● 【量化派】http://t.cn/EUZ8Q3o ● 【中国电信翼支付】http://t.cn/R3Wd9p3 ● 【华泰证券(一)】http://t.cn/RTKmUI9 ● 【平安科技】https://dwz.cn/G17tL7NK ● 【上海证券交易所】http://t.cn/EfuJs4D ● 【Ping++】http://t.cn/RE5xYKn ● 【华泰证券(二)】http://1t.click/M9c ● 【贝壳金服】http://t.cn/EaXfV6h ● 【特来电】http://t.cn/RrHzUGW ● 【丰巢科技】http://t.cn/EAuvLIv ● 【盖娅互娱】http://t.cn/RT9r7hx ● 【威锐达测控】http://t.cn/R3CrviR ● 【西山居】http://t.cn/RBP12zj ● 【游族网络】http://t.cn/R8k4AWB ● 【海航】http://t.cn/REXx0Qe ● 【某电信运营商 (a telecom carrier)】http://t.cn/RTYWADg ● 【万达网络】http://t.cn/RTKm6ds ● 【FUNYOURS JAPAN】http://t.cn/Rnoab5D
46. Thanks. PingCAP home page: https://www.pingcap.com/ Official publicity film: https://www.pingcap.com/about-cn/ Project: https://github.com/pingcap/tidb Docs-cn: https://github.com/pingcap/docs-cn Contact: info@pingcap.com AskTUG Q&A community: questions answered by the professional team and TiDB users, regular online and offline events, high-quality technical material shared.
