TiDB: A Globally Scalable NewSQL Database
1. TiDB: A Globally Scalable NewSQL Database
Xiaole Fang | PingCAP
User Ecosystem Group Leader
2. The Choice of History
● Why NewSQL
● How to build TiDB
● How to build TiKV & PD
● How to build the SQL layer (TiDB)
● How to use TiDB
3. A brief history of databases (timeline, 1960s-2020)
(Timeline: hierarchical and network data models in the 1960s; E.F. Codd's relational model, System R @ IBM and Ingres @ UCB around 1970; Informix, Teradata, Oracle, DB2 and Sybase in the 1980s; SQL Server, PostgreSQL and MySQL in the 1990s; GFS, BigTable, MapReduce, MySQL sharding, Greenplum, Vertica, HBase, Cassandra, MongoDB, Redis, Aurora and Hana in the 2000s; Spanner / F1, VoltDB, OceanBase, CockroachDB and TiDB / TiKV in the 2010s.)
4. MySQL - Read/Write Splitting
Characteristics:
● Read/write splitting (routing handled by the application)
● Sharding
● Write traffic is still a bottleneck
Cost:
● The application has to implement the read/write routing itself
● Sharding requires keeping SQL simple
5. MySQL - Sharding
Characteristics:
● Sharding
● Supports horizontal scalability for writes
Cost:
● A business dimension (sharding key) has to be chosen
● Multiple dimensions lead to massive data redundancy
● Data synchronization (double write, asynchronous and synchronous)
● No longer a transparent database for the application
● Low utilization of machine resources
(Diagram: APP -> JDBC -> Proxy -> sharded MySQL)
6. Sharding + Proxy Cost
● Hard for the application
● Constraints
● Cost
● No HA
● ...
7. How many copies do we need?
(Diagram: the online business system feeds a data channel into ODS / middle ground and report / BI / analysis systems, so the total number of copies keeps multiplying.)
8. It's time!
We need a new database
9. What is NewSQL?
● Scale-out!
● HA with strong consistency
● Transparent ACID transaction support
● Full-featured SQL
  ○ Compatible with MySQL would be nice
● Cloud infrastructure friendly
  ○ Easily works with K8S and Docker
● High throughput and multi-tenant
  ○ One cluster fits all
● Try HTAP
  ○ ETL is boring
10. So what should we think about?
● Computing engine
● Storage engine
● Distributed transactions
● Replication
● Data format
● What to share
● Optimizer
● ...
(Architecture: MySQL Drivers (e.g. JDBC) -> MySQL Protocol -> TiDB, a stateless compute layer -> RPC -> TiKV, a distributed, transactional key-value storage engine, managed by PD.)
11. We choose a key-value store (RocksDB)
● Good start: RocksDB is fast and stable.
  ○ LSM-tree is better for inserts
  ○ Atomic batch writes
  ○ Snapshots
● However... it's a local, embedded KV store.
  ○ Can't tolerate machine failures
  ○ Scalability is limited by the capacity of a single disk
12. Replica
● Use Raft to replicate data
● Strong leader
● Leader election
● Membership change
● Implementation:
  ○ Ported from etcd
● Replicas are distributed across machines/racks/data-centers
(Diagram: a client sends writes to the leader replica; each replica consists of a Raft module, a log, and a state machine holding a = 1, b = 2. Log entries are replicated through Raft and then applied to each state machine, as sketched below.)
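The following is a minimal, illustrative Go sketch of the "log -> state machine" idea in the diagram, not TiKV's actual code: every replica applies the same committed entries in the same order, so their key-value states converge.

```go
package main

import "fmt"

// LogEntry is one replicated command, e.g. "set a = 1".
type LogEntry struct {
	Key   string
	Value string
}

// StateMachine is the key-value state that every replica rebuilds
// by applying the same log in the same order.
type StateMachine struct {
	data map[string]string
}

func NewStateMachine() *StateMachine {
	return &StateMachine{data: make(map[string]string)}
}

// Apply consumes a committed log entry.
func (sm *StateMachine) Apply(e LogEntry) {
	sm.data[e.Key] = e.Value
}

func main() {
	log := []LogEntry{{"a", "1"}, {"b", "2"}}

	// Three replicas applying the same committed log.
	replicas := []*StateMachine{NewStateMachine(), NewStateMachine(), NewStateMachine()}
	for _, r := range replicas {
		for _, e := range log {
			r.Apply(e)
		}
	}
	fmt.Println(replicas[0].data, replicas[1].data, replicas[2].data)
}
```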
13. Let's fix fault tolerance
(Diagram: three machines, each running a RocksDB instance, connected by Raft replication.)
14. Region is useful
● How can we scan data?
  ○ How to support the API: scan(startKey, endKey, limit)
● So, we need a globally ordered map
● Region: a continuous range of key-value pairs, in byte order
  ○ Can't use hash partitioning
  ○ Use range partitioning
    ■ Region 1 -> [a - d]
    ■ Region 2 -> [e - h]
    ■ ...
    ■ Region n -> [w - z]
  ○ Fixed size: 96 MB; each Region covers a range [start_key, end_key) of the TiKV key space (-∞, +∞), which forms one sorted map
● Data is stored/accessed/replicated/scheduled in units of Regions (see the sketch below)
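A small Go sketch of the range-partitioning idea, assuming a sorted list of Regions that covers the whole key space; the Region type and locate function here are illustrative, not TiKV's real API.

```go
package main

import (
	"fmt"
	"sort"
)

// Region describes one contiguous key range [StartKey, EndKey).
type Region struct {
	ID       int
	StartKey string
	EndKey   string // "" means +infinity
}

// locate finds the Region whose range contains key, assuming the
// slice is sorted by StartKey and the last Region is unbounded.
func locate(regions []Region, key string) Region {
	i := sort.Search(len(regions), func(i int) bool {
		return regions[i].EndKey == "" || key < regions[i].EndKey
	})
	return regions[i]
}

func main() {
	regions := []Region{
		{1, "a", "e"}, // Region 1 -> [a, e)
		{2, "e", "i"}, // Region 2 -> [e, i)
		{3, "i", ""},  // Region 3 -> [i, +inf)
	}
	fmt.Println(locate(regions, "banana").ID) // 1
	fmt.Println(locate(regions, "grape").ID)  // 2
	// A scan(startKey, endKey, limit) simply walks consecutive Regions.
}
```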
15. Data organization within TiDB
(Diagram: a TiKV node hosts Store 1, backed by a local RocksDB instance, holding Regions 1-4. Row keys such as t1_r1 -> v1, t1_r2 -> v2, t5_r1, t5_r10 and index keys such as t1_i1_1_1, t1_i1_2_2, t1_i6_1_3 are laid out in key order and partitioned into those Regions.)
16. How scale-out works inside TiDB
(Diagram: three TiKV nodes, Store 1/2/3, each holding a replica of Region 1; the leader replica is marked with *. A three-node PD cluster oversees them.)
Let's say the amount of data within Region 1 exceeds the threshold (default: 96 MB).
17. How scale-out works inside TiDB
("I think I should split up Region 1.")
(Diagram: same layout as before; Region 1 is replicated across Store 1/2/3 with its leader marked *, and the PD cluster is watching.)
Let's say the amount of data within Region 1 exceeds the threshold (default: 96 MB).
18. How scale-out works inside TiDB
(Diagram: each of Store 1/2/3 now holds a replica of Region 1 and of Region 2; the leader replicas are marked with *.)
Region 1 is split into two smaller Regions: the leader of Region 1 sends a Split command as a special log entry to its replicas via the Raft protocol. Once the Split command is committed by Raft, the Region has been successfully split.
19. How scale-out works inside TiDB
(Diagram: a fourth TiKV node, Store 4, joins the cluster; Stores 1-3 still hold Regions 1 and 2.)
PD: "Hey, Node 1, create a new replica of Region 2 on Node 4, and transfer your leadership of Region 2 to Node 2."
20. How scale-out works inside TiDB
(Diagram: Region 2 now has a replica on Store 4 and its leadership has been transferred to Node 2; Node 1 still holds its old replica of Region 2.)
21. How scale-out works inside TiDB
(Diagram: Node 1's extra replica of Region 2 is gone; Region 1 keeps three replicas on Stores 1-3 and Region 2 keeps three replicas on Stores 2-4, with leaders marked *.)
PD: "OK, Node 1, delete your local replica of Region 2."
22. Region split & merge
Split: when Region 1 [a-z) grows too large, it is split into Region 1 [a-n) and Region 2 [n-z).
Merge: when Region 1 [a-n) and Region 2 [n-z) shrink, they are merged back into Region 1 [a-z).
Region splitting and merging affect all replicas of a Region; correctness and consistency are guaranteed by Raft (see the sketch below).
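A rough Go sketch of the key-range bookkeeping behind split and merge. In TiKV the actual operation is a Raft-replicated admin command applied on every replica; the types and functions below are only illustrative.

```go
package main

import "fmt"

// RegionRange is just the key interval a Region covers.
type RegionRange struct {
	ID       int
	StartKey string
	EndKey   string
}

// split cuts one range into two at splitKey.
func split(r RegionRange, splitKey string, newID int) (RegionRange, RegionRange) {
	left := RegionRange{r.ID, r.StartKey, splitKey}
	right := RegionRange{newID, splitKey, r.EndKey}
	return left, right
}

// merge joins two adjacent ranges back into one.
func merge(left, right RegionRange) RegionRange {
	return RegionRange{left.ID, left.StartKey, right.EndKey}
}

func main() {
	r1 := RegionRange{1, "a", "z"}
	left, right := split(r1, "n", 2)
	fmt.Println(left, right)        // {1 a n} {2 n z}
	fmt.Println(merge(left, right)) // {1 a z}
}
```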
23. PD as the Cluster Manager
(Diagram: TiKV nodes send heartbeats carrying cluster info to PD: Node 1 reports Regions A and B, Node 2 reports Region C. PD replies with scheduling commands such as Region movement, following the scheduling strategy configured by the admin.)
24. Now we have a complete key-value store
(Diagram: clients issue RPCs to four TiKV nodes (Store 1-4), each hosting several Regions (Regions 1-5); the replicas of each Region form a Raft group spanning the stores. A three-node Placement Driver cluster (PD 1-3) manages placement.)
25. TiKV: Architecture overview (logical)
We have a distributed key-value database with:
○ Geo-replication / auto-rebalance
○ ACID transaction support
○ Horizontal scalability
(Layered stack, top to bottom: API (gRPC) exposing a Transactional KV API and a Raw KV API; Transaction; MVCC; Multi-Raft (gRPC); RocksDB.)
26. MVCC (Multi-Version Concurrency Control)
● Each transaction sees a snapshot of the database taken at the time the transaction started; changes made by the transaction are not visible to other transactions until it commits
● Data is tagged with versions
  ○ key_version: value
● Lock-free snapshot reads (see the sketch below)
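A minimal Go sketch of the "key_version: value" idea, assuming an in-memory map rather than TiKV's real encoding: a reader at a given snapshot timestamp picks the newest version committed at or before that timestamp, so readers never block writers.

```go
package main

import "fmt"

// version is one committed value of a user key.
type version struct {
	ts    uint64 // commit timestamp
	value string
}

// mvccGet returns the newest version of key whose commit timestamp
// is <= the reader's snapshot timestamp.
func mvccGet(store map[string][]version, key string, snapshotTS uint64) (string, bool) {
	best := version{}
	found := false
	for _, v := range store[key] {
		if v.ts <= snapshotTS && (!found || v.ts > best.ts) {
			best, found = v, true
		}
	}
	return best.value, found
}

func main() {
	store := map[string][]version{
		"a": {{ts: 10, value: "v1"}, {ts: 20, value: "v2"}},
	}
	// A transaction that started at ts=15 keeps seeing "v1",
	// even after another transaction committed "v2" at ts=20.
	fmt.Println(mvccGet(store, "a", 15)) // v1 true
	fmt.Println(mvccGet(store, "a", 25)) // v2 true
}
```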
27. Transaction in TiKV
• Nothing fancy: 2-phase commit (2PC)
• Almost decentralized
  • Timestamp allocator (PD)
• Based on the Google Percolator paper
• Optimistic transaction model
• Default isolation level: Snapshot Isolation
• Also has a Read Committed isolation level
(A simplified 2PC sketch follows.)
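A highly simplified, single-process Go sketch of Percolator-style 2PC: prewrite locks every key of the transaction, and commit then publishes the writes at a commit timestamp. Real TiKV does this across Regions with timestamps allocated by PD; all names here are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// row holds a lock marker and the committed versions of one key.
type row struct {
	lockedBy uint64            // 0 means unlocked; otherwise the startTS holding the lock
	versions map[uint64]string // commitTS -> value
}

type store map[string]*row

// prewrite (phase 1): check for conflicts, then lock all keys.
func (s store) prewrite(startTS uint64, writes map[string]string) error {
	for k := range writes {
		if s[k] == nil {
			s[k] = &row{versions: map[uint64]string{}}
		}
		if s[k].lockedBy != 0 {
			return errors.New("write conflict: " + k + " is locked")
		}
	}
	for k := range writes {
		s[k].lockedBy = startTS
	}
	return nil
}

// commit (phase 2): publish the values and release the locks.
func (s store) commit(startTS, commitTS uint64, writes map[string]string) {
	for k, v := range writes {
		if s[k].lockedBy == startTS {
			s[k].versions[commitTS] = v
			s[k].lockedBy = 0
		}
	}
}

func main() {
	s := store{}
	writes := map[string]string{"a": "1", "b": "2"}
	if err := s.prewrite(10, writes); err == nil {
		s.commit(10, 11, writes)
	}
	fmt.Println(s["a"].versions, s["b"].versions)
}
```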
28. Now let’s talk about SQL
29. What if we support SQL?
Tables (rows with columns), e.g. a user table:
Pkey | Name   | Age | Email
1    | Edward | 29  | h@pingcap.com
2    | Tom    | 38  | tom@pingcap.com
Rows become key-value pairs within TiKV, ordered by key (dictionary order):
t1_r1 -> {col1: v1, col2: v2, col3: v3, ...}
t1_r2 -> {col1: vv1, col2: vv2, col3: vv3, ...}
...
t2_r1 -> {col1: v1, col2: v2, ...}
(Diagram: tidb-server maps table rows onto TiKV's ordered key space; see the sketch below.)
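A readable Go approximation of the row-to-KV mapping above. TiDB's real codec is a binary, memory-comparable encoding; the string format here only mirrors the t<tableID>_r<rowID> layout for illustration.

```go
package main

import "fmt"

// rowKey builds an illustrative key in the t<tableID>_r<rowID> layout.
func rowKey(tableID, rowID int64) string {
	return fmt.Sprintf("t%d_r%d", tableID, rowID)
}

func main() {
	// INSERT INTO user VALUES (1, 'Edward', 29, 'h@pingcap.com'), ...
	kv := map[string]string{
		rowKey(1, 1): `{"name":"Edward","age":29,"email":"h@pingcap.com"}`,
		rowKey(1, 2): `{"name":"Tom","age":38,"email":"tom@pingcap.com"}`,
	}
	// Because all keys of one table share the t<tableID>_r prefix and
	// sort by rowID, a full table scan is a single range scan in TiKV.
	for k, v := range kv {
		fmt.Println(k, "->", v)
	}
}
```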
30. Secondary index
Data table:
Pkey | Name   | Email
1    | Edward | h@pingcap.com
2    | Tom    | tom@pingcap.com
Index: Name
Index Value | Pkey
Edward      | 1
Tom         | 2
Index: Email
Index Value     | Pkey
h@pingcap.com   | 1
tom@pingcap.com | 2
As key-value pairs:
tbl/user/1 -> Edward,h@pingcap.com
tbl/user/2 -> Tom,tom@pingcap.com
idx/user/name/Edward -> 1
idx/user/name/Tom -> 2
idx/user/mail/h@pingcap.com -> 1
idx/user/mail/tom@pingcap.com -> 2
(See the lookup sketch below.)
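A small Go sketch of how a secondary-index lookup becomes two KV reads: the index key yields the primary key, and the primary key yields the row. The key formats follow the slide, not TiDB's binary codec.

```go
package main

import "fmt"

func main() {
	// The key-value pairs from the slide, held in a plain map.
	kv := map[string]string{
		"tbl/user/1":                    "Edward,h@pingcap.com",
		"tbl/user/2":                    "Tom,tom@pingcap.com",
		"idx/user/name/Edward":          "1",
		"idx/user/name/Tom":             "2",
		"idx/user/mail/h@pingcap.com":   "1",
		"idx/user/mail/tom@pingcap.com": "2",
	}

	// SELECT * FROM user WHERE name = 'Tom'
	pkey := kv["idx/user/name/Tom"] // step 1: index read -> primary key
	row := kv["tbl/user/"+pkey]     // step 2: point read on the data table
	fmt.Println(row)                // Tom,tom@pingcap.com
}
```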
31. Schema change in a distributed RDBMS?
● A must-have feature, especially for large tables (billions of rows)!
● But you don't want to lock the whole table while changing the schema.
  ○ A distributed database usually stores tons of data spanning multiple machines
● We need a non-blocking schema change algorithm
● Thanks, F1, again
  ○ Similar to "Online, Asynchronous Schema Change in F1" - VLDB 2013, Google
32. Predicate pushdown
(Diagram: the TiDB server knows that Regions 1, 2 and 5 store the data of the person table, so it pushes the predicate "age > 20 and age < 30" down to the Coprocessor on each of TiKV Node 1/2/3 hosting one of those Regions; each Coprocessor filters its Region locally and returns only the matching rows. A sketch follows.)
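A toy Go sketch of predicate pushdown, assuming an in-memory slice per Region: each "coprocessor" evaluates the pushed-down predicate locally and only matching rows travel back to the SQL layer. Types and names are illustrative, not TiKV's API.

```go
package main

import (
	"fmt"
	"sync"
)

type person struct {
	Name string
	Age  int
}

// coprocessorScan runs the pushed-down predicate inside one Region.
func coprocessorScan(region []person, pred func(person) bool) []person {
	var out []person
	for _, p := range region {
		if pred(p) {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	regions := [][]person{
		{{"Edward", 29}, {"Tom", 38}}, // Region 1
		{{"Ann", 25}},                 // Region 2
		{{"Bob", 19}, {"Carol", 22}},  // Region 5
	}
	pred := func(p person) bool { return p.Age > 20 && p.Age < 30 }

	var mu sync.Mutex
	var result []person
	var wg sync.WaitGroup
	for _, r := range regions {
		wg.Add(1)
		go func(r []person) { // Regions are scanned in parallel
			defer wg.Done()
			rows := coprocessorScan(r, pred)
			mu.Lock()
			result = append(result, rows...)
			mu.Unlock()
		}(r)
	}
	wg.Wait()
	fmt.Println(result) // only rows with 20 < age < 30
}
```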
33. Now let's talk about the SQL layer.
34. TiDB (tidb-server)
● Stateless SQL layer
  ○ Clients can connect to any existing tidb-server instance
● Full-featured SQL layer
  ○ Speaks the MySQL wire protocol
  ○ Cost-based optimizer (CBO)
  ○ Secondary index support
  ○ Online DDL
(Diagram: inside tidb-server, SQL is parsed into an AST, turned into a logical plan, optimized into an optimized logical plan, and then a physical plan is selected using the cost model and statistics; the selected plan is executed against the TiKV cluster.)
35. Parallel operators
SELECT t.c2, t1.c2 FROM t JOIN t1 ON t.c = t1.c WHERE t1.c1 > 10;
(Plan: a Projection over a Join; DataSource t is read via TableScan and DataSource t1 via IndexScan on idx1 with the filter t1.c1 > 10 pushed down. Data readers feed multiple scan workers and join workers, all running in parallel on top of the TiKV cluster's coprocessor workers.)
36. Architecture of TiDB
37. General Architecture
(Diagram: MySQL clients connect, optionally through a load balancer, to stateless TiDB servers speaking the MySQL protocol; each TiDB server contains the SQL layer plus the KV and DistSQL APIs. The PD cluster holds metadata, serves TSO / data-location requests, and schedules TiKV. TiKV exposes the KV API and Coprocessor on top of transactions, MVCC, RawKV / Raft KV and RocksDB as a pluggable storage engine.)
38. (Diagram: a Spark cluster with TiSpark workers and a TiFlash extension cluster (TiFlash Node 1/2) sit alongside the TiDB servers and the TiKV cluster: three TiKV nodes (Store 1/2/3), each holding replicas of Regions 1-4.)
39. HTAP
(Diagram: MySQL clients query the TiDB cluster (TiDB servers using the DistSQL API), while a Spark cluster (Spark driver plus TiSpark workers) runs jobs over the same data; both obtain TSO / data location and metadata from the PD cluster. The TiKV cluster provides the storage, with TiFlash columnar replicas ("flash") attached, and Syncer feeds changes in from upstream.)
40. So we can...
(Diagram: OLTP and OLAP workloads on one cluster. OLTP side: high-concurrency writes, index lookups, automatic Region splitting. OLAP side: complex queries, complex aggregation and analysis, big-table joins, served by Spark on TiKV, MPP, CBO, hash joins and a column store + Spark.)
41. Ecosystem Tools
● Lightning
  ○ Fast offline data importing, MySQL => TiDB
● DM (Data Migration)
  ○ Data migration tool that makes TiDB a slave replica of a MySQL/MariaDB master (a DM Master coordinates DM Workers that sync the upstream database into the TiKV cluster)
● TiDB Binlog
  ○ Change data capture (CDC) streaming: syncs all changes within TiDB to downstream applications such as Kafka/MySQL/... (Pumps feed a Drainer that writes to the downstream database)
● All open-sourced!
42. TiDB-DM: TiDB Data Migration Tool
(Diagram: several MySQL masters, each followed by a syncer, replicate into the slave cluster.)
43. Cloud TiDB
(Diagram: on top of the Kubernetes core (API server, scheduler, controller manager) run the TiDB Operator (TiDB Cluster Controller with PD/TiKV/TiDB controllers, volume manager and GC controller), the TiDB Scheduler (kube-scheduler plus a scheduler extender), Deployments and DaemonSets, and the TiDB Cloud Manager (RESTful interface, external service manager, load balancer manager). Together they serve database-service and database-analysis business tenants as well as DBaaS/DBA users on a container cluster.)
44. TiDB Application Scenarios
● Massive-data, high-concurrency OLTP systems
● Multiple data centers with financial-grade requirements
● Multi-source, high-throughput data convergence and real-time computing
● Real-time data warehouse
● High-performance HTAP
● DBaaS
45. Selected public TiDB user cases (Oct 2019)
Internet:
● 【美团点评】http://t.cn/EAFCqhl
● 【今日头条】http://t.cnEMf
● 【知乎】https://dwz.cn/er7iwbIz
● 【小红书】http://1t.click/M9v
● 【一面数据】http://t.cn/RT9r5di
● 【猿辅导】http://t.cn/RTKnKSX
● 【转转(一)】http://t.cn/R1MAXEq
● 【转转(二)】http://t.cn/EfuJ4vi
● 【爱奇艺】http://t.cn/EvErsc1
● 【易果生鲜】http://t.cn/RTYVhzH
● 【凤凰网】http://t.cn/RHRQfNT
● 【Mobikok】http://t.cn/Rm1F6lg
● 【摩拜单车(一)】http://t.c/RnLfn/RT8FbP6
● 【摩拜单车(二)】http://t.cn/EVjPYgj
● 【二维火】http://t.cn/R8bXM2f
● 【客如云】http://t.cn/R1wSEJH
● 【小米科技】http://t.cn/Ey2xCDK
● 【零氪科技】http://t.cn/REj7tSv
● 【同程旅游(一)】http://t.cn/RmXeNKR
● 【同程旅游(二)】http://t.cn/EAmsF08
● 【去哪儿】http://t.cn/RTKnsL7
● 【Shopee】http://t.cn/EbvJYl4
● 【G7】http://t.cn/RQVePoX
● 【乐视云】http://t.cn/Rnv3IVs
● 【火星文化】http://t.cn/EAuvfcs
Finance:
● 【北京银行(一)】http://t.cn/RnY8fGn
● 【北京银行(二)】http://t.cn/EXRSEmb
● 【微众银行(一)】http://1t.click/M8T
● 【微众银行(二)】http://1t.click/M94
● 【360金融】http://t.cn/RTKnTev
● 【量化派】http://t.cn/EUZ8Q3o
● 【中国电信翼支付】http://t.cn/R3Wd9p3
● 【华泰证券(一)】http://t.cn/RTKmUI9
● 【华泰证券(二)】http://1t.click/M9c
● 【平安科技】https://dwz.cn/G17tL7NK
● 【上海证券交易所】http://t.cn/EfuJs4D
● 【Ping++】http://t.cn/RE5xYKn
● 【贝壳金服】http://t.cn/EaXfV6h
● 【特来电】http://t.cn/RrHzUGW
● 【丰巢科技】http://t.cn/EAuvLIv
● 【盖娅互娱】http://t.cn/RT9r7hx
● 【威锐达测控】http://t.cn/R3CrviR
● 【西山居】http://t.cn/RBP12zj
● 【游族网络】http://t.cn/R8k4AWB
Manufacturing:
● 【海航】http://t.cn/REXx0Qe
Other large enterprises:
● 【某电信运营商】http://t.cn/RTYWADg
● 【万达网络】http://t.cn/RTKm6ds
● 【FUNYOURS JAPAN】http://t.cn/Rnoab5D
46. Thanks
Home page of PingCAP: https://www.pingcap.com/
Official publicity film: https://www.pingcap.com/about-cn/
Project: https://github.com/pingcap/tidb
Docs-cn: https://github.com/pingcap/docs-cn
Contact: info@pingcap.com
AskTUG Q&A community:
● Questions answered by the professional team and fellow TiDB users
● Regular online and offline events
● High-quality technical material sharing