TiDB 可观测性的设计与实现

如果无法正常显示,请先停止浏览器的去广告插件。
分享至:
1. TiDB 可观测性 的设计与实现 – 陈霜
2.
3. About me ● 陈霜, TiDB Insight R&G Engineer, PingCAP ● chenshuang@pingcap.com ● Github: crazycs520
4. Agenda ● TopSQL: Bind SQL With CPU Resources ● System Table ● ○ SQL Statements Implementation ○ SLOW QUERY Implementation Continuous Profiling Implementation
5. TopSQL: Bind SQL With CPU Resources
6. Background ● Database Performance is getting worse! ○ QPS is dropping ○ Latency is rising ● What OS tells us? ○ top / htop ○ High CPU usage ● Which line of code is running? ○ CPU Profiler
7. TopSQL
8. How CPU Profiling works: Sampling
9. How CPU Profiler works: Accumulation Stack Count A 1 A -> B 2 A -> B -> C 1 A -> D 1
10. CPU Profiling By SQL Request
11. CPU Profiling By SQL Request Request tag Count Request_ 1 10 Request_ 2 20 Request_ 3 30 Request_ 4 40 CPU Usage 40 30 20 10 1 2 3 4 Request ID
12. CPU Profiling By SQL Request ● rust-demo ● go-demo
13.
14. TopSQL: SQL Tag - Digest Original SQL select id from t where a=1; select id from t where a=2; Normalization Normalized SQL select id from t where a=?; Hash SQL Digest e27d8aa363129bf86cc3b35eb7a042c770802a539716e20ea2af30bd924 e5909
15. TopSQL CPU Usage 1 1 2 3 3 1 1 4 T1 2 3 4 T2 T3 2 3 4 T4 4 T5 1 2 3 3 3 4 4 T6 2 T7 1 SELECT * FROM t1 WHERE id = ? 2 UPDATE t2 SET a=a+1 WHERE id = ? 3 DELETE FROM t3 WHERE id = ? 4 SELECT COUNT(*) FROM t1 4 T8 Time
16. TopSQL
17. TopSQL MySQL Clients SQL Request CPU Usage By Tag TSO/Data location TiDB PD TiDB PD DistSQL API Ta g NG-Monitor PD TiKV TiKV TiKV CPU Usage By Tag Metadata TiKV Storage TiKV TiKV
18. Another approach to CPU resource bind in Go
19. Goroutine CPU Stats ● Try to collect goroutine runtime information. begin := GetGoroutineStats() executeSQL() end := GetGoroutineStats() queryCost := end.Sub(begin) // 获取执行 SQL 消耗的 cpu stats 信息
20. Trace ● Use Go trace to collect runtime information. curl http://localhost:10080/debug/pprof/trace\?seconds\=1 --output trace.out go tool trace trace.out ● In Goroutine analysis Page, chose the Goroutine you want to view.
21. Trace - drawback ● Large performance impact. ● May generate a lot of trace data, and then parsing this data is also consume a lot of time.
22. How Go trace works? Time(ms) Event goroutine_id description 0 GoCreate 1 创建一个 goroutine 10 GoStart 1 开始运行 goroutine 30 GoBlockSelect 1 被 block 了,暂停运行 goroutine 50 GoUnpark 1 goroutine 没有被 block 了 60 GoStart 1 开始运行 goroutine 80 GoEnd 1 结束 goroutine
23. How Go trace works? Time(ms) Event goroutine_id description 0 GoCreate 1 创建一个 goroutine 10 GoStart 1 开始运行 goroutine 30 GoBlockSelect 1 被 block 了,暂停运行 goroutine 50 GoUnpark 1 goroutine 没有被 block 了 60 GoStart 1 开始运行 goroutine 80 GoEnd 1 结束 goroutine Type Duration Calculate total_time 80ms 80-0 exec_time 40ms (30-10) + (80-60) sync_block_time 20ms 50 - 30 scheduler_wait_time 20ms (10-0) + (60-50)
24. Modify Go runtime to collect Goroutine stats type g struct { goid int64 // goroutine id. . . . stats GStats // goroutine runtime stats } // 创建 goroutine 时 traceGoCreate(newg, newg.startpc) newg.stats.recordGoCreate() // 开始执行 goroutine 时 traceGoStart() gp.stats.recordGoStart() type GStats struct { creationTime int64 endTime int64 lastStartTime blockSchedTime // goroutine 被抢占而停止运行时 int64 int64 traceGoPreempt() gp.stats.recordGoSched() ... execTime int64 schedWaitTime int64 ... } // 其他的埋点代码 … ...
25. Modify Go runtime to collect Goroutine stats ● demo: crazycs520/go # build git clone -b stats-dev2 https://github.com/crazycs520/go.git cd go/src ./make.bash cd .. export GOROOT=$PWD export PATH=$GOROOT/bin:$PATH # demo $ cd example $ go run goroutine_stats.go total: 1.054196173s, exec: 366.938307ms, network_wait: 131.648319ms, sync_block: 443.454523ms, block_syscall: 0s, scheduler_wait: 148.731µs, gc_sweep: 7.82µs
26.
27. Modify Go runtime to collect Goroutine stats ● Drawback ○ Need to maintain a go branch by ourself. ● Related Issue: ○ proposal: runtime: add per-goroutine CPU stats · Issue #41554
28. System Table
29. SQL Statements Implementation
30. STATEMENTS_SUMMARY ● Groups the SQL statements by the SQL digest and the plan digest, and provides statistics for each SQL category. ● Is a system memory table in INFORMATION_SCHEMA ○ LRU Cache ○ At most store 3000 kinds of SQL statements. Control by tidb_stmt_summary_max_stmt_count Is periodically cleared, and only recent (30min by default) aggregated results are retained and displayed. ○ tidb_stmt_summary_refresh_interval ● ● STATEMENTS_SUMMARY_HISTORY saves the historical data of STATEMENTS_SUMMARY ○ tidb_stmt_summary_history_size
31. Slow Query Implementation
32. SLOW_QUERY ● Is a system table in INFORMATION_SCHEMA ● The table data is parsed from slow-query-file. ● Use SQL to query SLOW_QUERY instead of grep slow-query-file.
33. SLOW QUERY FILE The slow-query-file consists of SQL statements that take more than tidb_slow_log_threshold milliseconds to execute. An example record in slow-query-file : # Time: 2019-08-14T09:26:59.487776265+08:00 # Txn_start_ts: 410450924122144769 # User: root@127.0.0.1 # Conn_ID: 3086 # Query_time: 1.527627037 # Parse_time: 0.000054933 # Compile_time: 0.000129729 # Cop_time: 0.387091637 Process_time: 0.613 Request_count: 2 Total_keys: 131073 Process_keys: 131072 Prewrite_time: 0.335415029 Commit_time: 0.032175429 Get_commit_ts_time: 0.000177098 Local_latch_wait_time: 0.106869448 Write_keys: 131072 Write_size: 3538944 Prewrite_region: 1 # DB: test # Is_internal: false # Digest: 50a2e32d2abbd6c1764b1b7f2058d428ef2712b029282b776beb9506a365c0f1 # Stats: t:pseudo # Num_cop_tasks: 2 # Cop_proc_avg: 0.3065 Cop_proc_p90: 0.322 Cop_proc_max: 0.322 Cop_proc_addr: 127.0.0.1:20160 # Mem_max: 525211 # Prepared: false # Plan_from_cache: false # Succ: true # Plan: tidb_decode_plan('ZJAwCTMyXzcJMAkyMAlkYXRhOlRhYmxlU2Nhbl82CjEJMTBfNgkxAR0AdAEY1Dp0LCByYW5nZTpbLWluZiwraW5mXSwga2VlcCBvcmRlcjpmYWxzZSwgc3RhdHM6cHNldWRvCg==') insert into t select * from t; The format is same with mysql slow query log,so tools like pt-query-digest also work.
34. CLUSTER TABLES STATEMENTS_SUMMARY/SLOW_QUERY only contains the data of current tidb-server. CLUSTER_STATEMENTS_SUMMARY/CLUSTER_SLOW_QUERY are system table of TiDB, It contains the related data of all tidb-servers. TiDB Cluster CLUSTER_SLOW_QUERY SLOW_QUERY SLOW_QUERY SLOW_QUERY TiDB TiDB TiDB
35. Continuous Profiling
36. Continuous Profiling
37. Continuous Profiling TiDB Cluster PD TiDB TiKV NG-Monitor
38.
39. Q&A

首页 - Wiki
Copyright © 2011-2025 iteam. Current version is 2.139.0. UTC+08:00, 2025-01-11 00:52
浙ICP备14020137号-1 $访客地图$