POSTGRESQL AT LOW LEVEL
1. POSTGRESQL AT LOW LEVEL
STAY CURIOUS!
DMITRY DOLGOV
17-05-2019
2. patroni & postgres-operator
3. pg_stat_*
PG
K8S
4. pg_stat_*
CPU/IO
PG
OS
K8S
5. pg_stat_*
CPU/IO
PG
???
CG
K8S
OS
6. pg_stat_*
CPU/IO
PG
???
???
VM
K8S
CG
OS
7. pg_stat_*
CPU/IO
PG
???
???
???
VM
K8S
CG
OS
9. Plan?
10. A bit chaotic
dailymail.co.uk
11. Info sources
source code
strace/GDB/Perf
procfs/sysfs
BPF/eBPF/BCC
12. Shared memory
ERROR: could not resize shared memory segment
"/PostgreSQL.699663942" to 50438144 bytes:
No space left on device
13. # strace -k -p PID
openat(AT_FDCWD, "/dev/shm/PostgreSQL.62223175"
ftruncate(176, 50438144) = 0
fallocate(176, 0, 0, 50438144) = -1 ENOSPC
 > libc-2.27.so(posix_fallocate+0x16) [0x114f76]
 > postgres(dsm_create+0x67) [0x377067]
 ...
 > postgres(ExecInitParallelPlan+0x360) [0x254a80]
 > postgres(ExecGather+0x495) [0x269115]
 > postgres(standard_ExecutorRun+0xfd) [0x25099d]
 ...
 > postgres(exec_simple_query+0x19f) [0x39afdf]
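The trace above shows fallocate() on a file under /dev/shm failing with ENOSPC, so one quick check is how much room that filesystem has left. The following is a minimal sketch, not part of the talk; the helper names shm_free_bytes and fits_in_shm are hypothetical, and it assumes /dev/shm is the tmpfs backing dynamic shared memory, as the openat() path suggests.

```python
import os


def shm_free_bytes(path="/dev/shm"):
    """Bytes still available to unprivileged users on the filesystem at path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


def fits_in_shm(required_bytes, path="/dev/shm"):
    """True if a segment of required_bytes would currently fit at path."""
    return required_bytes <= shm_free_bytes(path)


if __name__ == "__main__":
    requested = 50438144  # size passed to the failing fallocate() call
    # Fall back to the current directory where /dev/shm does not exist.
    target = "/dev/shm" if os.path.isdir("/dev/shm") else "."
    print(f"{target}: {shm_free_bytes(target)} bytes free, "
          f"request of {requested} fits: {fits_in_shm(requested, target)}")
```

Run inside the affected container, this would show whether the roughly 48 MiB request can be satisfied at all. Free space can change between the check and the allocation, so this is only a diagnostic.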
15. vDSO
# strace -k -p PID on XEN
gettimeofday({tv_sec=1550586520, tv_usec=313499}, NULL) = 0
> [vdso]() [0xef0]
Two frequently used system calls are 77% slower on AWS EC2
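When gettimeofday shows up in strace like this, it can be worth checking which clocksource the kernel is using. A minimal sketch, assuming the standard Linux sysfs location for clocksource information; the function name read_clocksource is hypothetical.

```python
from pathlib import Path

# Standard sysfs directory on Linux; other systems will not have it.
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources, or (None, []) if unreadable."""
    base = Path(base)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


if __name__ == "__main__":
    current, available = read_clocksource()
    print("current:  ", current)
    print("available:", ", ".join(available))
```

Whether a given clocksource is served through the vDSO depends on the kernel and the hypervisor, so the output is a starting point for investigation, not a verdict.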
16. Scheduling
[Diagram labels: T2, T3, c, c]
18. Andres Freund: New intel MDS vulnerability mitigations cause measurable slowdown
19. MDS
# Children      Self  Symbol
# ........  ........  ...................................
    71.06%     0.00%  [.] __libc_start_main
    71.06%     0.00%  [.] PostmasterMain
    56.82%     0.14%  [.] exec_simple_query
    25.19%     0.06%  [k] entry_SYSCALL_64_after_hwframe
    25.14%     0.29%  [k] do_syscall_64
    23.60%     0.14%  [.] standard_ExecutorRun
21. MDS
# Percent   Disassembly of kcore for cycles
# ........  ................................
    0.01% : nopl   0x0(%rax,%rax,1)
   28.94% : verw   0xffe9e1(%rip)
    0.55% : pop    %rbx
    3.24% : pop    %rbp
23. MDS
# Overhead  Symbol
# ........  ...................................
    25.19%  [k] native_safe_halt
24. MDS
static inline __cpuidle void native_safe_halt(void)
{
        mds_idle_clear_cpu_buffers();
        asm volatile("sti; hlt": : :"memory");
}
26. Huge pages
transparent vs classic
With huge pages, TLB misses are less frequent and cheaper to resolve
27. Huge pages
# perf record -e dTLB-load-misses,dTLB-store-misses -p PID
# huge_pages on
Samples: 832K of event 'dTLB-load-misses'
Event count (approx.): 640614445 : ~19% less
Samples: 736K of event 'dTLB-store-misses'
Event count (approx.): 72447300 : ~29% less
# huge_pages off
Samples: 894K of event 'dTLB-load-misses'
Event count (approx.): 784439650
Samples: 822K of event 'dTLB-store-misses'
Event count (approx.): 101471557
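The "% less" annotations can be re-derived from the event counts. A small sketch of that arithmetic; the function name percent_less is hypothetical, and the numbers are the ones shown on the slide.

```python
def percent_less(with_huge_pages, without_huge_pages):
    """Relative reduction of an event count, in percent."""
    return 100.0 * (without_huge_pages - with_huge_pages) / without_huge_pages


# Event counts from the perf output: huge_pages on vs. huge_pages off.
load_reduction = percent_less(640614445, 784439650)
store_reduction = percent_less(72447300, 101471557)

print(f"dTLB-load-misses:  {load_reduction:.1f}% less")
print(f"dTLB-store-misses: {store_reduction:.1f}% less")
```

The results land close to the rounded figures quoted on the slide.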
29. VM
Lock holder preemption problem
Lock waiter preemption problem
Intel PLE (pause loop exiting)
PLE_Gap, PLE_Window
Intel® 64 and IA-32 Architectures Software Developer’s Manual, Vol. 3
30. vCPU
[Diagram labels: vC1, vC2, vC3, vC4, Hypervisor]
33. # latency average = 17.782 ms
=> modprobe kvm-intel ple_gap=128
=> perf record -e kvm:kvm_exit
reason PAUSE_INSTRUCTION 306795
34. # latency average = 17.782 ms
=> modprobe kvm-intel ple_gap=128
=> perf record -e kvm:kvm_exit
reason PAUSE_INSTRUCTION 306795
# latency average = 16.858 ms
=> modprobe kvm-intel ple_gap=0
=> perf record -e kvm:kvm_exit
reason PAUSE_INSTRUCTION 0
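Before comparing runs like these, it helps to confirm which ple_gap value the loaded module is actually using. A minimal sketch, assuming the usual sysfs layout for module parameters (/sys/module/<module>/parameters/<name>) and that the module directory is named kvm_intel; the helper name read_module_param is hypothetical.

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return a kernel module parameter as a string, or None if unavailable."""
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Prints None for each value when kvm_intel is not loaded on this host.
    for name in ("ple_gap", "ple_window"):
        print(name, "=", read_module_param("kvm_intel", name))
```

This has to run on the hypervisor host; inside a guest the kvm_intel module is normally absent and both values come back as None.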
37. Userspace
[Diagram labels: vfs_read, Bytecode, Regs, Stack, Maps]
40. Tunables
# from /proc/sys/kernel/
sched_wakeup_granularity_ns
# default = 1 msec * (1 + ilog(ncpus))
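The default formula in the comment can be evaluated for a given CPU count. This is a sketch of that arithmetic only. The function name and the cpu_cap parameter are assumptions for illustration: to my knowledge the kernel caps the CPU count used in this scaling at 8, but check that against the scheduler source for your kernel version.

```python
BASE_NS = 1_000_000  # 1 msec, the base value from the comment above


def ilog2(n):
    """Integer base-2 logarithm, rounded down (n must be >= 1)."""
    return n.bit_length() - 1


def default_wakeup_granularity_ns(ncpus, cpu_cap=8):
    """1 msec * (1 + ilog2(ncpus)), with ncpus clamped to [1, cpu_cap].

    cpu_cap is an assumption about the kernel's scaling, not from the slide.
    """
    cpus = max(1, min(ncpus, cpu_cap))
    return BASE_NS * (1 + ilog2(cpus))


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 64):
        print(f"{n:>3} CPUs -> {default_wakeup_granularity_ns(n)} ns")
```

The value in effect can be compared against sched_wakeup_granularity_ns under /proc/sys/kernel/ on kernels that expose it there.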
41. pgbench and pg_dump
     usecs           : count    distribution
         0 -> 1      : 16       :                                        :
         2 -> 3      : 4604     :**                                      :
         4 -> 7      : 6812     :****                                    :
         8 -> 15     : 14888    :*********                               :
        16 -> 31     : 19267    :***********                             :
        32 -> 63     : 65795    :****************************************:
        64 -> 127    : 50454    :******************************          :
       128 -> 255    : 16393    :*********                               :
       256 -> 511    : 5981     :***                                     :
       512 -> 1023   : 12300    :*******                                 :
      1024 -> 2047   : 48       :                                        :
      2048 -> 4095   : 0        :                                        :
user 1m9.127s
sys  0m2.066s
real 1m38.990s
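The histogram groups latencies into power-of-two buckets (0 -> 1, 2 -> 3, 4 -> 7, and so on). The following sketch shows how such buckets can be computed from raw samples. It imitates the layout of the output and is not the code the tracing tool itself uses; all function names are hypothetical.

```python
from collections import Counter


def log2_bucket(value):
    """Bucket index for a non-negative integer: 0-1 -> 0, 2-3 -> 1, 4-7 -> 2."""
    return max(value.bit_length() - 1, 0)


def bucket_range(index):
    """Inclusive (low, high) bounds of a bucket."""
    low = 0 if index == 0 else 1 << index
    high = (1 << (index + 1)) - 1
    return low, high


def log2_hist(values):
    """List of ((low, high), count) from bucket 0 up to the highest used one."""
    counts = Counter(log2_bucket(v) for v in values)
    top = max(counts) if counts else -1
    return [(bucket_range(i), counts.get(i, 0)) for i in range(top + 1)]


def render(hist, width=40):
    """Text rendering with bars scaled so the fullest bucket spans width."""
    peak = max((count for _, count in hist), default=0)
    lines = []
    for (low, high), count in hist:
        stars = "*" * (count * width // peak) if peak else ""
        lines.append(f"{low:>6} -> {high:<6} : {count:<8} |{stars:<{width}}|")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(log2_hist([0, 1, 2, 3, 5, 9, 9, 9, 40, 700])))
```

Scaling the bars with integer division against the largest bucket reproduces the bar lengths shown above, for example 30 stars for 50454 against a peak of 65795.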
43. pgbench and pg_dump
     usecs           : count    distribution
         0 -> 1      : 1        :                                        :
         2 -> 3      : 8        :                                        :
         4 -> 7      : 25       :                                        :
         8 -> 15     : 46       :*                                       :
        16 -> 31     : 189      :*******                                 :
        32 -> 63     : 119      :****                                    :
        64 -> 127    : 96       :***                                     :
       128 -> 255    : 93       :***                                     :
       256 -> 511    : 238      :*********                               :
       512 -> 1023   : 323      :************                            :
      1024 -> 2047   : 1012     :****************************************:
      2048 -> 4095   : 47       :*                                       :
user 1m8.559s
sys  0m1.641s
real 1m32.030s
45. github.com/iovisor/bcc/
github.com/erthalion/postgres-bcc
46. Cache
=> llcache_per_query.py bin/postgres
PID   QUERY                        CPU  REFERENCE  MISS  HIT%
9720  UPDATE pgbench_tellers ...   0    2000       1000  50.00%
9720  SELECT abalance FROM ...     2    2000       100   95.00%
...
Total References: 3303100 Total Misses: 599100 Hit Rate: 81.86%
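The HIT% column and the final hit rate follow directly from the reference and miss counts. A short sketch of that calculation, using the numbers from the output above; the function name hit_rate is hypothetical.

```python
def hit_rate(references, misses):
    """Cache hit rate in percent; 0.0 when there were no references."""
    if references == 0:
        return 0.0
    return 100.0 * (references - misses) / references


if __name__ == "__main__":
    rows = [
        ("UPDATE pgbench_tellers ...", 2000, 1000),
        ("SELECT abalance FROM ...", 2000, 100),
        ("total", 3303100, 599100),
    ]
    for label, references, misses in rows:
        print(f"{label:<28} {hit_rate(references, misses):.2f}%")
```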
47. Remember?
48. Shared memory
=> shmem.py bin/postgres
mmap:
[20439]: 142M
anon shm:
[20439]: 56B
shm:
[postmaster.opts]: 0B
[PostgreSQL.57332071]: 7K
49. Dirty pages
[Diagram labels: bgw, linux, chkp, OS Cache, Storage]
53. Writeback (cgroup v1)
/* vmscan.c */
/* The normal page dirty throttling mechanism
* in balance_dirty_pages() is completely broken
* with the legacy memcg and direct stalling in
* shrink_page_list() is used for throttling instead,
* which lacks all the niceties such as fairness,
* adaptive pausing, bandwidth proportional
* allocation and configurability.
*/
static bool sane_reclaim(struct scan_control *sc)
54. Pages written, kernel
55. Writeback
=> perf record -e writeback:writeback_written
kworker/u8:1 reason=periodic   nr_pages=101429
kworker/u8:1 reason=background nr_pages=MAX_ULONG
kworker/u8:3 reason=periodic   nr_pages=101457
56. Writeback
# pgbench insert workload
=> io_timeouts.py bin/postgres
[18335] END: MAX_SCHEDULE_TIMEOUT
[18333] END: MAX_SCHEDULE_TIMEOUT
[18331] END: MAX_SCHEDULE_TIMEOUT
[18318] truncate pgbench_history: MAX_SCHEDULE_TIMEOUT
57. Kubernetes
resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"
58. Kubernetes
resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"
memory.soft_limit_in_bytes
memory.limit_in_bytes
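To compare the manifest with what the cgroup files report, the quantities have to be converted to plain numbers. A minimal sketch, assuming the binary (Ki, Mi, Gi) and decimal (k, M, G) memory suffixes, and assuming the common default CFS period of 100000 microseconds for the CPU limit. The function names are hypothetical, and how each value maps to a particular cgroup file depends on the container runtime and cgroup version.

```python
MEMORY_SUFFIXES = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3,
    "k": 1000, "M": 1000 ** 2, "G": 1000 ** 3,
}


def memory_to_bytes(quantity):
    """Convert a quantity such as "128Mi" to bytes (integer amounts only)."""
    # Try two-letter suffixes first so "Mi" is not mistaken for "M".
    for suffix in sorted(MEMORY_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * MEMORY_SUFFIXES[suffix]
    return int(quantity)


def cpu_to_millicores(quantity):
    """Convert "500m" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def cfs_quota_us(cpu_limit, period_us=100_000):
    """CPU time per period for a limit; period_us is an assumed default."""
    return cpu_to_millicores(cpu_limit) * period_us // 1000


if __name__ == "__main__":
    print("memory request:", memory_to_bytes("64Mi"), "bytes")
    print("memory limit:  ", memory_to_bytes("128Mi"), "bytes")
    print("cpu limit:     ", cfs_quota_us("500m"), "us per 100000 us period")
```

For the manifest above, the memory limit works out to 134217728 bytes, which is the value to look for in the container's memory limit file.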
61. Memory reclaim
# only under memory pressure
=> page_reclaim.py --container 89c33bb3133f
[7382] postgres: 928K
[7138] postgres: 152K
[7136] postgres: 180K
[7468] postgres: 72M
[7464] postgres: 57M
[5451] postgres: 1M
62. How to run?
# bcc + postgres-bcc
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_NET_CLS_BPF=m
CONFIG_NET_ACT_BPF=m
CONFIG_BPF_JIT=y
CONFIG_BPF_EVENTS=y
debugfs on /sys/kernel/debug type debugfs (rw)
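The options listed above can be checked against a kernel config file. The following sketch parses config text and reports which options are missing. The function names are hypothetical, and where the config lives (for example a file under /boot or a compressed /proc/config.gz) varies by distribution, so reading it is left to the caller.

```python
REQUIRED_OPTIONS = (
    "CONFIG_BPF",
    "CONFIG_BPF_SYSCALL",
    "CONFIG_NET_CLS_BPF",
    "CONFIG_NET_ACT_BPF",
    "CONFIG_BPF_JIT",
    "CONFIG_BPF_EVENTS",
)


def parse_kernel_config(text):
    """Map option names to values, skipping comments and blank lines."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            options[key] = value
    return options


def missing_options(text, required=REQUIRED_OPTIONS):
    """Required options that are neither built in (y) nor modules (m)."""
    options = parse_kernel_config(text)
    return [name for name in required if options.get(name) not in ("y", "m")]


if __name__ == "__main__":
    sample = "CONFIG_BPF=y\nCONFIG_BPF_SYSCALL=y\n# CONFIG_BPF_JIT is not set\n"
    print("missing:", missing_options(sample))
```

Accepting both y and m is a simplification: the list above shows which options were built in and which were modules in the author's setup, and a stricter check could compare exact values.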
63. How to run: container?
# sometimes you also need to let perf know
# where to find debugging symbols, e.g. copy
# from /usr/lib/.debug/
docker run \
  --privileged \
  --net=container:<container-id> \
  --ipc=container:<container-id>
64. How to run: K8S?
spec:
  serviceAccountName: "bcc"
  hostPID: true
  containers:
  - name: "bcc"
    securityContext:
      privileged: true
# 4 * 65536 + 14 * 256 + 96
=> export BCC_LINUX_VERSION_CODE=265824
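The value 265824 comes from packing the kernel version 4.14.96 into a single integer, as the comment shows. A small sketch of that calculation for an arbitrary release string; the function name linux_version_code is hypothetical, and patch levels above 255 are not handled.

```python
def linux_version_code(release):
    """Pack "major.minor.patch" as major * 65536 + minor * 256 + patch.

    Suffixes such as "-generic" are ignored; missing components count as 0.
    """
    numeric = release.split("-")[0].split("+")[0]
    parts = [int(p) for p in numeric.split(".")[:3]]
    major, minor, patch = (parts + [0, 0, 0])[:3]
    return major * 65536 + minor * 256 + patch


if __name__ == "__main__":
    # The release string would normally come from `uname -r` on the node.
    print(linux_version_code("4.14.96"))
```

For "4.14.96" this gives 4 * 65536 + 14 * 256 + 96, matching the value exported above.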
65. How to break?
# unsafe access
=> perf probe -x bin/postgres --funcs
=> perf probe -x bin/postgres 'ExecCallTriggerFunc trigdata->?'
=> perf record probe_postgres:ExecCallTriggerFunc
66. How to break?
# non interruptible sleep
=> perf probe -x bin/postgres --funcs
=> perf probe -x bin/postgres 'XLogInsertRecord fpw_lsn'
67. How to break?
68. Questions?
github.com/erthalion
github.com/erthalion/postgres-bcc
@erthalion
dmitrii.dolgov at zalando dot de
9erthalion6 at gmail dot com