LISA2019 Linux Systems Performance
1. Oct, 2019
Linux Systems
Performance
Brendan Gregg
Senior Performance Engineer
USENIX LISA 2019, Portland, Oct 28-30
2. Experience: A 3x Perf Difference
3. mpstat
load averages: serverA 90, serverB 17
serverA# mpstat 10
Linux 4.4.0-130-generic (serverA) 07/18/2019 _x86_64_ (48 CPU)
10:07:55 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
10:08:05 PM  all  89.72   0.00   7.84    0.00   0.00   0.04    0.00    0.00    0.00   2.40
10:08:15 PM  all  88.60   0.00   9.18    0.00   0.00   0.05    0.00    0.00    0.00   2.17
10:08:25 PM  all  89.71   0.00   9.01    0.00   0.00   0.05    0.00    0.00    0.00   1.23
[...]
Average:     all  89.49   0.00   8.47    0.00   0.00   0.05    0.00    0.00    0.00   1.99
serverB# mpstat 10
Linux 4.19.26-nflx (serverB) 07/18/2019 _x86_64_ (64 CPU)
09:56:11 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
09:56:21 PM  all  23.21   0.01   0.32    0.00   0.00   0.10    0.00    0.00    0.00  76.37
09:56:31 PM  all  20.21   0.00   0.38    0.00   0.00   0.08    0.00    0.00    0.00  79.33
09:56:41 PM  all  21.58   0.00   0.39    0.00   0.00   0.10    0.00    0.00    0.00  77.92
[...]
Average:     all  21.50   0.00   0.36    0.00   0.00   0.09    0.00    0.00    0.00  78.04
4. pmcarch
serverA# ./pmcarch -p 4093 10
K_CYCLES   K_INSTR    IPC  BR_RETIRED    BR_MISPRED  BMR% LLCREF       LLCMISS      LLC%
982412660  575706336  0.59 126424862460  2416880487  1.91 15724006692  10872315070  30.86
999621309  555043627  0.56 120449284756  2317302514  1.92 15378257714  11121882510  27.68
991146940  558145849  0.56 126350181501  2530383860  2.00 15965082710  11464682655  28.19
996314688  562276830  0.56 122215605985  2348638980  1.92 15558286345  10835594199  30.35
979890037  560268707  0.57 125609807909  2386085660  1.90 15828820588  11038597030  30.26
^C

serverB# ./pmcarch -p 1928219 10
K_CYCLES   K_INSTR    IPC  BR_RETIRED    BR_MISPRED  BMR% LLCREF      LLCMISS     LLC%
147523816  222396364  1.51 46053921119   641813770   1.39 8880477235  968809014   89.09
156634810  229801807  1.47 48236123575   653064504   1.35 9186609260  1183858023  87.11
152783226  237001219  1.55 49344315621   692819230   1.40 9314992450  879494418   90.56
140787179  213570329  1.52 44518363978   631588112   1.42 8675999448  712318917   91.79
136822760  219706637  1.61 45129020910   651436401   1.44 8689831639  617678747   92.89
5. perf
serverA# perf stat -e cs -a -I 1000
#           time             counts unit events
     1.000411740          2,063,105      cs
     2.000977435          2,065,354      cs
     3.001537756          1,527,297      cs
     4.002028407            515,509      cs
     5.002538455          2,447,126      cs
[...]

serverB# perf stat -e cs -p 1928219 -I 1000
#           time             counts unit events
     1.001931945              1,172      cs
     2.002664012              1,370      cs
     3.003441563              1,034      cs
     4.004140394              1,207      cs
     5.004947675              1,053      cs
[...]
6. bcc/BPF
serverA# /usr/share/bcc/tools/cpudist -p 4093 10 1
Tracing on-CPU time... Hit Ctrl-C to end.
     usecs               : count     distribution
         0 -> 1          : 3618650   |****************************************|
         2 -> 3          : 2704935   |*****************************           |
         4 -> 7          : 421179    |****                                    |
         8 -> 15         : 99416     |*                                       |
        16 -> 31         : 16951     |                                        |
        32 -> 63         : 6355      |                                        |
[...]
serverB# /usr/share/bcc/tools/cpudist -p 1928219 10 1
Tracing on-CPU time... Hit Ctrl-C to end.
     usecs               : count     distribution
       256 -> 511        : 44        |                                        |
       512 -> 1023       : 156       |*                                       |
      1024 -> 2047       : 238       |**                                      |
      2048 -> 4095       : 4511      |****************************************|
      4096 -> 8191       : 277       |**                                      |
      8192 -> 16383      : 286       |**                                      |
     16384 -> 32767      : 77        |                                        |
[...]
7. Systems Performance in 45 mins
• This is slides + discussion
• For more detail and stand-alone texts:
8. Agenda
1. Observability
2. Methodologies
3. Benchmarking
4. Profiling
5. Tracing
6. Tuning
10. 1. Observability
11. How do you measure these?
12. Linux Observability Tools
13. Why Learn Tools?
• Most analysis at Netflix is via GUIs
• Benefits of command-line tools:
– Helps you understand GUIs: they show the same metrics
– Often documented, unlike GUI metrics
– Often have useful options not exposed in GUIs
• Installing essential tools (something like):
$ sudo apt-get install sysstat bcc-tools bpftrace linux-tools-common \
linux-tools-$(uname -r) iproute2 msr-tools
$ git clone https://github.com/brendangregg/msr-cloud-tools
$ git clone https://github.com/brendangregg/bpf-perf-tools-book
These are crisis tools and should be installed by default
In a performance meltdown you may be unable to install them
14. uptime
• One way to print load averages:
$ uptime
 07:42:06 up  8:16,  1 user,  load average: 2.27, 2.84, 2.91
• A measure of resource demand: CPUs + disks
– Includes TASK_UNINTERRUPTIBLE state to show all demand types
– You can use BPF & off-CPU flame graphs to explain this state:
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
– PSI in Linux 4.20 shows CPU, I/O, and memory loads
• Exponentially-damped moving averages
– With time constants of 1, 5, and 15 minutes. See historic trend.
• Load > # of CPUs, may mean CPU saturation
Don’t spend more than 5 seconds studying these
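• The PSI pressure files mentioned above can be read directly; a quick check, assuming a 4.20+ kernel built with PSI (CONFIG_PSI):
  $ cat /proc/pressure/cpu       # avg10/avg60/avg300 = % of time tasks were stalled
  $ cat /proc/pressure/io
  $ cat /proc/pressure/memory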
15. top
• System and per-process interval summary:
$ top - 18:50:26 up 7:43,  1 user,  load average: 4.11, 4.91, 5.22
Tasks: 209 total,   1 running, 206 sleeping,   0 stopped,   2 zombie
Cpu(s): 47.1%us,  4.0%sy,  0.0%ni, 48.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.2%st
Mem:  70197156k total, 44831072k used, 25366084k free,    36360k buffers
Swap:        0k total,        0k used,        0k free, 11873356k cached

  PID USER     PR  NI  VIRT   RES   SHR S %CPU %MEM    TIME+  COMMAND
 5738 apiprod  20   0 62.6g   29g  352m S  417 44.2  2144:15  java
 1386 apiprod  20   0 17452  1388   964 R    0  0.0  0:00.02  top
    1 root     20   0 24340  2272  1340 S    0  0.0  0:01.51  init
    2 root     20   0     0     0     0 S    0  0.0  0:00.00  kthreadd
[…]
• %CPU is summed across all CPUs
• Can miss short-lived processes (atop won’t)
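• One way to catch those short-lived processes (a sketch, assuming bcc-tools is installed as shown earlier):
  # /usr/share/bcc/tools/execsnoop    # trace new process execution (execve)
  # atop 1                            # interval summaries that include exited processes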
16. htop
$ htop
  1 [||||||||||70.0%]  13 [||||||||||70.6%]  25 [||||||||||69.7%]  37 [||||||||||66.6%]
  2 [||||||||||68.7%]  14 [||||||||||69.4%]  26 [||||||||||67.7%]  38 [||||||||||66.0%]
  3 [||||||||||68.2%]  15 [||||||||||68.5%]  27 [||||||||||68.8%]  39 [||||||||||73.3%]
  4 [||||||||||69.3%]  16 [||||||||||69.2%]  28 [||||||||||67.6%]  40 [||||||||||67.0%]
  5 [||||||||||68.0%]  17 [||||||||||67.6%]  29 [||||||||||70.1%]  41 [||||||||||66.5%]
  […]
  Mem[||||||||||||||||||||||||||||||176G/187G]  Tasks: 80, 3206 thr; 43 running
  Swp[                                  0K/0K]  Load average: 36.95 37.19 38.29
                                                Uptime: 01:39:36

  PID USER      PRI NI VIRT  RES   SHR   S CPU%  MEM%  TIME+     Command
 4067 www-data   20  0 202G  173G  55392 S 3359  93.0  48h51:30  /apps/java/bin/java -Dnop -Djdk.map
 6817 www-data   20  0 202G  173G  55392 R 56.9  93.0  48:37.89  /apps/java/bin/java -Dnop -Djdk.map
 6826 www-data   20  0 202G  173G  55392 R 25.7  93.0  22:26.90  /apps/java/bin/java -Dnop -Djdk.map
 6721 www-data   20  0 202G  173G  55392 S 25.0  93.0  22:05.51  /apps/java/bin/java -Dnop -Djdk.map
 6616 www-data   20  0 202G  173G  55392 S 13.6  93.0  11:15.51  /apps/java/bin/java -Dnop -Djdk.map
[…]
F1Help  F2Setup  F3Search  F4Filter  F5Tree  F6SortBy  F7Nice-  F8Nice+  F9Kill  F10Quit

• Pros: configurable. Cons: misleading colors.
• dstat is similar, and now dead (May 2019); see pcp-dstat
17. vmstat
• Virtual memory statistics and more:
$ vmstat -Sm 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 8  0      0   1620    149    552    0    0     1   179   77   12 25 34  0  0
 7  0      0   1598    149    552    0    0     0     0  205  186 46 13  0  0
 8  0      0   1617    149    552    0    0     0     8  210  435 39 21  0  0
 8  0      0   1589    149    552    0    0     0     0  218  219 42 17  0  0
[…]
• USAGE: vmstat [interval [count]]
• First output line has some summary since boot values
• High level CPU summary
– “r” is runnable tasks
18. iostat
• Block I/O (disk) stats. 1st output is since boot.
$ iostat -xz 1
Linux 5.0.21 (c099.xxxx)   06/24/19   _x86_64_   (32 CPU)
[...]
Device        r/s      w/s      rkB/s      wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm ...
sda           0.01     0.00      0.16       0.00    0.00    0.00   0.00   0.00 ...
nvme3n1   19528.04    20.39 293152.56   14758.05    0.00    4.72   0.00  18.81 ...
nvme1n1   18513.51    17.83 286402.15   13089.56    0.00    4.05   0.00  18.52 ...
nvme0n1   16560.88    19.70 258184.52   14218.55    0.00    4.78   0.00  19.51 ...

... r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
...    1.90     0.00    0.00     17.01      0.00   1.13   0.00
...    0.13    53.56    1.05     15.01    723.80   0.02  47.29
...    0.13    49.26    0.85     15.47    734.21   0.03  48.09
...    0.13    50.46    0.96     15.59    721.65   0.03  46.64

(Wide output wraps: the left columns characterize the workload — a very useful set of
stats — and the right columns show the resulting performance.)
19. free
• Main memory usage:
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          23850       18248         592        3776        5008        1432
Swap:         31699        2021       29678

• Recently added "available" column
  – buff/cache: block device I/O cache + virtual page cache
  – available: memory likely available to apps
  – free: completely unused memory
20. strace
• System call tracer:
$ strace -tttT -p 313
1408393285.779746 getgroups(0, NULL)            = 1 <0.000016>
1408393285.779873 getgroups(1, [0])             = 1 <0.000015>
1408393285.780797 close(3)                      = 0 <0.000016>
1408393285.781338 write(1, "wow much syscall\n", 17wow much syscall
) = 17 <0.000048>
• Translates syscall arguments
• Not all kernel requests (e.g., page faults)
• Currently has massive overhead (ptrace based)
– Can slow the target by > 100x. Skews measured time (-ttt, -T).
–
http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html
• perf trace will replace it: uses a ring buffer & BPF
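• For example, a lower-overhead look at the same process (a sketch; perf trace options vary by perf version):
  # perf trace -p 313           # trace syscalls of PID 313; Ctrl-C to end
  # perf trace -e 'open*' ls    # run a command, tracing only open-family syscalls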
21. tcpdump
• Sniff network packets for post analysis:
$ tcpdump -i eth0 -w /tmp/out.tcpdump
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^C7985 packets captured
8996 packets received by filter
1010 packets dropped by kernel
# tcpdump -nr /tmp/out.tcpdump | head
reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18...
20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48...
20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96...
[…]
• Study packet sequences with timestamps (us)
• CPU overhead optimized (socket ring buffers), but can
still be significant. Use BPF in-kernel summaries
instead.
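• Some in-kernel alternatives from bcc (a sketch, assuming bcc-tools is installed; these summarize in kernel context instead of capturing every packet):
  # /usr/share/bcc/tools/tcplife      # log TCP sessions with duration and throughput
  # /usr/share/bcc/tools/tcpretrans   # trace TCP retransmits as they happen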
22. nstat
• Replacement for netstat from iproute2
• Various network protocol statistics:
  – -s won’t reset counters, otherwise intervals can be examined (see the example below)
  – -d for daemon mode
• Linux keeps adding more counters

$ nstat -s
#kernel
IpInReceives                    31109659           0.0
IpInDelivers                    31109371           0.0
IpOutRequests                   33209552           0.0
[...]
TcpActiveOpens                  508924             0.0
TcpPassiveOpens                 388584             0.0
TcpAttemptFails                 933                0.0
TcpEstabResets                  1545               0.0
TcpInSegs                       31099176           0.0
TcpOutSegs                      56254112           0.0
TcpRetransSegs                  3762               0.0
TcpOutRsts                      3183               0.0
[...]
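• For example, examining per-interval deltas (a minimal sketch; nstat keeps its own history, so running it without -s shows the change since the previous run):
  # nstat > /dev/null                          # establish a baseline
  # sleep 10; nstat TcpRetransSegs TcpOutRsts  # deltas for the last 10 seconds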
23. slabtop
• Kernel slab allocator memory usage:
$ slabtop
 Active / Total Objects (% used)    : 4692768 / 4751161 (98.8%)
 Active / Total Slabs (% used)      : 129083 / 129083 (100.0%)
 Active / Total Caches (% used)     : 71 / 109 (65.1%)
 Active / Total Size (% used)       : 729966.22K / 738277.47K (98.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.16K / 8.00K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
3565575 3565575 100%    0.10K  91425       39    365700K buffer_head
 314916  314066  99%    0.19K  14996       21     59984K dentry
 184192  183751  99%    0.06K   2878       64     11512K kmalloc-64
 138618  138618 100%    0.94K   4077       34    130464K xfs_inode
 138602  138602 100%    0.21K   3746       37     29968K xfs_ili
 102116   99012  96%    0.55K   3647       28     58352K radix_tree_node
  97482   49093  50%    0.09K   2321       42      9284K kmalloc-96
  22695   20777  91%    0.05K    267       85      1068K shared_policy_node
  21312   21312 100%    0.86K    576       37     18432K ext4_inode_cache
  16288   14601  89%    0.25K    509       32      4072K kmalloc-256
[…]
24. pcstat
• Show page cache residency by file:
# ./pcstat data0*
|----------+----------------+------------+-----------+---------|
| Name     | Size           | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| data00   | 104857600      | 25600      | 25600     | 100.000 |
| data01   | 104857600      | 25600      | 25600     | 100.000 |
| data02   | 104857600      | 25600      | 4080      | 015.938 |
| data03   | 104857600      | 25600      | 25600     | 100.000 |
| data04   | 104857600      | 25600      | 16010     | 062.539 |
| data05   | 104857600      | 25600      | 0         | 000.000 |
|----------+----------------+------------+-----------+---------|
• Uses mincore(2) syscall. Used for database perf analysis.
25. docker stats
• Soft limits (cgroups) by container:
# docker stats
CONTAINER     CPU %    MEM USAGE / LIMIT      MEM %   NET I/O    BLOCK I/O        PIDS
353426a09db1  526.81%  4.061 GiB / 8.5 GiB    47.78%  0 B / 0 B  2.818 MB / 0 B   247
6bf166a66e08  303.82%  3.448 GiB / 8.5 GiB    40.57%  0 B / 0 B  2.032 MB / 0 B   267
58dcf8aed0a7  41.01%   1.322 GiB / 2.5 GiB    52.89%  0 B / 0 B  0 B / 0 B        229
61061566ffe5  85.92%   220.9 MiB / 3.023 GiB  7.14%   0 B / 0 B  43.4 MB / 0 B    61
bdc721460293  2.69%    1.204 GiB / 3.906 GiB  30.82%  0 B / 0 B  4.35 MB / 0 B    66
6c80ed61ae63  477.45%  557.7 MiB / 8 GiB      6.81%   0 B / 0 B  9.257 MB / 0 B   19
337292fb5b64  89.05%   766.2 MiB / 8 GiB      9.35%   0 B / 0 B  5.493 MB / 0 B   19
b652ede9a605  173.50%  689.2 MiB / 8 GiB      8.41%   0 B / 0 B  6.48 MB / 0 B    19
d7cd2599291f  504.28%  673.2 MiB / 8 GiB      8.22%   0 B / 0 B  12.58 MB / 0 B   19
05bf9f3e0d13  314.46%  711.6 MiB / 8 GiB      8.69%   0 B / 0 B  7.942 MB / 0 B   19
09082f005755  142.04%  693.9 MiB / 8 GiB      8.47%   0 B / 0 B  8.081 MB / 0 B   19
[...]
• Stats are in /sys/fs/cgroup
• CPU shares and bursting breaks monitoring assumptions
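• For example, reading the underlying counters directly (a hedged sketch; the exact path depends on cgroup v1 vs v2 and the Docker cgroup driver):
  # cat /sys/fs/cgroup/cpuacct/docker/<container-id>/cpuacct.usage        # cgroup v1: total CPU time (ns)
  # cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpu.stat  # cgroup v2, systemd driver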
26. showboost
• Determine current CPU clock rate
# showboost
Base CPU MHz : 2500
Set CPU MHz  : 2500
Turbo MHz(s) : 3100 3200 3300 3500
Turbo Ratios : 124% 128% 132% 140%
CPU 0 summary every 1 seconds...

TIME      C0_MCYC      C0_ACYC     UTIL  RATIO  MHz
23:39:07  1618910294   89419923    64%   5%     138
23:39:08  1774059258   97132588    70%   5%     136
23:39:09  2476365498   130869241   99%   5%     132
^C
• Uses MSRs. Can also use PMCs for this.
• Also see turbostat.
https://github.com/brendangregg/msr-cloud-tools
27. Also: Static Performance Tuning Tools
28. Where do you start...and stop?
Workload Observability
Static Configuration
29. 2. Methodologies
30. Anti-Methodologies
• The lack of a deliberate methodology…
• Street Light Anti-Method:
– 1. Pick observability tools that are
• Familiar
• Found on the Internet
• Found at random
– 2. Run tools
– 3. Look for obvious issues
• Drunk Man Anti-Method:
– Tune things at random until the problem goes away
31. Methodologies
• Linux Performance Analysis in 60 seconds
• The USE method
• Workload characterization
• Many others:
  – Resource analysis
  – Workload analysis
  – Drill-down analysis
  – CPU profile method
  – Off-CPU analysis
  – Static performance tuning
  – 5 whys
  – …
32. Linux Perf Analysis in 60s
 1. uptime                load averages
 2. dmesg -T | tail       kernel errors
 3. vmstat 1              overall stats by time
 4. mpstat -P ALL 1       CPU balance
 5. pidstat 1             process usage
 6. iostat -xz 1          disk I/O
 7. free -m               memory usage
 8. sar -n DEV 1          network I/O
 9. sar -n TCP,ETCP 1     TCP stats
10. top                   check overview
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
33. USE Method
For every resource, check:
1. Utilization
2. Saturation
3. Errors
[Diagram: Resource → Utilization (%), Saturation, Errors (X)]

For example, CPUs:
- Utilization: time busy
- Saturation: run queue length or latency
- Errors: ECC errors, etc.

Start with the questions, then find the tools
Can be applied to hardware and software (cgroups)
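For example, one possible mapping of the CPU checks to tools covered earlier (a sketch; other tools work too):
  Utilization: mpstat -P ALL 1     # per-CPU %usr + %sys
  Saturation:  vmstat 1            # "r" column persistently > CPU count
  Errors:      EDAC counters under /sys/devices/system/edac/ (if enabled), or rasdaemon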
34. Workload Characterization
Analyze workload characteristics, not resulting performance
For example, CPUs:
1. Who: which PIDs, programs, users
2. Why: code paths, context
3. What: CPU instructions, cycles
4. How: changing over time
[Diagram: Workload → Target]
35. 3. Benchmarking
36. ~100% of benchmarks are wrong
The energy needed to refute benchmarks
is orders of magnitude bigger than
to run them (so, no one does)
37. Benchmarking
• An experimental analysis activity
– Try observational analysis first; benchmarks can perturb
• Benchmarking is error prone:
– Testing the wrong target
• eg, FS cache I/O instead of disk I/O
– Choosing the wrong target
• eg, disk I/O instead of FS cache I/O
– Invalid results
• eg, bugs
– Misleading results:
• you benchmark A,
but actually measure B,
and conclude you measured C
caution: benchmarking
38. Benchmark Examples
• Micro benchmarks:
– File system maximum cached read operations/sec
– Network maximum throughput
• Macro (application) benchmarks:
– Simulated application max request rate
• Bad benchmarks:
– getpid() in a tight loop
– Context switch timing
kitchen sink benchmarks
39. If your product’s chances of
winning a benchmark are
50/50, you’ll usually lose
Benchmark paradox
caution: despair
http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html
40. Solution: Active Benchmarking
• Root cause analysis while the benchmark runs
– Use the earlier observability tools
– Identify the limiter (or suspect) and include it with the results
• For any given benchmark, ask: why not 10x?
• This takes time, but uncovers most mistakes
41. 4. Profiling
42. Profiling
Can you do this?
“As an experiment to investigate the performance of the resulting TCP/IP
implementation ... the 11/750 is CPU saturated, but the 11/780 has about
30% idle time. The time spent in the system processing the data is spread
out among handling for the Ethernet (20%), IP packet processing (10%),
TCP processing (30%), checksumming (25%), and user system call
handling (15%), with no single part of the handling dominating the time in
the system.”
– Bill Joy, 1981, TCP-IP Digest, Vol 1 #6
https://www.rfc-editor.org/rfc/museum/tcp-ip-digest/tcp-ip-digest.v1n6.1
43. perf: CPU profiling
• Sampling full stack traces at 99 Hertz, for 30 secs:
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf report -n --stdio
 1.40%    162  java  [kernel.kallsyms]   [k] _raw_spin_lock
          |
          --- _raw_spin_lock
             |
             |--63.21%-- try_to_wake_up
             |          |
             |          |--63.91%-- default_wake_function
             |          |          |
             |          |          |--56.11%-- __wake_up_common
             |          |          |          __wake_up_locked
             |          |          |          ep_poll_callback
             |          |          |          __wake_up_common
             |          |          |          __wake_up_sync_key
             |          |          |          |
             |          |          |          |--59.19%-- sock_def_readable
[…78,000 lines truncated…]
44. Full "perf report" Output
45. … as a Flame Graph
46. Flame Graphs
• Visualizes a collection of stack traces
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random (default), or a dimension
• Perl + SVG + JavaScript
– https://github.com/brendangregg/FlameGraph
– Takes input from many different profilers
– Multiple d3 versions are being developed
• References:
– http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
– http://queue.acm.org/detail.cfm?id=2927301
– "The Flame Graph" CACM, June 2016
47. Linux CPU Flame Graphs
Linux 2.6+, via perf:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script --header > out.perf01        # this file can also be read using FlameScope
./stackcollapse-perf.pl < out.perf01 | ./flamegraph.pl > perf.svg

Linux 4.9+, via BPF:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
git clone --depth 1 https://github.com/iovisor/bcc
./bcc/tools/profile.py -dF 99 30 | ./FlameGraph/flamegraph.pl > perf.svg
– Most efficient: no perf.data file, summarizes in-kernel
48. FlameScope
● Analyze variance, perturbations
● https://github.com/Netflix/flamescope
[Screenshot: subsecond-offset heat map, with a selected time range shown as a flame graph]
49. perf: Counters
• Performance Monitoring Counters (PMCs):
$ perf list | grep -i hardware
  cpu-cycles OR cycles                               [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  instructions                                       [Hardware event]
[…]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
[…]
  rNNN (see 'perf list --help' on how to encode it)  [Raw hardware event …]
  mem:<addr>[:access]                                [Hardware breakpoint]
• Measure instructions-per-cycle (IPC) and CPU stall types
• PMCs only enabled for some cloud instance types
My front-ends, incl. pmcarch:
https://github.com/brendangregg/pmc-cloud-tools
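For example, system-wide IPC over 10 seconds (a minimal example; requires PMC access, which some cloud instance types don't expose):
  # perf stat -e cycles,instructions -a -- sleep 10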
50. 5. Tracing
51. Linux Tracing Events
52. Tracing Stack
add-on tools:             trace-cmd, perf-tools, bcc, bpftrace
front-end tools:          perf
tracing frameworks:       Ftrace, perf_events, BPF
back-end instrumentation: tracepoints, kprobes, uprobes

BPF enables a new class of custom, efficient, and production-safe
performance analysis tools in Linux
53. Ftrace: perf-tools funccount
• Built-in kernel tracing capabilities, added by Steven
Rostedt and others since Linux 2.6.27
# ./funccount -i 1 'bio_*'
Tracing "bio_*"... Ctrl-C to end.

FUNC                          COUNT
[...]
bio_alloc_bioset                536
bio_endio                       536
bio_free                        536
bio_fs_destructor               536
bio_init                        536
bio_integrity_enabled           536
bio_put                         729
bio_add_page                   1004

• Also see trace-cmd
54. perf: Tracing Tracepoints
● perf was introduced earlier; it is also a powerful tracer

In-kernel counts (efficient):
# perf stat -e block:block_rq_complete -a sleep 10
 Performance counter stats for 'system wide':
                91      block:block_rq_complete

Dump & post-process:
# perf record -e block:block_rq_complete -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ]
# perf script
      run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]
      run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0]
  swapper     0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0]
  swapper     0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0]
supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
[...]

http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Main_Page
55. BCC/BPF: ext4slower
• ext4 operations slower than the threshold:
# ./ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
06:49:17 bash           3616   R 128          0      7.75 cksum
06:49:17 cksum          3616   R 39552        0      1.34 [
06:49:17 cksum          3616   R 96           0      5.36 2to3-2.7
06:49:17 cksum          3616   R 96           0     14.94 2to3-3.4
06:49:17 cksum          3616   R 10320        0      6.82 411toppm
06:49:17 cksum          3616   R 65536        0      4.01 a2p
06:49:17 cksum          3616   R 55400        0      8.77 ab
06:49:17 cksum          3616   R 36792        0     16.34 aclocal-1.14
[…]
• Better indicator of application pain than disk I/O
• Measures & filters in-kernel for efficiency using BPF
https://github.com/iovisor/bcc
56. bpftrace: one-liners
• Block I/O (disk) events by type; by size & comm:
# bpftrace -e 't:block:block_rq_issue { @[args->rwbs] = count(); }'
Attaching 1 probe...
^C
@[WS]: 2
@[RM]: 12
@[RA]: 1609
@[R]: 86421
# bpftrace -e 't:block:block_rq_issue { @bytes[comm] = hist(args->bytes); }'
Attaching 1 probe...
^C
@bytes[dmcrypt_write]:
[4K, 8K)      68 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)     35 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[16K, 32K)     4 |@@@                                                 |
[32K, 64K)     1 |                                                    |
[64K, 128K)    2 |@                                                   |
[...]

https://github.com/iovisor/bpftrace
57. BPF Perf Tools (2019)
The BCC & bpftrace repos contain many of these. The book has them all.
58. Off-CPU Analysis
• Explain all blocking events. High-overhead: needs BPF.
[Off-CPU time flame graph, with annotated columns: directory read from disk, file read
from disk, fstat from disk, path read from disk, pipe write]
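• One way to generate an off-CPU flame graph like this (a sketch, assuming bcc-tools and the FlameGraph repo from the earlier slide; overhead can be high on busy systems):
  # /usr/share/bcc/tools/offcputime -df -p $(pgrep -nx java) 30 > out.offcpu01
  # ./FlameGraph/flamegraph.pl --color=io --countname=us < out.offcpu01 > offcpu.svg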
59. 6. Tuning
60. Ubuntu Bionic Tuning: Late 2019 (1/2)
• CPU
    schedtool -B PID
    disable Ubuntu apport (crash reporter)
    upgrade to Bionic (scheduling improvements)
• Virtual Memory
    vm.swappiness = 0                       # from 60
• Memory
    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
    kernel.numa_balancing = 0
• File System
    vm.dirty_ratio = 80                     # from 40
    vm.dirty_background_ratio = 5           # from 10
    vm.dirty_expire_centisecs = 12000       # from 3000
    mount -o defaults,noatime,discard,nobarrier …
• Storage I/O
    /sys/block/*/queue/rq_affinity      1   # or 2
    /sys/block/*/queue/scheduler        kyber
    /sys/block/*/queue/nr_requests      256
    /sys/block/*/queue/read_ahead_kb    128
    mdadm --chunk=64 …
61. Ubuntu Bionic Tuning: Late 2019 (2/2)
• Networking
    net.core.default_qdisc = fq
    net.core.netdev_max_backlog = 5000
    net.core.rmem_max = 16777216
    net.core.somaxconn = 1024
    net.core.wmem_max = 16777216
    net.ipv4.ip_local_port_range = 10240 65535
    net.ipv4.tcp_abort_on_overflow = 1            # maybe
    net.ipv4.tcp_congestion_control = bbr
    net.ipv4.tcp_max_syn_backlog = 8192
    net.ipv4.tcp_rmem = 4096 12582912 16777216    # or 8388608 ...
    net.ipv4.tcp_slow_start_after_idle = 0
    net.ipv4.tcp_syn_retries = 2
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_wmem = 4096 12582912 16777216    # or 8388608 ...
• Hypervisor
    echo tsc > /sys/devices/…/current_clocksource
    Plus use AWS Nitro
• Other
    net.core.bpf_jit_enable = 1
    sysctl -w kernel.perf_event_max_stack=1000
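• One way to apply and persist the sysctl settings above (a sketch; the value is the example from these slides, and /etc/sysctl.d/ is the standard drop-in directory):
    # sysctl -w net.ipv4.tcp_congestion_control=bbr                               # apply now
    # echo 'net.ipv4.tcp_congestion_control = bbr' >> /etc/sysctl.d/99-tuning.conf
    # sysctl -p /etc/sysctl.d/99-tuning.conf                                      # reload from file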
62. Takeaways
Systems Performance is:
Observability, Methodologies, Benchmarking, Profiling, Tracing, Tuning
Print out for your office wall:
1. uptime
2. dmesg -T | tail
3. vmstat 1
4. mpstat -P ALL 1
5. pidstat 1
6. iostat -xz 1
7. free -m
8. sar -n DEV 1
9. sar -n TCP,ETCP 1
10. top
63. Links
Netflix Tech Blog on Linux:
● http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
● http://techblog.netflix.com/2015/08/netflix-at-velocity-2015-linux.html
Linux Performance:
● http://www.brendangregg.com/linuxperf.html
Linux perf:
● https://perf.wiki.kernel.org/index.php/Main_Page
● http://www.brendangregg.com/perf.html
Linux ftrace:
● https://www.kernel.org/doc/Documentation/trace/ftrace.txt
● https://github.com/brendangregg/perf-tools
Linux BPF:
● http://www.brendangregg.com/ebpf.html
● http://www.brendangregg.com/bpf-performance-tools-book.html
● https://github.com/iovisor/bcc
● https://github.com/iovisor/bpftrace
Methodologies:
● http://www.brendangregg.com/USEmethod/use-linux.html
● http://www.brendangregg.com/activebenchmarking.html
Flame Graphs & FlameScope:
● http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
● http://queue.acm.org/detail.cfm?id=2927301
● https://github.com/Netflix/flamescope
MSRs and PMCs:
● https://github.com/brendangregg/msr-cloud-tools
● https://github.com/brendangregg/pmc-cloud-tools
BPF Performance Tools:
● http://www.brendangregg.com/bpf-performance-tools-book.html
64. Thanks
• Questions?
• http://slideshare.net/brendangregg
• http://www.brendangregg.com
• bgregg@netflix.com
• @brendangregg
Look out for the 2nd Ed.
USENIX LISA 2019, Portland, Oct 28-30