LISA2019 Linux Systems Performance
1. Oct, 2019
Linux Systems
Performance
Brendan Gregg
Senior Performance Engineer
USENIX LISA 2019, Portland, Oct 28-30
2. Experience: A 3x Perf Difference
3. mpstat
load averages: serverA 90, serverB 17
serverA# mpstat 10
Linux 4.4.0-130-generic (serverA) 07/18/2019 _x86_64_ (48 CPU)
10:07:55 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
10:08:05 PM  all  89.72   0.00   7.84    0.00   0.00   0.04    0.00    0.00    0.00   2.40
10:08:15 PM  all  88.60   0.00   9.18    0.00   0.00   0.05    0.00    0.00    0.00   2.17
10:08:25 PM  all  89.71   0.00   9.01    0.00   0.00   0.05    0.00    0.00    0.00   1.23
[...]
Average:     all  89.49   0.00   8.47    0.00   0.00   0.05    0.00    0.00    0.00   1.99
serverB# mpstat 10
Linux 4.19.26-nflx (serverB) 07/18/2019 _x86_64_ (64 CPU)
09:56:11 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft  %steal  %guest  %gnice  %idle
09:56:21 PM  all  23.21   0.01   0.32    0.00   0.00   0.10    0.00    0.00    0.00  76.37
09:56:31 PM  all  20.21   0.00   0.38    0.00   0.00   0.08    0.00    0.00    0.00  79.33
09:56:41 PM  all  21.58   0.00   0.39    0.00   0.00   0.10    0.00    0.00    0.00  77.92
[...]
Average:     all  21.50   0.00   0.36    0.00   0.00   0.09    0.00    0.00    0.00  78.04
4. pmcarch
serverA# ./pmcarch -p 4093 10
K_CYCLES   K_INSTR    IPC  BR_RETIRED    BR_MISPRED  BMR% LLCREF       LLCMISS      LLC%
982412660  575706336  0.59 126424862460  2416880487  1.91 15724006692  10872315070  30.86
999621309  555043627  0.56 120449284756  2317302514  1.92 15378257714  11121882510  27.68
991146940  558145849  0.56 126350181501  2530383860  2.00 15965082710  11464682655  28.19
996314688  562276830  0.56 122215605985  2348638980  1.92 15558286345  10835594199  30.35
979890037  560268707  0.57 125609807909  2386085660  1.90 15828820588  11038597030  30.26
^C

serverB# ./pmcarch -p 1928219 10
K_CYCLES   K_INSTR    IPC  BR_RETIRED    BR_MISPRED  BMR% LLCREF      LLCMISS     LLC%
147523816  222396364  1.51 46053921119   641813770   1.39 8880477235  968809014   89.09
156634810  229801807  1.47 48236123575   653064504   1.35 9186609260  1183858023  87.11
152783226  237001219  1.55 49344315621   692819230   1.40 9314992450  879494418   90.56
140787179  213570329  1.52 44518363978   631588112   1.42 8675999448  712318917   91.79
136822760  219706637  1.61 45129020910   651436401   1.44 8689831639  617678747   92.89
5. perf
serverA# perf stat -e cs -a -I 1000
#           time             counts unit events
     1.000411740          2,063,105      cs
     2.000977435          2,065,354      cs
     3.001537756          1,527,297      cs
     4.002028407            515,509      cs
     5.002538455          2,447,126      cs
[...]

serverB# perf stat -e cs -p 1928219 -I 1000
#           time             counts unit events
     1.001931945              1,172      cs
     2.002664012              1,370      cs
     3.003441563              1,034      cs
     4.004140394              1,207      cs
     5.004947675              1,053      cs
[...]
6. bcc/BPF
serverA# /usr/share/bcc/tools/cpudist -p 4093 10 1
Tracing on-CPU time... Hit Ctrl-C to end.
     usecs               : count     distribution
         0 -> 1          : 3618650   |****************************************|
         2 -> 3          : 2704935   |*****************************           |
         4 -> 7          : 421179    |****                                    |
         8 -> 15         : 99416     |*                                       |
        16 -> 31         : 16951     |                                        |
        32 -> 63         : 6355      |                                        |
[...]
serverB# /usr/share/bcc/tools/cpudist -p 1928219 10 1
Tracing on-CPU time... Hit Ctrl-C to end.
     usecs               : count     distribution
       256 -> 511        : 44        |                                        |
       512 -> 1023       : 156       |*                                       |
      1024 -> 2047       : 238       |**                                      |
      2048 -> 4095       : 4511      |****************************************|
      4096 -> 8191       : 277       |**                                      |
      8192 -> 16383      : 286       |**                                      |
     16384 -> 32767      : 77        |                                        |
[...]
7. Systems Performance in 45 mins
• This is slides + discussion
• For more detail and stand-alone texts:
8. Agenda
1. Observability
2. Methodologies
3. Benchmarking
4. Profiling
5. Tracing
6. Tuning
10. 1. Observability
11. How do you measure these?
12. Linux Observability Tools
13. Why Learn Tools?
• Most analysis at Netflix is via GUIs
• Benefits of command-line tools:
– Helps you understand GUIs: they show the same metrics
– Often documented, unlike GUI metrics
– Often have useful options not exposed in GUIs
• Installing essential tools (something like):
$ sudo apt-get install sysstat bcc-tools bpftrace linux-tools-common \
linux-tools-$(uname -r) iproute2 msr-tools
$ git clone https://github.com/brendangregg/msr-cloud-tools
$ git clone https://github.com/brendangregg/bpf-perf-tools-book
These are crisis tools and should be installed by default
In a performance meltdown you may be unable to install them
14. uptime
• One way to print load averages:
$ uptime
 07:42:06 up  8:16,  1 user,  load average: 2.27, 2.84, 2.91
• A measure of resource demand: CPUs + disks
– Includes TASK_UNINTERRUPTIBLE state to show all demand types
– You can use BPF & off-CPU flame graphs to explain this state:
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
– PSI in Linux 4.20 shows CPU, I/O, and memory loads
• Exponentially-damped moving averages
– With time constants of 1, 5, and 15 minutes. See historic trend.
• Load > # of CPUs, may mean CPU saturation
Don’t spend more than 5 seconds studying these
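• The PSI pressure files mentioned above can be read directly; a quick check, assuming a 4.20+ kernel built with PSI (CONFIG_PSI):
  $ cat /proc/pressure/cpu       # avg10/avg60/avg300 = % of time tasks were stalled
  $ cat /proc/pressure/io
  $ cat /proc/pressure/memory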
15. top
• System and per-process interval summary:
$ top - 18:50:26 up 7:43,  1 user,  load average: 4.11, 4.91, 5.22
Tasks: 209 total,   1 running, 206 sleeping,   0 stopped,   2 zombie
Cpu(s): 47.1%us,  4.0%sy,  0.0%ni, 48.4%id,  0.0%wa,  0.0%hi,  0.3%si,  0.2%st
Mem:  70197156k total, 44831072k used, 25366084k free,    36360k buffers
Swap:        0k total,        0k used,        0k free, 11873356k cached

  PID USER     PR  NI  VIRT   RES   SHR S %CPU %MEM    TIME+  COMMAND
 5738 apiprod  20   0 62.6g   29g  352m S  417 44.2  2144:15  java
 1386 apiprod  20   0 17452  1388   964 R    0  0.0  0:00.02  top
    1 root     20   0 24340  2272  1340 S    0  0.0  0:01.51  init
    2 root     20   0     0     0     0 S    0  0.0  0:00.00  kthreadd
[…]
• %CPU is summed across all CPUs
• Can miss short-lived processes (atop won’t)
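• One way to catch those short-lived processes (a sketch, assuming bcc-tools is installed as shown earlier):
  # /usr/share/bcc/tools/execsnoop    # trace new process execution (execve)
  # atop 1                            # interval summaries that include exited processes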
16. htop
$ htop
  1 [||||||||||70.0%]  13 [||||||||||70.6%]  25 [||||||||||69.7%]  37 [||||||||||66.6%]
  2 [||||||||||68.7%]  14 [||||||||||69.4%]  26 [||||||||||67.7%]  38 [||||||||||66.0%]
  3 [||||||||||68.2%]  15 [||||||||||68.5%]  27 [||||||||||68.8%]  39 [||||||||||73.3%]
  4 [||||||||||69.3%]  16 [||||||||||69.2%]  28 [||||||||||67.6%]  40 [||||||||||67.0%]
  5 [||||||||||68.0%]  17 [||||||||||67.6%]  29 [||||||||||70.1%]  41 [||||||||||66.5%]
  […]
  Mem[||||||||||||||||||||||||||||||176G/187G]  Tasks: 80, 3206 thr; 43 running
  Swp[                                  0K/0K]  Load average: 36.95 37.19 38.29
                                                Uptime: 01:39:36

  PID USER      PRI NI VIRT  RES   SHR   S CPU%  MEM%  TIME+     Command
 4067 www-data   20  0 202G  173G  55392 S 3359  93.0  48h51:30  /apps/java/bin/java -Dnop -Djdk.map
 6817 www-data   20  0 202G  173G  55392 R 56.9  93.0  48:37.89  /apps/java/bin/java -Dnop -Djdk.map
 6826 www-data   20  0 202G  173G  55392 R 25.7  93.0  22:26.90  /apps/java/bin/java -Dnop -Djdk.map
 6721 www-data   20  0 202G  173G  55392 S 25.0  93.0  22:05.51  /apps/java/bin/java -Dnop -Djdk.map
 6616 www-data   20  0 202G  173G  55392 S 13.6  93.0  11:15.51  /apps/java/bin/java -Dnop -Djdk.map
[…]
F1Help  F2Setup  F3Search  F4Filter  F5Tree  F6SortBy  F7Nice-  F8Nice+  F9Kill  F10Quit

• Pros: configurable. Cons: misleading colors.
• dstat is similar, and now dead (May 2019); see pcp-dstat
17. vmstat
• Virtual memory statistics and more:
$ vmstat -Sm 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 8  0      0   1620    149    552    0    0     1   179   77   12 25 34  0  0
 7  0      0   1598    149    552    0    0     0     0  205  186 46 13  0  0
 8  0      0   1617    149    552    0    0     0     8  210  435 39 21  0  0
 8  0      0   1589    149    552    0    0     0     0  218  219 42 17  0  0
[…]
• USAGE: vmstat [interval [count]]
• First output line has some summary since boot values
• High level CPU summary
– “r” is runnable tasks
18. iostat
• Block I/O (disk) stats. 1st output is since boot.
$ iostat -xz 1
Linux 5.0.21 (c099.xxxx)   06/24/19   _x86_64_   (32 CPU)
[...]
Device        r/s      w/s      rkB/s      wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm ...
sda           0.01     0.00      0.16       0.00    0.00    0.00   0.00   0.00 ...
nvme3n1   19528.04    20.39 293152.56   14758.05    0.00    4.72   0.00  18.81 ...
nvme1n1   18513.51    17.83 286402.15   13089.56    0.00    4.05   0.00  18.52 ...
nvme0n1   16560.88    19.70 258184.52   14218.55    0.00    4.78   0.00  19.51 ...

... r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
...    1.90     0.00    0.00     17.01      0.00   1.13   0.00
...    0.13    53.56    1.05     15.01    723.80   0.02  47.29
...    0.13    49.26    0.85     15.47    734.21   0.03  48.09
...    0.13    50.46    0.96     15.59    721.65   0.03  46.64

(Wide output wraps: the left columns characterize the workload — a very useful set of
stats — and the right columns show the resulting performance.)
19. free
• Main memory usage:
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          23850       18248         592        3776        5008        1432
Swap:         31699        2021       29678

• Recently added "available" column
  – buff/cache: block device I/O cache + virtual page cache
  – available: memory likely available to apps
  – free: completely unused memory
20. strace
• System call tracer:
$ strace -tttT -p 313
1408393285.779746 getgroups(0, NULL)            = 1 <0.000016>
1408393285.779873 getgroups(1, [0])             = 1 <0.000015>
1408393285.780797 close(3)                      = 0 <0.000016>
1408393285.781338 write(1, "wow much syscall\n", 17wow much syscall
) = 17 <0.000048>
• Translates syscall arguments
• Not all kernel requests (e.g., page faults)
• Currently has massive overhead (ptrace based)
– Can slow the target by > 100x. Skews measured time (-ttt, -T).
–
http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html
• perf trace will replace it: uses a ring buffer & BPF
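• For example, a lower-overhead look at the same process (a sketch; perf trace options vary by perf version):
  # perf trace -p 313           # trace syscalls of PID 313; Ctrl-C to end
  # perf trace -e 'open*' ls    # run a command, tracing only open-family syscalls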
21. tcpdump
• Sniff network packets for post analysis:
$ tcpdump -i eth0 -w /tmp/out.tcpdump
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
^C7985 packets captured
8996 packets received by filter
1010 packets dropped by kernel
# tcpdump -nr /tmp/out.tcpdump | head
reading from file /tmp/out.tcpdump, link-type EN10MB (Ethernet)
20:41:05.038437 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 18...
20:41:05.038533 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 48...
20:41:05.038584 IP 10.44.107.151.22 > 10.53.237.72.46425: Flags [P.], seq 96...
[…]
• Study packet sequences with timestamps (us)
• CPU overhead optimized (socket ring buffers), but can
still be significant. Use BPF in-kernel summaries
instead.
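• Some in-kernel alternatives from bcc (a sketch, assuming bcc-tools is installed; these summarize in kernel context instead of capturing every packet):
  # /usr/share/bcc/tools/tcplife      # log TCP sessions with duration and throughput
  # /usr/share/bcc/tools/tcpretrans   # trace TCP retransmits as they happen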
22. nstat
• Replacement for netstat from iproute2
• Various network protocol statistics:
  – -s won’t reset counters, otherwise intervals can be examined (see the example below)
  – -d for daemon mode
• Linux keeps adding more counters

$ nstat -s
#kernel
IpInReceives                    31109659           0.0
IpInDelivers                    31109371           0.0
IpOutRequests                   33209552           0.0
[...]
TcpActiveOpens                  508924             0.0
TcpPassiveOpens                 388584             0.0
TcpAttemptFails                 933                0.0
TcpEstabResets                  1545               0.0
TcpInSegs                       31099176           0.0
TcpOutSegs                      56254112           0.0
TcpRetransSegs                  3762               0.0
TcpOutRsts                      3183               0.0
[...]
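• For example, examining per-interval deltas (a minimal sketch; nstat keeps its own history, so running it without -s shows the change since the previous run):
  # nstat > /dev/null                          # establish a baseline
  # sleep 10; nstat TcpRetransSegs TcpOutRsts  # deltas for the last 10 seconds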
23. slabtop
• Kernel slab allocator memory usage:
$ slabtop
 Active / Total Objects (% used)    : 4692768 / 4751161 (98.8%)
 Active / Total Slabs (% used)      : 129083 / 129083 (100.0%)
 Active / Total Caches (% used)     : 71 / 109 (65.1%)
 Active / Total Size (% used)       : 729966.22K / 738277.47K (98.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.16K / 8.00K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
3565575 3565575 100%    0.10K  91425       39    365700K buffer_head
 314916  314066  99%    0.19K  14996       21     59984K dentry
 184192  183751  99%    0.06K   2878       64     11512K kmalloc-64
 138618  138618 100%    0.94K   4077       34    130464K xfs_inode
 138602  138602 100%    0.21K   3746       37     29968K xfs_ili
 102116   99012  96%    0.55K   3647       28     58352K radix_tree_node
  97482   49093  50%    0.09K   2321       42      9284K kmalloc-96
  22695   20777  91%    0.05K    267       85      1068K shared_policy_node
  21312   21312 100%    0.86K    576       37     18432K ext4_inode_cache
  16288   14601  89%    0.25K    509       32      4072K kmalloc-256
[…]
24. pcstat
• Show page cache residency by file:
# ./pcstat data0*
|----------+----------------+------------+-----------+---------|
| Name     | Size           | Pages      | Cached    | Percent |
|----------+----------------+------------+-----------+---------|
| data00   | 104857600      | 25600      | 25600     | 100.000 |
| data01   | 104857600      | 25600      | 25600     | 100.000 |
| data02   | 104857600      | 25600      | 4080      | 015.938 |
| data03   | 104857600      | 25600      | 25600     | 100.000 |
| data04   | 104857600      | 25600      | 16010     | 062.539 |
| data05   | 104857600      | 25600      | 0         | 000.000 |
|----------+----------------+------------+-----------+---------|
• Uses mincore(2) syscall. Used for database perf analysis.
25. docker stats
• Soft limits (cgroups) by container:
# docker stats
CONTAINER     CPU %    MEM USAGE / LIMIT      MEM %   NET I/O    BLOCK I/O        PIDS
353426a09db1  526.81%  4.061 GiB / 8.5 GiB    47.78%  0 B / 0 B  2.818 MB / 0 B   247
6bf166a66e08  303.82%  3.448 GiB / 8.5 GiB    40.57%  0 B / 0 B  2.032 MB / 0 B   267
58dcf8aed0a7  41.01%   1.322 GiB / 2.5 GiB    52.89%  0 B / 0 B  0 B / 0 B        229
61061566ffe5  85.92%   220.9 MiB / 3.023 GiB  7.14%   0 B / 0 B  43.4 MB / 0 B    61
bdc721460293  2.69%    1.204 GiB / 3.906 GiB  30.82%  0 B / 0 B  4.35 MB / 0 B    66
6c80ed61ae63  477.45%  557.7 MiB / 8 GiB      6.81%   0 B / 0 B  9.257 MB / 0 B   19
337292fb5b64  89.05%   766.2 MiB / 8 GiB      9.35%   0 B / 0 B  5.493 MB / 0 B   19
b652ede9a605  173.50%  689.2 MiB / 8 GiB      8.41%   0 B / 0 B  6.48 MB / 0 B    19
d7cd2599291f  504.28%  673.2 MiB / 8 GiB      8.22%   0 B / 0 B  12.58 MB / 0 B   19
05bf9f3e0d13  314.46%  711.6 MiB / 8 GiB      8.69%   0 B / 0 B  7.942 MB / 0 B   19
09082f005755  142.04%  693.9 MiB / 8 GiB      8.47%   0 B / 0 B  8.081 MB / 0 B   19
[...]
• Stats are in /sys/fs/cgroup
• CPU shares and bursting breaks monitoring assumptions
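• For example, reading the underlying counters directly (a hedged sketch; the exact path depends on cgroup v1 vs v2 and the Docker cgroup driver):
  # cat /sys/fs/cgroup/cpuacct/docker/<container-id>/cpuacct.usage        # cgroup v1: total CPU time (ns)
  # cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/cpu.stat  # cgroup v2, systemd driver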
26. showboost
• Determine current CPU clock rate
# showboost
Base CPU MHz : 2500
Set CPU MHz  : 2500
Turbo MHz(s) : 3100 3200 3300 3500
Turbo Ratios : 124% 128% 132% 140%
CPU 0 summary every 1 seconds...

TIME      C0_MCYC      C0_ACYC     UTIL  RATIO  MHz
23:39:07  1618910294   89419923    64%   5%     138
23:39:08  1774059258   97132588    70%   5%     136
23:39:09  2476365498   130869241   99%   5%     132
^C
• Uses MSRs. Can also use PMCs for this.
• Also see turbostat.
https://github.com/brendangregg/msr-cloud-tools
27. Also: Static Performance Tuning Tools
28. Where do you start...and stop?
Workload Observability
Static Configuration
29. 2. Methodologies
30. Anti-Methodologies
• The lack of a deliberate methodology…
• Street Light Anti-Method:
– 1. Pick observability tools that are
• Familiar
• Found on the Internet
• Found at random
– 2. Run tools
– 3. Look for obvious issues
• Drunk Man Anti-Method:
– Tune things at random until the problem goes away
31. Methodologies
• Linux Performance Analysis in 60 seconds
• The USE method
• Workload characterization
• Many others:
  – Resource analysis
  – Workload analysis
  – Drill-down analysis
  – CPU profile method
  – Off-CPU analysis
  – Static performance tuning
  – 5 whys
  – …
32. Linux Perf Analysis in 60s
 1. uptime                load averages
 2. dmesg -T | tail       kernel errors
 3. vmstat 1              overall stats by time
 4. mpstat -P ALL 1       CPU balance
 5. pidstat 1             process usage
 6. iostat -xz 1          disk I/O
 7. free -m               memory usage
 8. sar -n DEV 1          network I/O
 9. sar -n TCP,ETCP 1     TCP stats
10. top                   check overview
http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
33. USE Method
For every resource, check:
1. Utilization
2. Saturation
3. Errors
[Diagram: Resource → Utilization (%), Saturation, Errors (X)]

For example, CPUs:
- Utilization: time busy
- Saturation: run queue length or latency
- Errors: ECC errors, etc.

Start with the questions, then find the tools
Can be applied to hardware and software (cgroups)
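For example, one possible mapping of the CPU checks to tools covered earlier (a sketch; other tools work too):
  Utilization: mpstat -P ALL 1     # per-CPU %usr + %sys
  Saturation:  vmstat 1            # "r" column persistently > CPU count
  Errors:      EDAC counters under /sys/devices/system/edac/ (if enabled), or rasdaemon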
34. Workload Characterization
Analyze workload characteristics, not resulting performance
For example, CPUs:
1. Who: which PIDs, programs, users
2. Why: code paths, context
3. What: CPU instructions, cycles
4. How: changing over time
[Diagram: Workload → Target]
35. 3. Benchmarking
36. ~100% of benchmarks are wrong
The energy needed to refute benchmarks
is orders of magnitude bigger than
to run them (so, no one does)
37. Benchmarking
• An experimental analysis activity
– Try observational analysis first; benchmarks can perturb
• Benchmarking is error prone:
– Testing the wrong target
• eg, FS cache I/O instead of disk I/O
– Choosing the wrong target
• eg, disk I/O instead of FS cache I/O
– Invalid results
• eg, bugs
– Misleading results:
• you benchmark A,
but actually measure B,
and conclude you measured C
caution: benchmarking
38. Benchmark Examples
• Micro benchmarks:
– File system maximum cached read operations/sec
– Network maximum throughput
• Macro (application) benchmarks:
– Simulated application max request rate
• Bad benchmarks:
– getpid() in a tight loop
– Context switch timing
kitchen sink benchmarks
39. If your product’s chances of
winning a benchmark are
50/50, you’ll usually lose
Benchmark paradox
caution: despair
http://www.brendangregg.com/blog/2014-05-03/the-benchmark-paradox.html
40. Solution: Active Benchmarking
• Root cause analysis while the benchmark runs
– Use the earlier observability tools
– Identify the limiter (or suspect) and include it with the results
• For any given benchmark, ask: why not 10x?
• This takes time, but uncovers most mistakes
41. 4. Profiling
42. Profiling
Can you do this?
“As an experiment to investigate the performance of the resulting TCP/IP
implementation ... the 11/750 is CPU saturated, but the 11/780 has about
30% idle time. The time spent in the system processing the data is spread
out among handling for the Ethernet (20%), IP packet processing (10%),
TCP processing (30%), checksumming (25%), and user system call
handling (15%), with no single part of the handling dominating the time in
the system.”
– Bill Joy, 1981, TCP-IP Digest, Vol 1 #6
https://www.rfc-editor.org/rfc/museum/tcp-ip-digest/tcp-ip-digest.v1n6.1
43. perf: CPU profiling
• Sampling full stack traces at 99 Hertz, for 30 secs:
# perf record -F 99 -ag -- sleep 30
[ perf record: Woken up 9 times to write data ]
[ perf record: Captured and wrote 2.745 MB perf.data (~119930 samples) ]
# perf report -n --stdio
 1.40%    162  java  [kernel.kallsyms]   [k] _raw_spin_lock
          |
          --- _raw_spin_lock
             |
             |--63.21%-- try_to_wake_up
             |          |
             |          |--63.91%-- default_wake_function
             |          |          |
             |          |          |--56.11%-- __wake_up_common
             |          |          |          __wake_up_locked
             |          |          |          ep_poll_callback
             |          |          |          __wake_up_common
             |          |          |          __wake_up_sync_key
             |          |          |          |
             |          |          |          |--59.19%-- sock_def_readable
[…78,000 lines truncated…]
44. Full "perf report" Output
45. … as a Flame Graph
46. Flame Graphs
• Visualizes a collection of stack traces
– x-axis: alphabetical stack sort, to maximize merging
– y-axis: stack depth
– color: random (default), or a dimension
• Perl + SVG + JavaScript
– https://github.com/brendangregg/FlameGraph
– Takes input from many different profilers
– Multiple d3 versions are being developed
• References:
– http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
– http://queue.acm.org/detail.cfm?id=2927301
– "The Flame Graph" CACM, June 2016
47. Linux CPU Flame Graphs
Linux 2.6+, via perf:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph
perf record -F 99 -a -g -- sleep 30
perf script --header > out.perf01        # this file can also be read using FlameScope
./stackcollapse-perf.pl < out.perf01 | ./flamegraph.pl > perf.svg

Linux 4.9+, via BPF:
git clone --depth 1 https://github.com/brendangregg/FlameGraph
git clone --depth 1 https://github.com/iovisor/bcc
./bcc/tools/profile.py -dF 99 30 | ./FlameGraph/flamegraph.pl > perf.svg
– Most efficient: no perf.data file, summarizes in-kernel
48. FlameScope
● Analyze variance, perturbations
● https://github.com/Netflix/flamescope
[Screenshot: subsecond-offset heat map, with a selected time range shown as a flame graph]
49. perf: Counters
• Performance Monitoring Counters (PMCs):
$ perf list | grep -i hardware
  cpu-cycles OR cycles                               [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  instructions                                       [Hardware event]
[…]
  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
[…]
  rNNN (see 'perf list --help' on how to encode it)  [Raw hardware event …]
  mem:<addr>[:access]                                [Hardware breakpoint]
• Measure instructions-per-cycle (IPC) and CPU stall types
• PMCs only enabled for some cloud instance types
My front-ends, incl. pmcarch:
https://github.com/brendangregg/pmc-cloud-tools
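For example, system-wide IPC over 10 seconds (a minimal example; requires PMC access, which some cloud instance types don't expose):
  # perf stat -e cycles,instructions -a -- sleep 10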
50. 5. Tracing
51. Linux Tracing Events
52. Tracing Stack
add-on tools:             trace-cmd, perf-tools, bcc, bpftrace
front-end tools:          perf
tracing frameworks:       Ftrace, perf_events, BPF
back-end instrumentation: tracepoints, kprobes, uprobes

BPF enables a new class of custom, efficient, and production-safe
performance analysis tools in Linux
53. Ftrace: perf-tools funccount
• Built-in kernel tracing capabilities, added by Steven
Rostedt and others since Linux 2.6.27
# ./funccount -i 1 'bio_*'
Tracing "bio_*"... Ctrl-C to end.

FUNC                          COUNT
[...]
bio_alloc_bioset                536
bio_endio                       536
bio_free                        536
bio_fs_destructor               536
bio_init                        536
bio_integrity_enabled           536
bio_put                         729
bio_add_page                   1004

• Also see trace-cmd
54. perf: Tracing Tracepoints
● perf was introduced earlier; it is also a powerful tracer

In-kernel counts (efficient):
# perf stat -e block:block_rq_complete -a sleep 10
 Performance counter stats for 'system wide':
                91      block:block_rq_complete

Dump & post-process:
# perf record -e block:block_rq_complete -a sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.428 MB perf.data (~18687 samples) ]
# perf script
      run 30339 [000] 2083345.722857: block:block_rq_complete: 202,1 W () 12986336 + 8 [0]
      run 30339 [000] 2083345.723180: block:block_rq_complete: 202,1 W () 12986528 + 8 [0]
  swapper     0 [000] 2083345.723489: block:block_rq_complete: 202,1 W () 12986496 + 8 [0]
  swapper     0 [000] 2083346.745840: block:block_rq_complete: 202,1 WS () 1052984 + 144 [0]
supervise 30342 [000] 2083346.746571: block:block_rq_complete: 202,1 WS () 1053128 + 8 [0]
[...]

http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Main_Page
55. BCC/BPF: ext4slower
• ext4 operations slower than the threshold:
# ./ext4slower 1
Tracing ext4 operations slower than 1 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
06:49:17 bash           3616   R 128          0      7.75 cksum
06:49:17 cksum          3616   R 39552        0      1.34 [
06:49:17 cksum          3616   R 96           0      5.36 2to3-2.7
06:49:17 cksum          3616   R 96           0     14.94 2to3-3.4
06:49:17 cksum          3616   R 10320        0      6.82 411toppm
06:49:17 cksum          3616   R 65536        0      4.01 a2p
06:49:17 cksum          3616   R 55400        0      8.77 ab
06:49:17 cksum          3616   R 36792        0     16.34 aclocal-1.14
[…]
• Better indicator of application pain than disk I/O
• Measures & filters in-kernel for efficiency using BPF
https://github.com/iovisor/bcc
56. bpftrace: one-liners
• Block I/O (disk) events by type; by size & comm:
# bpftrace -e 't:block:block_rq_issue { @[args->rwbs] = count(); }'
Attaching 1 probe...
^C
@[WS]: 2
@[RM]: 12
@[RA]: 1609
@[R]: 86421
# bpftrace -e 't:block:block_rq_issue { @bytes[comm] = hist(args->bytes); }'
Attaching 1 probe...
^C
@bytes[dmcrypt_write]:
[4K, 8K)      68 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)     35 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[16K, 32K)     4 |@@@                                                 |
[32K, 64K)     1 |                                                    |
[64K, 128K)    2 |@                                                   |
[...]

https://github.com/iovisor/bpftrace
57. BPF Perf Tools (2019)
The BCC & bpftrace repos contain many of these. The book has them all.
58. Off-CPU Analysis
• Explain all blocking events. High-overhead: needs BPF.
[Off-CPU time flame graph, with annotated columns: directory read from disk, file read
from disk, fstat from disk, path read from disk, pipe write]
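• One way to generate an off-CPU flame graph like this (a sketch, assuming bcc-tools and the FlameGraph repo from the earlier slide; overhead can be high on busy systems):
  # /usr/share/bcc/tools/offcputime -df -p $(pgrep -nx java) 30 > out.offcpu01
  # ./FlameGraph/flamegraph.pl --color=io --countname=us < out.offcpu01 > offcpu.svg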
59. 6. Tuning
60. Ubuntu Bionic Tuning: Late 2019 (1/2)
• CPU
    schedtool -B PID
    disable Ubuntu apport (crash reporter)
    upgrade to Bionic (scheduling improvements)
• Virtual Memory
    vm.swappiness = 0                       # from 60
• Memory
    echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
    kernel.numa_balancing = 0
• File System
    vm.dirty_ratio = 80                     # from 40
    vm.dirty_background_ratio = 5           # from 10
    vm.dirty_expire_centisecs = 12000       # from 3000
    mount -o defaults,noatime,discard,nobarrier …
• Storage I/O
    /sys/block/*/queue/rq_affinity      1   # or 2
    /sys/block/*/queue/scheduler        kyber
    /sys/block/*/queue/nr_requests      256
    /sys/block/*/queue/read_ahead_kb    128
    mdadm --chunk=64 …
61. Ubuntu Bionic Tuning: Late 2019 (2/2)
• Networking
    net.core.default_qdisc = fq
    net.core.netdev_max_backlog = 5000
    net.core.rmem_max = 16777216
    net.core.somaxconn = 1024
    net.core.wmem_max = 16777216
    net.ipv4.ip_local_port_range = 10240 65535
    net.ipv4.tcp_abort_on_overflow = 1            # maybe
    net.ipv4.tcp_congestion_control = bbr
    net.ipv4.tcp_max_syn_backlog = 8192
    net.ipv4.tcp_rmem = 4096 12582912 16777216    # or 8388608 ...
    net.ipv4.tcp_slow_start_after_idle = 0
    net.ipv4.tcp_syn_retries = 2
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_wmem = 4096 12582912 16777216    # or 8388608 ...
• Hypervisor
    echo tsc > /sys/devices/…/current_clocksource
    Plus use AWS Nitro
• Other
    net.core.bpf_jit_enable = 1
    sysctl -w kernel.perf_event_max_stack=1000
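• One way to apply and persist the sysctl settings above (a sketch; the value is the example from these slides, and /etc/sysctl.d/ is the standard drop-in directory):
    # sysctl -w net.ipv4.tcp_congestion_control=bbr                               # apply now
    # echo 'net.ipv4.tcp_congestion_control = bbr' >> /etc/sysctl.d/99-tuning.conf
    # sysctl -p /etc/sysctl.d/99-tuning.conf                                      # reload from file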
62. Takeaways
Systems Performance is:
Observability, Methodologies, Benchmarking, Profiling, Tracing, Tuning
Print out for your office wall:
1. uptime
2. dmesg -T | tail
3. vmstat 1
4. mpstat -P ALL 1
5. pidstat 1
6. iostat -xz 1
7. free -m
8. sar -n DEV 1
9. sar -n TCP,ETCP 1
10. top
63. Links
Netflix Tech Blog on Linux:
● http://techblog.netflix.com/2015/11/linux-performance-analysis-in-60s.html
● http://techblog.netflix.com/2015/08/netflix-at-velocity-2015-linux.html
Linux Performance:
● http://www.brendangregg.com/linuxperf.html
Linux perf:
● https://perf.wiki.kernel.org/index.php/Main_Page
● http://www.brendangregg.com/perf.html
Linux ftrace:
● https://www.kernel.org/doc/Documentation/trace/ftrace.txt
● https://github.com/brendangregg/perf-tools
Linux BPF:
● http://www.brendangregg.com/ebpf.html
● http://www.brendangregg.com/bpf-performance-tools-book.html
● https://github.com/iovisor/bcc
● https://github.com/iovisor/bpftrace
Methodologies:
● http://www.brendangregg.com/USEmethod/use-linux.html
● http://www.brendangregg.com/activebenchmarking.html
Flame Graphs & FlameScope:
● http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
● http://queue.acm.org/detail.cfm?id=2927301
● https://github.com/Netflix/flamescope
MSRs and PMCs:
● https://github.com/brendangregg/msr-cloud-tools
● https://github.com/brendangregg/pmc-cloud-tools
BPF Performance Tools:
● http://www.brendangregg.com/bpf-performance-tools-book.html
64. Thanks
• Questions?
• http://slideshare.net/brendangregg
• http://www.brendangregg.com
• bgregg@netflix.com
• @brendangregg
Look out for the 2nd Ed.
USENIX LISA 2019, Portland, Oct 28-30