Serving Netflix Video at 400Gb/s on FreeBSD
1. Serving Netflix Video at 400Gb/s on FreeBSD
Drew Gallatin
EuroBSDCon 2021
2. Outline:
● Motivation
● Description of production platform
● Description of workload
● To NUMA or not to NUMA?
● Inline Hardware (NIC) kTLS
● Alternate platforms
3. Motivation:
● Since 2020, Netflix has been able to
serve 200Gb/s of TLS encrypted video
traffic from a single server.
● How can we serve ~400Gb/s of video
from the same servers?
4. Netflix Video Serving Workload
● FreeBSD-current
● NGINX web server
● Video served via sendfile(2) and
encrypted using software kTLS
5. Netflix Video Serving Hardware
● AMD EPYC 7502P (“Rome”)
○ 32 cores @ 2.5GHz
○ 256GB DDR4-3200
■ 8 channels
■ ~150GB/s mem bw
● Or ~1.2Tb/s in networking units
○ 128 lanes PCIe Gen4
■ ~250GB/s of IO bandwidth
● Or ~2Tb/s in networking units
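As a rough sanity check on those figures (rounded, from the DDR4-3200 numbers on this slide): 8 channels × 3200 MT/s × 8 bytes per transfer ≈ 205GB/s theoretical peak, of which roughly 150GB/s is achievable in practice; 150GB/s × 8 bits/byte = 1.2Tb/s, which is where the "networking units" figure comes from.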
6. Netflix Video Serving Hardware
● 2x Mellanox ConnectX-6 Dx
○ Gen4 x16, 2 full speed 100GbE ports per NIC
■ 4 x 100GbE in total
○ Support for NIC kTLS offload
● 18x WD SN720 NVME
○ 2TB
○ PCIe Gen3 x4
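A hedged aside on the disk side of the budget (the per-drive number is an assumption about the SN720, not from the slides): a Gen3 x4 NVMe drive sustains roughly 3GB/s of sequential reads, so 18 drives give somewhere around 55GB/s of aggregate read bandwidth, just above the ~50GB/s needed for 400Gb/s of video.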
7. Performance Results:
● 240Gb/s
● Limited by memory BW
○ Determined empirically by using
AMDuProfPCM
8. Netflix 400Gb/s Video Serving Data Flow
[Diagram: bulk data flows from Disks (50GB/s) into Memory, through the CPU for encryption, back to Memory, and out to the Network Card (50GB/s); the CPU path also carries metadata]
Using sendfile and software kTLS, data is encrypted by the host CPU.
400Gb/s == 50GB/s. ~200GB/sec of memory bandwidth and ~64 PCIe Gen 4 lanes are needed to serve 400Gb/s.
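The ~200GB/sec figure follows from the bulk data touching host memory four times with software kTLS: the disk DMA writes ~50GB/s, the CPU reads ~50GB/s of plaintext and writes ~50GB/s of ciphertext during encryption, and the NIC DMA reads ~50GB/s, for roughly 200GB/s of aggregate memory bandwidth.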
9. Can NUMA get us to 400Gb/s?
● Use STREAM benchmark bandwidth as a proxy
○ Single Node: 150GB/s
○ Four Nodes: 175GB/s
10. What is NUMA?
Non-Uniform Memory Access
That means memory and/or devices can
be “closer” to some CPU cores
11. Multi CPU Before NUMA
[Diagram: two CPUs share a North Bridge, which connects to all Memory, Disks, and Network Cards]
Memory access was UNIFORM: each core had equal and direct access to all memory and IO devices.
12. Multi Socket system with NUMA:
[Diagram: two CPUs joined by a NUMA bus, each with its own directly attached Memory, Disks, and Network Card]
Memory access can be NON-UNIFORM:
● Each core has unequal access to memory
● Each core has unequal access to I/O devices
13. Present day NUMA:
[Diagram: Node 0 and Node 1, each containing a CPU with its own Memory, Disks, and Network Card, connected by the NUMA bus]
Each locality zone is called a "NUMA Domain" or "NUMA Node".
14. 4 Node configurations are
common on AMD EPYC
15. Cross-Domain costs
Latency Penalties:
● 12-28ns
16. Cross-Domain costs
Bandwidth Limit:
● AMD Infinity Fabric
○ ~47GB/s per link
○ ~280GB/s total
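(The two numbers are consistent with a fully connected four-node mesh: six inter-node links at ~47GB/s each is 6 × 47 ≈ 282GB/s. The link count is an inference from the per-link figure, not a number from the slide.)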
17. Strategy: Keep as much of our
200GB/sec of bulk data off the
NUMA fabric as possible
● Bulk data congests NUMA fabric and leads to
CPU stalls when competing with normal memory
accesses.
18. 4 Nodes, worst case
Steps to send data:
19. 4 Nodes, worst case
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
20. 4 Nodes, worst case
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
● CPU reads data for encryption
○ Second NUMA crossing
21. 4 Nodes, worst case
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
● CPU reads data for encryption
○ Second NUMA crossing
● CPU writes data for encryption
○ Third NUMA crossing
22. 4 Nodes, worst case
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
● CPU reads data for encryption
○ Second NUMA crossing
● CPU writes data for encryption
○ Third NUMA crossing
● DMA data from memory to network
○ Fourth NUMA crossing
23. Worst Case Summary:
● 4 NUMA crossings
● 200GB/s of data on the NUMA fabric
○ Fabric saturates, cannot handle the load.
○ CPU Stalls, saturates early
24. Best Case Summary:
● 0 NUMA crossings
● 0GB/s of data on the NUMA
fabric
25. How can we get as close as
possible to the best case?
● Constrained to use 1 IP address per host
● Must use lagg(4) LACP network bonding
26. Impose order on the chaos..
somehow:
● Disk centric siloing
○ Try to do everything on the NUMA node where
the content is stored
● Network centric siloing
○ Try to do as much as we can on the NUMA
node that the LACP partner chose for us
27. Network centric siloing
● Associate network connections with NUMA nodes
● Allocate local memory to back media files when
they are DMA’ed from disk
● Allocate local memory for TLS crypto destination
buffers & do SW crypto locally
● Run kTLS workers, RACK / BBR TCP pacers with
domain affinity
● Choose local lagg(4) egress port
All of this is upstream!
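One userspace-visible piece of the "associate connections with NUMA nodes" item can be sketched as follows. This is a minimal illustration, not Netflix's actual nginx integration; it assumes FreeBSD 13+ with the TCP_REUSPORT_LB_NUMA socket option, and the helper name and port are made up. A per-domain worker creates a listener in a SO_REUSEPORT_LB group and asks to be handed the connections arriving on its own NUMA domain.

/*
 * Minimal sketch (illustrative only): a worker that serves NUMA domain
 * "dom" creates a listener in a SO_REUSEPORT_LB group and asks for
 * connections arriving on that domain.  Assumes FreeBSD 13+ with
 * TCP_REUSPORT_LB_NUMA; error handling omitted for brevity.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>

int
make_domain_listener(int dom)
{
	struct sockaddr_in sin;
	int s, one = 1;

	s = socket(AF_INET, SOCK_STREAM, 0);
	setsockopt(s, SOL_SOCKET, SO_REUSEPORT_LB, &one, sizeof(one));

	memset(&sin, 0, sizeof(sin));
	sin.sin_len = sizeof(sin);
	sin.sin_family = AF_INET;
	sin.sin_port = htons(443);	/* illustrative port */
	bind(s, (struct sockaddr *)&sin, sizeof(sin));
	listen(s, -1);

	/* Prefer connections whose ingress path is on our NUMA domain. */
	setsockopt(s, IPPROTO_TCP, TCP_REUSPORT_LB_NUMA, &dom, sizeof(dom));

	return (s);
}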
28. 4 Nodes, worst case with siloing
Steps to send data:
29. 4 Nodes, worst case with siloing
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
30. 4 Nodes, worst case with siloing
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
● CPU reads data for encryption
31. 4 Nodes, worst case with siloing
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
● CPU reads data for encryption
● CPU writes data for encryption
32. 4 Nodes, worst case with siloing
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
● CPU reads data for encryption
● CPU writes data for encryption
● DMA data from memory to network
33. Worst Case Summary:
● 1 NUMA crossing on average
○ 100% of disk reads across NUMA
● 50GB/s of data on each NUMA fabric link
○ Much less than the 280GB/sec of Infinity Fabric
bandwidth
34. Real Life is Messy
● NICs on only 2 of the 4 NUMA nodes
● Differing number of NVME on each node
● Hacks to “pretend” we have NICs in all 4 domains
● Impacts worst and average cases
35. 4 Nodes, worst case with siloing: messy
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
● CPU reads data for encryption
● CPU writes data for encryption
● DMA data from memory to network
36. Worst Case Summary:
● 2 NUMA crossings on average
○ 100% of disk reads across NUMA
○ 100% of network writes across NUMA
● 100GB/s of data on the NUMA fabric
○ Less than the 280GB/s of Infinity Fabric bandwidth
37. Average Case Summary:
● 1.25 NUMA crossings on average
○ 75% of disk reads across NUMA
○ 50% of NIC transmits across NUMA due to
unbalanced setup
● 62.5 GB/sec of data on NUMA fabric
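The 1.25 figure is just the weighted sum of the two crossing types: 0.75 crossings per byte for disk reads (75% remote) plus 0.5 for NIC transmits (50% remote). At 50GB/s of served data, that is 1.25 × 50GB/s = 62.5GB/s on the fabric.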
38. Performance: 1 vs 4 nodes
39. Would NIC based kTLS
offload help for 400Gb/s ?
40.-41. Netflix 400Gb/s Video Serving Data Flow (recap)
[Diagram: bulk data flows from Disks (50GB/s) into Memory, through the CPU for encryption, back to Memory, and out to the Network Card (50GB/s)]
Using sendfile and software kTLS, data is encrypted by the host CPU.
400Gb/s == 50GB/s. ~200GB/sec of memory bandwidth and ~64 PCIe Gen 4 lanes are needed to serve 400Gb/s.
42. Netflix 400Gb/s Video Serving Data Flow
[Diagram: bulk data flows from Disks (50GB/s) into Memory and directly out to the Network Card (50GB/s); the CPU handles only metadata]
Using sendfile and NIC kTLS, data is encrypted by the NIC.
400Gb/s == 50GB/s. ~100GB/sec of memory bandwidth and ~64 PCIe Gen 4 lanes are needed to serve 400Gb/s.
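Put differently, with NIC kTLS the bulk data touches host memory only twice: the ~50GB/s written by disk DMA plus the ~50GB/s read by the NIC's DMA, so the memory-bandwidth requirement drops to roughly 100GB/s.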
43. What is NIC kTLS?:
● Hardware Inline TLS
● TLS session is established in userspace.
● When crypto is moved to the kernel, the kernel
passes crypto keys to the NIC
● TLS records are encrypted by NIC as the data
flows through it on transmit
○ No more detour through the CPU for crypto
○ This cuts memory BW requirements in half!
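For context, the handoff described above can be sketched with FreeBSD's kTLS socket interface; this is roughly what OpenSSL's KTLS support does on the application's behalf, with illustrative key material and TLS version, and it is a sketch rather than Netflix's exact code path. Whether the resulting session is encrypted in software or inline on the NIC is decided by the kernel and driver, not by this call.

/*
 * Minimal sketch: after the TLS handshake completes in userspace, hand the
 * negotiated AES-128-GCM transmit keys to the kernel.  The kernel (or the
 * NIC, with inline "NIC kTLS") then frames and encrypts TLS records for
 * data sent with write(2)/sendfile(2).  Illustrative values; error
 * handling omitted.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ktls.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <crypto/cryptodev.h>
#include <stdint.h>
#include <string.h>

static int
enable_tx_ktls(int s, const uint8_t *key, const uint8_t *implicit_iv)
{
	struct tls_enable en;

	memset(&en, 0, sizeof(en));
	en.cipher_algorithm = CRYPTO_AES_NIST_GCM_16;
	en.cipher_key = key;
	en.cipher_key_len = 16;			/* AES-128-GCM */
	en.iv = implicit_iv;
	en.iv_len = 4;				/* TLS 1.2 implicit nonce */
	en.tls_vmajor = TLS_MAJOR_VER_ONE;	/* TLS 1.x */
	en.tls_vminor = TLS_MINOR_VER_TWO;	/* TLS 1.2 */

	return (setsockopt(s, IPPROTO_TCP, TCP_TXTLS_ENABLE, &en, sizeof(en)));
}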
44. Mellanox ConnectX-6 Dx
● Offloads TLS 1.2 and 1.3 for AES GCM cipher
● Retains crypto state within a TLS record
○ Means that the TCP stack can send partial
TLS records without performance loss
● If a packet is sent out of order (e.g., a TCP
retransmit), the host must re-DMA the TLS record containing
the out-of-order packet
45. CX6-DX: In-order Transmit
46.-55. [Animation frames: a plaintext TLS record sits in host memory as TCP segments at byte offsets 0 through 15928, 1448 bytes apart. One segment at a time is DMAed across the PCIe bus to the NIC, encrypted inline, and transmitted on the 100GbE network in order, with the NIC carrying crypto state forward from one segment of the record to the next.]
56. CX6-DX: TCP Retransmit
57.-60. [Animation frames: one TCP segment of the record (offset 15928) must be retransmitted. Because the NIC only retains crypto state within a TLS record sent in order, the host re-DMAs the plaintext TLS record containing the out-of-order segment so the NIC can re-encrypt it and resend just that segment.]
61. CX6-DX: Initial Results
Peak: 125Gb/s per NIC (~250Gb/s total)
Sustained: 75Gb/s per NIC (~150Gb/s total)
● Pre-release Firmware
62. CX6-DX: Initial performance
● NIC stores TLS state per-session
● We have a lot of sessions active
○ (~400k sessions for 400Gb/s)
○ Performance gets worse the more sessions we
add
● Limited memory on-board NIC
○ NIC pages in and out to buffers in host RAM
○ Buffers managed by NIC
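(For scale, ~400k sessions at 400Gb/s works out to an average of roughly 1Mb/s per active session.)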
63. PCIe Relaxed Ordering
● Allows PCIe transactions to pass each other
○ Should eliminate pipeline bubbles due to “slow”
reads delaying fast ones.
○ May help with “paging in” TLS connection state
● Enabled Relaxed Ordering
○ Didn’t help
○ Turns out CX6-DX pre-release firmware hardcoded
Relaxed Ordering to disabled
64. CX6-DX: Results from next firmware
● Firmware update enabled Relaxed Ordering on NIC
● Peak results improved:
160Gb/s per NIC (~320Gb/s total)
● Note that peak and sustained were effectively
identical from this fw update forward.
● This is a new record!
● Nearly as fast as SW TLS (per NIC): 160Gb/s vs
190Gb/s, much faster overall
65. CX6-DX: Results from production fw
● Firmware update added “TLS_OPTIMIZE” setting
● Peak & sustained results improved:
190Gb/s per NIC (~380Gb/s total)!
66. CX6-DX: What's needed to use NIC TLS in production at Netflix?
● QoE testing
○ Measure various factors, such as rebuffer rate,
play delay, time to quality, etc.
○ Initial results are great
○ Larger, more complete study scheduled soon.
67. CX6-DX: What's needed to use NIC TLS in production at Netflix?
● Track retransmits & move sessions to software
○ Monitor bytes retransmitted for lossy networks
○ Monitor segments retransmitted to protect against
attacks
68. CX6-DX: Mixed HW/SW session perf?
● Moving a non-trivial percentage of conns to SW has an
unanticipated BW cost.
● Setting SW switch threshold to 1% bytes retransmitted
moves ⅓ of conns to SW
● Max stable BW moves from 380Gb/s to 350Gb/s with
roughly ⅓ of connections in SW
○ Performance impact is more than expected
69. 4 Nodes, worst case with siloing +
NIC kTLS
Steps to send data:
70. 4 Nodes, worst case with siloing +
NIC kTLS
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
71. 4 Nodes, worst case with siloing +
NIC kTLS
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
● DMA data from memory to network
72. 4 Nodes, worst case with siloing +
NIC kTLS
Steps to send data:
● DMA data from disk to memory
○ First NUMA bus crossing
● DMA data from memory to network
73. Worst Case Summary:
● 2 NUMA crossings on average
○ 100% of disk reads across NUMA
○ 100% of network writes across NUMA
● 100GB/s of data on the NUMA fabric
○ Less than the 280GB/s of Infinity Fabric bandwidth
74. Average Case Summary:
● 1.25 NUMA crossings on average
○ 75% of disk reads across NUMA
○ 50% of NIC transmits across NUMA due to
unbalanced setup
● 62.5 GB/sec of data on NUMA fabric
76. Other platforms? Ampere Altra
● “Mt. Snow”
○ Q80-30: 80 3.0GHz Arm Neoverse-N1 cores
○ 256GB DDR4-3200 across 8 channels
○ 128 Lanes Gen4 PCIe
○ 16x WD SN720 2TB NVMe
○ 2 Mellanox CX6-DX NICs
77. Other platforms? Ampere Altra
● Minimal access to system counters
○ No way to see memory BW usage
○ No way to see IO bandwidth or latency
○ Leads to feeling like you’re driving blind
78. Other platforms? Ampere Altra
● Poor performance with SW kTLS:
○ CPU limited at 180Gb/s
● Poor initial performance with NIC TLS
○ PCIe limited at 240Gb/s
■ Very low CPU utilization
■ NICs saturated, and we see lots of output
drops
79. Ampere: PCIe Extended Tags
● Poor initial performance with NIC TLS: 240Gb/s
● Very low CPU utilization
● NICs saturated, and we see lots of output drops
● Seems like a PCIe problem
80. Ampere: PCIe Extended Tags
● PCIe is more of a network than a bus
● Number of outstanding DMA reads is limited by the
number of PCIe “tags”
● PCIe tag space is 5-bits by default, allowing for 32
DMAs to be in-flight at the same time
● PCIe extended tags increase the tag space to 8 bits,
allowing 256 DMA reads in flight at the same time
● Like increasing TCP window size.
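A rough bandwidth-delay-product style illustration of why the tag limit bites (the request size and latency are assumptions for illustration, not measurements): with 512-byte read requests and a ~1µs completion latency, each NIC's 32 tags allow only 32 × 512B = 16KB of reads in flight, capping that NIC's DMA reads near 16GB/s (~128Gb/s); 256 tags raise the per-NIC ceiling to roughly 128GB/s, far more than a Gen4 x16 link can carry.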
81. Ampere: PCIe Extended Tags
● After enabling extended tags, we see a bandwidth
improvement:
240Gb/s -> 320Gb/s
82. Other platforms? Intel Ice Lake Xeon
● 8352V CPU
○ 36 cores, 2.1GHz
○ 8 channels 256GB DDR4-3200 (running at 2933)
○ 64 Lanes Gen4 PCIe
○ 20x Kioxia 4TB NVMe (PCIe Gen4)
○ 2 Mellanox CX6-DX NICs
83. Intel Ice Lake Xeon
● 230Gb/s SW kTLS
○ Limited by memory BW
84. Intel Ice Lake Xeon (WIP)
● 230Gb/s SW kTLS
○ Limited by memory BW
■ 8352V runs memory at 2933; other SKUs run
at 3200
● Would expect the same performance as
AMD from that
● BIOS locked out PCIe Relaxed Ordering, so no NIC
kTLS results yet
87. But wait, there’s …. not … more..
● 800Gb prototype sitting on a datacenter floor due to
a shipping exception?
● Something to talk about
next year?
88. Many thanks to:
● Warren Harrop & the Netflix Open
Connect hardware team for putting
together the testbed.
● FreeBSD developers for making such an
awesome OS
Slides at:
https://people.freebsd.org/~gallatin/talks/euro2021.pdf
89. Disk centric siloing
● Associate disk controllers with NUMA nodes
● Associate NUMA affinity with files
● Associate network connections with NUMA nodes
● Move connections to be “close” to the disk where the content’s file is stored.
● After the connection is moved, there will be 0 NUMA crossings!
90. Disk centric siloing problems
● No way to tell link partner that we want LACP to
direct traffic to a different switch/router port
○ So TCP acks and http requests will come in on
the “wrong” port
● Moving connections can lead to TCP re-ordering
due to using multiple egress NICs
● Some clients issue http GET requests for different
content on the same TCP connection
○ Content may be on different NUMA domains!
91. Disk centric siloing problems
● Different numbers of NVME drives on each
domain
○ Node 3 has 3x the number of NVME drives as
Node 0
● Content popularity differences can lead to hot and
cold disks
● All of this adds up to uneven use of each NUMA
node.
○ Output limited by the hottest NUMA node
92. Disk centric siloing problems
● Moving established NIC TLS sessions to a
different egress NIC is painful
93. Disk centric siloing problems
● Moving NIC TLS sessions is expensive
○ Session will be established before content
location is known
94. AMD: NUMA w/NIC kTLS Offload
● Disk Siloing
○ Allocate host pages to back files on NUMA node
close to NVME, not NIC
○ Eliminates the 0.75 crossings for 4 domains with
NVME
○ Still have the 0.5 crossings on average for
remapped NICs
95. AMD: NUMA w/NIC kTLS Offload
● Disk Siloing
○ Assumes equal number of NVME on each node
○ Actual machine has:
■ Node 0: 2 NVME
■ Node 1: 6 NVME
■ Node 2: 4 NVME
■ Node 3: 6 NVME
96. AMD: NUMA w/NIC kTLS Offload
● Disk Siloing
○ Peak of ~300Gb/s
○ Traffic unequal due to more NVME on Node 3
○ Output drops on mce3 (NIC port on Node 3) at
98Gb/s, while mce0 (NIC port on Node 1) is mostly
idle at 40Gb/s
○ Tried “remapping” NVME and pretending some
drives are in different domains
97. AMD: NUMA w/NIC kTLS Offload
● Disk Siloing
○ Pretended some of Node 3’s NVME drives were in
Node 1
■ Reached a peak of ~350Gb/s
■ Output still uneven between domains because
of uneven popularity of content on different
NVME drives
○ Sharding based on network (LACP) is far more even