Optimizing Kubernetes Resource Requests/Limits
1. Optimizing Kubernetes Resource Requests/Limits
JAX DEVOPS LONDON, 2019-05-15
HENNING JACOBS, @try_except_
2. EUROPE’S LEADING ONLINE FASHION PLATFORM
3. ZALANDO AT A GLANCE
~ 5.4 billion EUR revenue 2018
> 250 million visits per month
> 15,000 employees in Europe
> 79% of visits via mobile devices
> 26 million active customers
> 300,000 product choices
~ 2,000 brands
17 countries
4. SCALE
380 Accounts
118 Clusters
5. DEVELOPERS USING KUBERNETES
7. Is this a lot? Is this cost efficient?
8. ¯\_(ツ)_/¯
Do you know your per unit costs?
9. THE MAGIC DIAL
Speed & Stability ⇒ Overprovision ⇒ Higher Cost
Efficiency & Risk ⇒ Overcommit ⇒ Lower Cost
10. THE BASICS
11. KUBERNETES: IT'S ALL ABOUT RESOURCES
Pods demand capacity. Nodes offer capacity.
The scheduler matches the two.
12. COMPUTE RESOURCE TYPES
● CPU
● Memory
● Local ephemeral storage (1.12+)
● Extended Resources (see the sketch below)
○ GPU
○ TPU?
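Extended resources are requested like CPU and memory. A minimal sketch for one GPU, assuming the NVIDIA device plugin (which registers nvidia.com/gpu) is installed:

# Sketch: request one GPU via an extended resource. Extended resources
# cannot be overcommitted: if requests are given, they must equal limits.
resources:
  limits:
    nvidia.com/gpu: 1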
13. KUBERNETES RESOURCES
● CPU
○ Base: 1 AWS vCPU (or GCP Core or ..)
○ Example: 100m (0.1 vCPU, "100 millicores")
● Memory
○ Base: 1 byte
○ Example: 500Mi (500 MiB memory)
14. REQUESTS / LIMITS
● Requests
○ Affect scheduling decisions
○ Priority (CPU shares, OOM adjust)
● Limits
○ Cap maximum container usage

resources:
  requests:
    cpu: 100m
    memory: 300Mi
  limits:
    cpu: 1
    memory: 300Mi
15. REQUESTS: POD SCHEDULING
[Diagram: Pods 1-3 placed on Nodes 1 and 2 according to their CPU and memory requests]
16. POD SCHEDULING
[Diagram: a new Pod 4 arrives; Nodes 1 and 2 each have some free CPU and memory]
17. POD SCHEDULING: TRY TO FIT
[Diagram: the scheduler tries to fit Pod 4's requests into the remaining capacity of Nodes 1 and 2]
18. POD SCHEDULING: NO CAPACITY
[Diagram: Pod 4 fits on neither Node 1 nor Node 2 and stays "PENDING"]
19. REQUESTS: CPU SHARES

kubectl run --requests=cpu=10m ..sha512()..
kubectl run --requests=cpu=5m ..sha512()..

cat /sys/fs/cgroup/cpu/kubepods/burstable/pod5d5..0d/cpu.shares
10            // relative share of CPU time

cat /sys/fs/cgroup/cpu/kubepods/burstable/pod6e0..0d/cpu.shares
5             // relative share of CPU time

cat /sys/fs/cgroup/cpuacct/kubepods/burstable/pod5d5..0d/cpuacct.usage \
    /sys/fs/cgroup/cpuacct/kubepods/burstable/pod6e0..0d/cpuacct.usage
13432815283   // total CPU time in nanoseconds
7528759332    // total CPU time in nanoseconds
20. LIMITS: COMPRESSIBLE RESOURCES
Compressible resources can be taken away quickly; they "only" cause slowness.
CPU throttling: a 200m CPU limit ⇒ the container can use 0.2s of CPU time per second.
21. CPU THROTTLING

docker run --cpus CPUS -it python
python -m timeit -s 'import hashlib' -n 10000 -v 'hashlib.sha512().update(b"foo")'

CPUS=1.0   3.8 - 4ms
CPUS=0.5   3.8 - 52ms
CPUS=0.2   6.8 - 88ms
CPUS=0.1   5.7 - 190ms

More CPU throttling ⇒ slower hash computation.
22. LIMITS: NON-COMPRESSIBLE RESOURCES
Non-compressible resources hold state and are slower to take away.
⇒ Killing (OOMKill)
23. MEMORY LIMITS: OUT OF MEMORY

kubectl get pod
NAME                      READY   STATUS             RESTARTS   AGE
kube-ops-view-7bc-tcwkt   0/1     CrashLoopBackOff   3          2m

kubectl describe pod kube-ops-view-7bc-tcwkt
...
Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
24. QUALITY OF SERVICE (QOS)
● Guaranteed: all containers have limits == requests
● Burstable: some containers have limits > requests
● BestEffort: no requests/limits set

kubectl describe pod …
Limits:
  memory: 100Mi
Requests:
  cpu:    100m
  memory: 100Mi
QoS Class: Burstable
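For comparison, a minimal sketch of a resources block that would land the pod in the Guaranteed class (limits equal to requests for every resource of every container):

# Sketch: Guaranteed QoS, limits == requests for both CPU and memory.
resources:
  requests:
    cpu: 100m
    memory: 100Mi
  limits:
    cpu: 100m
    memory: 100Mi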
25. OVERCOMMIT
Limits > Requests ⇒ Burstable QoS ⇒ Overcommit
For CPU: fine, you only run into completely fair scheduling.
For memory: fine as long as demand < node capacity; you might run into unpredictable OOM situations when demand reaches the node's memory capacity (Kernel OOM Killer).

https://code.fb.com/production-engineering/oomd/
26. LIMITS: CGROUPS

docker run --cpus 1 -m 200m --rm -it busybox

cat /sys/fs/cgroup/cpu/docker/8ab25..1c/cpu.{shares,cfs_*}
1024     // cpu.shares (default value)
100000   // cpu.cfs_period_us (100ms period length)
100000   // cpu.cfs_quota_us (total CPU time in µs consumable per period)

cat /sys/fs/cgroup/memory/docker/8ab25..1c/memory.limit_in_bytes
209715200
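The same constraints expressed as a Kubernetes resources block, sketched for comparison: cpu: 1 maps to cfs_quota_us 100000 per 100000µs period, and memory: 200Mi maps to memory.limit_in_bytes 209715200 (200 × 2^20).

# Sketch: Kubernetes equivalent of "docker run --cpus 1 -m 200m".
resources:
  limits:
    cpu: 1
    memory: 200Mi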
27. LIMITS: PROBLEMS
1. CPU CFS Quota: Latency
2. Memory: accounting, OOM behavior
28. PROBLEMS: LATENCY
https://github.com/zalando-incubator/kubernetes-on-aws/pull/923
29. PROBLEMS: HARDCODED PERIOD
30. PROBLEMS: HARDCODED PERIOD
https://github.com/kubernetes/kubernetes/issues/51135
31. NOW IN KUBERNETES 1.12
https://github.com/kubernetes/kubernetes/pull/63437
32. OVERLY AGGRESSIVE CFS
Usage < limit, but heavy throttling
33. OVERLY AGGRESSIVE CFS: EXPERIMENT #1
CPU Period: 100ms
CPU Quota: None
Burn 5ms and sleep 500ms
⇒ Quota disabled
⇒ No throttling expected!

https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1
34. EXPERIMENT #1: NO QUOTA, NO THROTTLING
2018/11/03 13:04:02 [0] burn took 5ms, real time so far: 5ms, cpu time so far: 6ms
2018/11/03 13:04:03 [1] burn took 5ms, real time so far: 510ms, cpu time so far: 11ms
2018/11/03 13:04:03 [2] burn took 5ms, real time so far: 1015ms, cpu time so far: 17ms
2018/11/03 13:04:04 [3] burn took 5ms, real time so far: 1520ms, cpu time so far: 23ms
2018/11/03 13:04:04 [4] burn took 5ms, real time so far: 2025ms, cpu time so far: 29ms
2018/11/03 13:04:05 [5] burn took 5ms, real time so far: 2530ms, cpu time so far: 35ms
2018/11/03 13:04:05 [6] burn took 5ms, real time so far: 3036ms, cpu time so far: 40ms
2018/11/03 13:04:06 [7] burn took 5ms, real time so far: 3541ms, cpu time so far: 46ms
2018/11/03 13:04:06 [8] burn took 5ms, real time so far: 4046ms, cpu time so far: 52ms
2018/11/03 13:04:07 [9] burn took 5ms, real time so far: 4551ms, cpu time so far: 58ms
35. OVERLY AGGRESSIVE CFS: EXPERIMENT #2
CPU Period: 100ms
CPU Quota: 20ms
Burn 5ms and sleep 500ms
⇒ No 100ms period in which the 20ms quota could possibly be used up
⇒ No throttling expected!
36. EXPERIMENT #2: OVERLY AGGRESSIVE CFS
2018/11/03 13:05:05 [0] burn took 5ms, real time so far: 5ms, cpu time so far: 5ms
2018/11/03 13:05:06 [1] burn took 99ms, real time so far: 690ms, cpu time so far: 9ms
2018/11/03 13:05:06 [2] burn took 99ms, real time so far: 1290ms, cpu time so far: 14ms
2018/11/03 13:05:07 [3] burn took 99ms, real time so far: 1890ms, cpu time so far: 18ms
2018/11/03 13:05:07 [4] burn took 5ms, real time so far: 2395ms, cpu time so far: 24ms
2018/11/03 13:05:08 [5] burn took 94ms, real time so far: 2990ms, cpu time so far: 27ms
2018/11/03 13:05:09 [6] burn took 99ms, real time so far: 3590ms, cpu time so far: 32ms
2018/11/03 13:05:09 [7] burn took 5ms, real time so far: 4095ms, cpu time so far: 37ms
2018/11/03 13:05:10 [8] burn took 5ms, real time so far: 4600ms, cpu time so far: 43ms
2018/11/03 13:05:10 [9] burn took 5ms, real time so far: 5105ms, cpu time so far: 49ms
37. OVERLY AGGRESSIVE CFS: EXPERIMENT #3
CPU Period: 10ms
CPU Quota: 2ms
Burn 5ms and sleep 100ms
⇒ Same 20% CPU (200m) limit, but smaller period
⇒ Throttling expected!
38. SMALLER CPU PERIOD ⇒ BETTER LATENCY
2018/11/03 16:31:07 [0] burn took 18ms, real time so far: 18ms, cpu time so far: 6ms
2018/11/03 16:31:07 [1] burn took 9ms, real time so far: 128ms, cpu time so far: 8ms
2018/11/03 16:31:07 [2] burn took 9ms, real time so far: 238ms, cpu time so far: 13ms
2018/11/03 16:31:07 [3] burn took 5ms, real time so far: 343ms, cpu time so far: 18ms
2018/11/03 16:31:07 [4] burn took 30ms, real time so far: 488ms, cpu time so far: 24ms
2018/11/03 16:31:07 [5] burn took 19ms, real time so far: 608ms, cpu time so far: 29ms
2018/11/03 16:31:07 [6] burn took 9ms, real time so far: 718ms, cpu time so far: 34ms
2018/11/03 16:31:08 [7] burn took 5ms, real time so far: 824ms, cpu time so far: 40ms
2018/11/03 16:31:08 [8] burn took 5ms, real time so far: 943ms, cpu time so far: 45ms
2018/11/03 16:31:08 [9] burn took 9ms, real time so far: 1068ms, cpu time so far: 48ms
39. INCIDENT INVOLVING CPU THROTTLING
https://k8s.af
40. LIMITS: VISIBILITY

docker run --cpus 1 -m 200m --rm -it busybox top

Mem: 7369128K used, 726072K free, 128164K shrd, 303924K buff, 1208132K cached
CPU0: 14.8% usr  8.4% sys 0.2% nic 67.6% idle  8.2% io 0.0% irq 0.6% sirq
CPU1:  8.8% usr 10.3% sys 0.0% nic 75.9% idle  4.4% io 0.0% irq 0.4% sirq
CPU2:  7.3% usr  8.7% sys 0.0% nic 63.2% idle 20.1% io 0.0% irq 0.6% sirq
CPU3:  9.3% usr  9.9% sys 0.0% nic 65.7% idle 14.5% io 0.0% irq 0.4% sirq

⇒ top shows all host CPUs and the host's memory, not the container's limits.
41. LIMITS: VISIBILITY
● Container-aware memory configuration
○ JVM MaxHeap
● Container-aware processor configuration (see the sketch below)
○ Thread pools
○ GOMAXPROCS
○ node.js cluster module
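One way to make processes container-aware is the Kubernetes Downward API, which can expose a container's own resource limits as environment variables. A minimal sketch for Go's GOMAXPROCS:

# Sketch: expose the container's CPU limit via the Downward API so the
# runtime can size itself. With divisor "1", fractional CPU limits are
# rounded up to whole cores.
env:
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
        divisor: "1"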
42. KUBERNETES RESOURCES
43. ZALANDO: DECISION
1. Forbid memory overcommit
○ Implement mutating admission webhook
○ Set requests = limits (sketch below)
2. Disable CPU CFS quota in all clusters
○ --cpu-cfs-quota=false
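The effect on the resources block from slide 14, sketched: memory requests equal limits, and the CPU limit loses its meaning once CFS quota is off.

# Sketch: memory requests == limits (no memory overcommit); with
# --cpu-cfs-quota=false on the kubelet, CPU limits are not enforced,
# so only the CPU request (cpu.shares) matters.
resources:
  requests:
    cpu: 100m
    memory: 300Mi
  limits:
    memory: 300Mi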
44. INGRESS LATENCY IMPROVEMENT
45. CLUSTER AUTOSCALER
Simulates the Kubernetes scheduler internally to find out..
• ..if any of the pods wouldn't fit on existing nodes
⇒ upscale is needed
• ..if all pods of a node would fit on the other existing nodes
⇒ downscale is possible
⇒ Cluster size is determined by resource requests (+ constraints)

github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
46. AUTOSCALING BUFFER
• Cluster Autoscaler only triggers on Pending Pods
• Node provisioning is slow
⇒ Reserve extra capacity via low priority Pods
"Autoscaling Buffer Pods"
47. AUTOSCALING BUFFER

kubectl describe pod autoscaling-buffer-..zjq5 -n kube-system
...
Namespace:         kube-system
Priority:          -1000000     ⇐ evicted if a higher-priority (default) pod needs capacity
PriorityClassName: autoscaling-buffer
Containers:
  pause:
    Image: teapot/pause-amd64:3.1
    Requests:
      cpu:    1600m
      memory: 6871947673
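A sketch of how such a buffer could be declared; the resource values are taken from the pod above, while object names, labels, and replica count are assumed (use scheduling.k8s.io/v1beta1 on clusters before 1.14):

# Sketch: a negative-priority placeholder deployment. The pause container
# reserves capacity; when a normal-priority pod needs it, the buffer pod
# is preempted, goes Pending, and triggers the Cluster Autoscaler early.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: autoscaling-buffer
value: -1000000
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: autoscaling-buffer
  namespace: kube-system
spec:
  replicas: 1                    # assumed
  selector:
    matchLabels:
      app: autoscaling-buffer
  template:
    metadata:
      labels:
        app: autoscaling-buffer
    spec:
      priorityClassName: autoscaling-buffer
      containers:
        - name: pause
          image: teapot/pause-amd64:3.1
          resources:
            requests:
              cpu: 1600m
              memory: 6871947673   # bytes, as in the pod above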
48. ALLOCATABLE
Reserve resources for system components, Kubelet, and container runtime:

--system-reserved=cpu=100m,memory=164Mi
--kube-reserved=cpu=100m,memory=282Mi
49. SLACK
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests ⇒ "Slack"
[Diagram: node CPU/memory bars with the requested-but-unused share marked as slack]
50. STRANDED RESOURCES
Some available capacity can become unusable / stranded:
once one resource of a node is fully requested, the remaining capacity of the other cannot be allocated.
⇒ Reschedule, bin packing
[Diagram: Nodes 1 and 2 with free CPU stranded by exhausted memory]
51. MONITORING COST EFFICIENCY
52. KUBERNETES RESOURCE REPORT
github.com/hjacobs/kube-resource-report
53. RESOURCE REPORT: TEAMS
Sorting teams by slack costs

github.com/hjacobs/kube-resource-report
54. RESOURCE REPORT: APPLICATIONS
"Slack"
55. RESOURCE REPORT: APPLICATIONS
56. RESOURCE REPORT: CLUSTERS
"Slack"
github.com/hjacobs/kube-resource-report
57. RESOURCE REPORT METRICS
github.com/hjacobs/kube-resource-report
58. KUBERNETES APPLICATION DASHBOARD
59. https://github.com/hjacobs/kube-ops-view
60. Requested vs. used
https://github.com/hjacobs/kube-ops-view
61. OPTIMIZING COST EFFICIENCY
62. VERTICAL POD AUTOSCALER (VPA)
"Some 2/3 of the (Google) Borg users use autopilot."
- Tim Hockin
VPA: set resource requests automatically based on usage.
63. VPA FOR PROMETHEUS

apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata: ...
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: prometheus
  updatePolicy: {updateMode: Auto}
  resourcePolicy:
    containerPolicies:
      - containerName: prometheus
        minAllowed:
          memory: 512Mi
        maxAllowed:
          memory: 10Gi
64. VERTICAL POD AUTOSCALER
Limits/requests adapted by VPA
65. VERTICAL POD AUTOSCALER
66. HORIZONTAL POD AUTOSCALER

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 100   # target: ~100% of CPU requests
67. HORIZONTAL POD AUTOSCALING (CUSTOM METRICS)
Queue length, Ingress req/s, Prometheus query, ZMON check

github.com/zalando-incubator/kube-metrics-adapter
68. DOWNSCALING DURING OFF-HOURS
[Chart: replicas scaled down over the weekend]

github.com/hjacobs/kube-downscaler
69. DOWNSCALING DURING OFF-HOURS

DEFAULT_UPTIME="Mon-Fri 07:30-20:30 CET"

annotations:
  downscaler/exclude: "true"

github.com/hjacobs/kube-downscaler
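kube-downscaler also supports a per-resource schedule via annotation; a small sketch, reusing the schedule from the default above:

# Sketch: override the global uptime for a single deployment.
metadata:
  annotations:
    downscaler/uptime: "Mon-Fri 07:30-20:30 CET"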
70. ACCUMULATED WASTE
● Prototypes
● Personal test environments
● Trial runs
● Decommissioned services
● Learning/training deployments
Sounds familiar?
71. Example: Getting started with Zalenium & UI Tests
"Step by step guide to the first UI test with Zalenium running in the Continuous Delivery Platform. I was always afraid of UI tests because it looked too difficult to get started; Zalenium solved this problem for me."
72. HOUSEKEEPING
● Delete prototypes after X days
● Clean up temporary deployments
● Remove resources without owner
73. KUBERNETES JANITOR
● TTL and expiry date annotations, e.g.
○ set time-to-live for your test deployment
● Custom rules, e.g.
○ delete everything without "app" label after 7 days
github.com/hjacobs/kube-janitor
74. JANITOR TTL ANNOTATION
# let's try out nginx, but only for 1 hour
kubectl run nginx --image=nginx
kubectl annotate deploy nginx janitor/ttl=1h
github.com/hjacobs/kube-janitor
75. CUSTOM JANITOR RULES

# require "app" label for new pods starting April 2019
- id: require-app-label-april-2019
  resources:
    - deployments
    - statefulsets
  jmespath: "!(spec.template.metadata.labels.app) && metadata.creationTimestamp > '2019-04-01'"
  ttl: 7d

github.com/hjacobs/kube-janitor
76. EC2 SPOT NODES
72% savings
77. SPOT ASG / LAUNCH TEMPLATE
Not upstream in cluster-autoscaler (yet)
78. CLUSTER OVERHEAD: CONTROL PLANE
● GKE cluster: free
● EKS cluster: $146/month
● Zalando prod cluster: $635/month
(etcd nodes + master nodes + ELB)
Potential: fewer etcd nodes, no HA, shared control plane.
79. WHAT WORKED FOR US
● Disable CPU CFS Quota in all clusters
● Prevent memory overcommit
● Kubernetes Resource Report
● Downscaling during off-hours
● EC2 Spot
80. STABILITY ↔ EFFICIENCY
Stability side: Slack, Autoscaling Buffer, Disable Overcommit, Cluster Overhead
Efficiency side: Resource Report, HPA, VPA, Downscaler, Janitor, EC2 Spot
81. OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws
AWS ALB Ingress controller
github.com/zalando-incubator/kube-ingress-aws-controller
External DNS
github.com/kubernetes-incubator/external-dns
Postgres Operator
github.com/zalando/postgres-operator
Kubernetes Resource Report
github.com/hjacobs/kube-resource-report
Kubernetes Downscaler
github.com/hjacobs/kube-downscaler
Kubernetes Janitor
github.com/hjacobs/kube-janitor
82. OTHER TALKS/POSTS
• Everything You Ever Wanted to Know About Resource Scheduling
• Inside Kubernetes Resource Management (QoS) - KubeCon 2018
• Setting Resource Requests and Limits in Kubernetes (Best Practices)
• Effectively Managing Kubernetes Resources with Cost Monitoring
83. QUESTIONS?
HENNING JACOBS
HEAD OF DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k