Optimizing Kubernetes Resource Requests/Limits
1. Optimizing Kubernetes Resource Requests/Limits
JAX DEVOPS LONDON, 2019-05-15
HENNING JACOBS, @try_except_
2. EUROPE’S LEADING ONLINE FASHION PLATFORM
3. ZALANDO AT A GLANCE
~ 5.4 billion EUR revenue 2018
> 250 million visits per month
> 15,000 employees in Europe
> 79% of visits via mobile devices
> 26 million active customers
> 300,000 product choices
~ 2,000 brands
17 countries
4. SCALE
380 Accounts
118 Clusters
5. DEVELOPERS USING KUBERNETES
7. Is this a lot? Is this cost efficient?
8. ¯\_(ツ)_/¯
Do you know your per unit costs?
9. THE MAGIC DIAL
Speed & Stability ⇒ Overprovision ⇒ Higher Cost
Efficiency & Risk ⇒ Overcommit ⇒ Lower Cost
10. THE BASICS
11. KUBERNETES: IT'S ALL ABOUT RESOURCES
Pods demand capacity. Nodes offer capacity.
The scheduler matches the two.
12. COMPUTE RESOURCE TYPES
● CPU
● Memory
● Local ephemeral storage (1.12+)
● Extended Resources (see the sketch below)
○ GPU
○ TPU?
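Extended resources are requested like CPU and memory. A minimal sketch for one GPU, assuming the NVIDIA device plugin (which registers nvidia.com/gpu) is installed:

# Sketch: request one GPU via an extended resource. Extended resources
# cannot be overcommitted: if requests are given, they must equal limits.
resources:
  limits:
    nvidia.com/gpu: 1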
13. KUBERNETES RESOURCES
● CPU
○ Base: 1 AWS vCPU (or GCP Core or ..)
○ Example: 100m (0.1 vCPU, "100 millicores")
● Memory
○ Base: 1 byte
○ Example: 500Mi (500 MiB memory)
14. REQUESTS / LIMITS
● Requests
○ Affect scheduling decisions
○ Priority (CPU shares, OOM adjust)
● Limits
○ Cap maximum container usage

resources:
  requests:
    cpu: 100m
    memory: 300Mi
  limits:
    cpu: 1
    memory: 300Mi
15. REQUESTS: POD SCHEDULING
[Diagram: Pods 1-3 placed on Nodes 1 and 2 according to their CPU and memory requests]
16. POD SCHEDULING
[Diagram: a new Pod 4 arrives; Nodes 1 and 2 each have some free CPU and memory]
17. POD SCHEDULING: TRY TO FIT
[Diagram: the scheduler tries to fit Pod 4's requests into the remaining capacity of Nodes 1 and 2]
18. POD SCHEDULING: NO CAPACITY
[Diagram: Pod 4 fits on neither Node 1 nor Node 2 and stays "PENDING"]
19. REQUESTS: CPU SHARES

kubectl run --requests=cpu=10m ..sha512()..
kubectl run --requests=cpu=5m ..sha512()..

cat /sys/fs/cgroup/cpu/kubepods/burstable/pod5d5..0d/cpu.shares
10            // relative share of CPU time

cat /sys/fs/cgroup/cpu/kubepods/burstable/pod6e0..0d/cpu.shares
5             // relative share of CPU time

cat /sys/fs/cgroup/cpuacct/kubepods/burstable/pod5d5..0d/cpuacct.usage \
    /sys/fs/cgroup/cpuacct/kubepods/burstable/pod6e0..0d/cpuacct.usage
13432815283   // total CPU time in nanoseconds
7528759332    // total CPU time in nanoseconds
20. LIMITS: COMPRESSIBLE RESOURCES
Compressible resources can be taken away quickly; they "only" cause slowness.
CPU throttling: a 200m CPU limit ⇒ the container can use 0.2s of CPU time per second.
21. CPU THROTTLING

docker run --cpus CPUS -it python
python -m timeit -s 'import hashlib' -n 10000 -v 'hashlib.sha512().update(b"foo")'

CPUS=1.0   3.8 - 4ms
CPUS=0.5   3.8 - 52ms
CPUS=0.2   6.8 - 88ms
CPUS=0.1   5.7 - 190ms

More CPU throttling ⇒ slower hash computation.
22. LIMITS: NON-COMPRESSIBLE RESOURCES
Non-compressible resources hold state and are slower to take away.
⇒ Killing (OOMKill)
23. MEMORY LIMITS: OUT OF MEMORY

kubectl get pod
NAME                      READY   STATUS             RESTARTS   AGE
kube-ops-view-7bc-tcwkt   0/1     CrashLoopBackOff   3          2m

kubectl describe pod kube-ops-view-7bc-tcwkt
...
Last State:  Terminated
  Reason:    OOMKilled
  Exit Code: 137
24. QUALITY OF SERVICE (QOS)
● Guaranteed: all containers have limits == requests
● Burstable: some containers have limits > requests
● BestEffort: no requests/limits set

kubectl describe pod …
Limits:
  memory: 100Mi
Requests:
  cpu:    100m
  memory: 100Mi
QoS Class: Burstable
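For comparison, a minimal sketch of a resources block that would land the pod in the Guaranteed class (limits equal to requests for every resource of every container):

# Sketch: Guaranteed QoS, limits == requests for both CPU and memory.
resources:
  requests:
    cpu: 100m
    memory: 100Mi
  limits:
    cpu: 100m
    memory: 100Mi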
25. OVERCOMMIT
Limits > Requests ⇒ Burstable QoS ⇒ Overcommit
For CPU: fine, you only run into completely fair scheduling.
For memory: fine as long as demand < node capacity; you might run into unpredictable OOM situations when demand reaches the node's memory capacity (Kernel OOM Killer).

https://code.fb.com/production-engineering/oomd/
26. LIMITS: CGROUPS

docker run --cpus 1 -m 200m --rm -it busybox

cat /sys/fs/cgroup/cpu/docker/8ab25..1c/cpu.{shares,cfs_*}
1024     // cpu.shares (default value)
100000   // cpu.cfs_period_us (100ms period length)
100000   // cpu.cfs_quota_us (total CPU time in µs consumable per period)

cat /sys/fs/cgroup/memory/docker/8ab25..1c/memory.limit_in_bytes
209715200
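The same constraints expressed as a Kubernetes resources block, sketched for comparison: cpu: 1 maps to cfs_quota_us 100000 per 100000µs period, and memory: 200Mi maps to memory.limit_in_bytes 209715200 (200 × 2^20).

# Sketch: Kubernetes equivalent of "docker run --cpus 1 -m 200m".
resources:
  limits:
    cpu: 1
    memory: 200Mi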
27. LIMITS: PROBLEMS
1. CPU CFS Quota: Latency
2. Memory: accounting, OOM behavior
28. PROBLEMS: LATENCY
https://github.com/zalando-incubator/kubernetes-on-aws/pull/923
29. PROBLEMS: HARDCODED PERIOD
30. PROBLEMS: HARDCODED PERIOD
https://github.com/kubernetes/kubernetes/issues/51135
31. NOW IN KUBERNETES 1.12
https://github.com/kubernetes/kubernetes/pull/63437
32. OVERLY AGGRESSIVE CFS
Usage < limit, but heavy throttling
33. OVERLY AGGRESSIVE CFS: EXPERIMENT #1
CPU Period: 100ms
CPU Quota: None
Burn 5ms and sleep 500ms
⇒ Quota disabled
⇒ No throttling expected!

https://gist.github.com/bobrik/2030ff040fad360327a5fab7a09c4ff1
34. EXPERIMENT #1: NO QUOTA, NO THROTTLING
2018/11/03 13:04:02 [0] burn took 5ms, real time so far: 5ms, cpu time so far: 6ms
2018/11/03 13:04:03 [1] burn took 5ms, real time so far: 510ms, cpu time so far: 11ms
2018/11/03 13:04:03 [2] burn took 5ms, real time so far: 1015ms, cpu time so far: 17ms
2018/11/03 13:04:04 [3] burn took 5ms, real time so far: 1520ms, cpu time so far: 23ms
2018/11/03 13:04:04 [4] burn took 5ms, real time so far: 2025ms, cpu time so far: 29ms
2018/11/03 13:04:05 [5] burn took 5ms, real time so far: 2530ms, cpu time so far: 35ms
2018/11/03 13:04:05 [6] burn took 5ms, real time so far: 3036ms, cpu time so far: 40ms
2018/11/03 13:04:06 [7] burn took 5ms, real time so far: 3541ms, cpu time so far: 46ms
2018/11/03 13:04:06 [8] burn took 5ms, real time so far: 4046ms, cpu time so far: 52ms
2018/11/03 13:04:07 [9] burn took 5ms, real time so far: 4551ms, cpu time so far: 58ms
35. OVERLY AGGRESSIVE CFS: EXPERIMENT #2
CPU Period: 100ms
CPU Quota: 20ms
Burn 5ms and sleep 500ms
⇒ No 100ms period in which the 20ms quota could possibly be used up
⇒ No throttling expected!
36. EXPERIMENT #2: OVERLY AGGRESSIVE CFS
2018/11/03 13:05:05 [0] burn took 5ms, real time so far: 5ms, cpu time so far: 5ms
2018/11/03 13:05:06 [1] burn took 99ms, real time so far: 690ms, cpu time so far: 9ms
2018/11/03 13:05:06 [2] burn took 99ms, real time so far: 1290ms, cpu time so far: 14ms
2018/11/03 13:05:07 [3] burn took 99ms, real time so far: 1890ms, cpu time so far: 18ms
2018/11/03 13:05:07 [4] burn took 5ms, real time so far: 2395ms, cpu time so far: 24ms
2018/11/03 13:05:08 [5] burn took 94ms, real time so far: 2990ms, cpu time so far: 27ms
2018/11/03 13:05:09 [6] burn took 99ms, real time so far: 3590ms, cpu time so far: 32ms
2018/11/03 13:05:09 [7] burn took 5ms, real time so far: 4095ms, cpu time so far: 37ms
2018/11/03 13:05:10 [8] burn took 5ms, real time so far: 4600ms, cpu time so far: 43ms
2018/11/03 13:05:10 [9] burn took 5ms, real time so far: 5105ms, cpu time so far: 49ms
37. OVERLY AGGRESSIVE CFS: EXPERIMENT #3
CPU Period: 10ms
CPU Quota: 2ms
Burn 5ms and sleep 100ms
⇒ Same 20% CPU (200m) limit, but smaller period
⇒ Throttling expected!
38. SMALLER CPU PERIOD ⇒ BETTER LATENCY
2018/11/03 16:31:07 [0] burn took 18ms, real time so far: 18ms, cpu time so far: 6ms
2018/11/03 16:31:07 [1] burn took 9ms, real time so far: 128ms, cpu time so far: 8ms
2018/11/03 16:31:07 [2] burn took 9ms, real time so far: 238ms, cpu time so far: 13ms
2018/11/03 16:31:07 [3] burn took 5ms, real time so far: 343ms, cpu time so far: 18ms
2018/11/03 16:31:07 [4] burn took 30ms, real time so far: 488ms, cpu time so far: 24ms
2018/11/03 16:31:07 [5] burn took 19ms, real time so far: 608ms, cpu time so far: 29ms
2018/11/03 16:31:07 [6] burn took 9ms, real time so far: 718ms, cpu time so far: 34ms
2018/11/03 16:31:08 [7] burn took 5ms, real time so far: 824ms, cpu time so far: 40ms
2018/11/03 16:31:08 [8] burn took 5ms, real time so far: 943ms, cpu time so far: 45ms
2018/11/03 16:31:08 [9] burn took 9ms, real time so far: 1068ms, cpu time so far: 48ms
39. INCIDENT INVOLVING CPU THROTTLING
https://k8s.af
40. LIMITS: VISIBILITY

docker run --cpus 1 -m 200m --rm -it busybox top

Mem: 7369128K used, 726072K free, 128164K shrd, 303924K buff, 1208132K cached
CPU0: 14.8% usr  8.4% sys 0.2% nic 67.6% idle  8.2% io 0.0% irq 0.6% sirq
CPU1:  8.8% usr 10.3% sys 0.0% nic 75.9% idle  4.4% io 0.0% irq 0.4% sirq
CPU2:  7.3% usr  8.7% sys 0.0% nic 63.2% idle 20.1% io 0.0% irq 0.6% sirq
CPU3:  9.3% usr  9.9% sys 0.0% nic 65.7% idle 14.5% io 0.0% irq 0.4% sirq

⇒ top shows all host CPUs and the host's memory, not the container's limits.
41. LIMITS: VISIBILITY
● Container-aware memory configuration
○ JVM MaxHeap
● Container-aware processor configuration (see the sketch below)
○ Thread pools
○ GOMAXPROCS
○ node.js cluster module
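One way to make processes container-aware is the Kubernetes Downward API, which can expose a container's own resource limits as environment variables. A minimal sketch for Go's GOMAXPROCS:

# Sketch: expose the container's CPU limit via the Downward API so the
# runtime can size itself. With divisor "1", fractional CPU limits are
# rounded up to whole cores.
env:
  - name: GOMAXPROCS
    valueFrom:
      resourceFieldRef:
        resource: limits.cpu
        divisor: "1"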
42. KUBERNETES RESOURCES
43. ZALANDO: DECISION
1. Forbid memory overcommit
○ Implement mutating admission webhook
○ Set requests = limits (sketch below)
2. Disable CPU CFS quota in all clusters
○ --cpu-cfs-quota=false
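The effect on the resources block from slide 14, sketched: memory requests equal limits, and the CPU limit loses its meaning once CFS quota is off.

# Sketch: memory requests == limits (no memory overcommit); with
# --cpu-cfs-quota=false on the kubelet, CPU limits are not enforced,
# so only the CPU request (cpu.shares) matters.
resources:
  requests:
    cpu: 100m
    memory: 300Mi
  limits:
    memory: 300Mi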
44. INGRESS LATENCY IMPROVEMENT
45. CLUSTER AUTOSCALER
Simulates the Kubernetes scheduler internally to find out..
• ..if any of the pods wouldn't fit on existing nodes
⇒ upscale is needed
• ..if all pods of a node would fit on the other existing nodes
⇒ downscale is possible
⇒ Cluster size is determined by resource requests (+ constraints)

github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
46. AUTOSCALING BUFFER
• Cluster Autoscaler only triggers on Pending Pods
• Node provisioning is slow
⇒ Reserve extra capacity via low priority Pods
"Autoscaling Buffer Pods"
47. AUTOSCALING BUFFER

kubectl describe pod autoscaling-buffer-..zjq5 -n kube-system
...
Namespace:         kube-system
Priority:          -1000000     ⇐ evicted if a higher-priority (default) pod needs capacity
PriorityClassName: autoscaling-buffer
Containers:
  pause:
    Image: teapot/pause-amd64:3.1
    Requests:
      cpu:    1600m
      memory: 6871947673
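A sketch of how such a buffer could be declared; the resource values are taken from the pod above, while object names, labels, and replica count are assumed (use scheduling.k8s.io/v1beta1 on clusters before 1.14):

# Sketch: a negative-priority placeholder deployment. The pause container
# reserves capacity; when a normal-priority pod needs it, the buffer pod
# is preempted, goes Pending, and triggers the Cluster Autoscaler early.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: autoscaling-buffer
value: -1000000
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: autoscaling-buffer
  namespace: kube-system
spec:
  replicas: 1                    # assumed
  selector:
    matchLabels:
      app: autoscaling-buffer
  template:
    metadata:
      labels:
        app: autoscaling-buffer
    spec:
      priorityClassName: autoscaling-buffer
      containers:
        - name: pause
          image: teapot/pause-amd64:3.1
          resources:
            requests:
              cpu: 1600m
              memory: 6871947673   # bytes, as in the pod above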
48. ALLOCATABLE
Reserve resources for system components, Kubelet, and container runtime:

--system-reserved=cpu=100m,memory=164Mi
--kube-reserved=cpu=100m,memory=282Mi
49. SLACK
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests ⇒ "Slack"
[Diagram: node CPU/memory bars with the requested-but-unused share marked as slack]
50. STRANDED RESOURCES
Some available capacity can become unusable / stranded:
once one resource of a node is fully requested, the remaining capacity of the other cannot be allocated.
⇒ Reschedule, bin packing
[Diagram: Nodes 1 and 2 with free CPU stranded by exhausted memory]
51. MONITORING COST EFFICIENCY
52. KUBERNETES RESOURCE REPORT
github.com/hjacobs/kube-resource-report
53. RESOURCE REPORT: TEAMS
Sorting teams by slack costs

github.com/hjacobs/kube-resource-report
54. RESOURCE REPORT: APPLICATIONS
"Slack"
55. RESOURCE REPORT: APPLICATIONS
56. RESOURCE REPORT: CLUSTERS
"Slack"
github.com/hjacobs/kube-resource-report
57. RESOURCE REPORT METRICS
github.com/hjacobs/kube-resource-report
58. KUBERNETES APPLICATION DASHBOARD
59. https://github.com/hjacobs/kube-ops-view
60. Requested vs. used
https://github.com/hjacobs/kube-ops-view
61. OPTIMIZING COST EFFICIENCY
62. VERTICAL POD AUTOSCALER (VPA)
"Some 2/3 of the (Google) Borg users use autopilot."
- Tim Hockin
VPA: set resource requests automatically based on usage.
63. VPA FOR PROMETHEUS

apiVersion: autoscaling.k8s.io/v1beta2
kind: VerticalPodAutoscaler
metadata: ...
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: prometheus
  updatePolicy: {updateMode: Auto}
  resourcePolicy:
    containerPolicies:
      - containerName: prometheus
        minAllowed:
          memory: 512Mi
        maxAllowed:
          memory: 10Gi
64. VERTICAL POD AUTOSCALER
Limits/requests adapted by VPA
65. VERTICAL POD AUTOSCALER
66. HORIZONTAL POD AUTOSCALER

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 100   # target: ~100% of CPU requests
67. HORIZONTAL POD AUTOSCALING (CUSTOM METRICS)
Queue length, Ingress req/s, Prometheus query, ZMON check

github.com/zalando-incubator/kube-metrics-adapter
68. DOWNSCALING DURING OFF-HOURS
[Chart: replicas scaled down over the weekend]

github.com/hjacobs/kube-downscaler
69. DOWNSCALING DURING OFF-HOURS

DEFAULT_UPTIME="Mon-Fri 07:30-20:30 CET"

annotations:
  downscaler/exclude: "true"

github.com/hjacobs/kube-downscaler
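kube-downscaler also supports a per-resource schedule via annotation; a small sketch, reusing the schedule from the default above:

# Sketch: override the global uptime for a single deployment.
metadata:
  annotations:
    downscaler/uptime: "Mon-Fri 07:30-20:30 CET"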
70. ACCUMULATED WASTE
● Prototypes
● Personal test environments
● Trial runs
● Decommissioned services
● Learning/training deployments
Sounds familiar?
71. Example: Getting started with Zalenium & UI Tests
"Step by step guide to the first UI test with Zalenium running in the Continuous Delivery Platform. I was always afraid of UI tests because it looked too difficult to get started; Zalenium solved this problem for me."
72. HOUSEKEEPING
● Delete prototypes after X days
● Clean up temporary deployments
● Remove resources without owner
73. KUBERNETES JANITOR
● TTL and expiry date annotations, e.g.
○ set time-to-live for your test deployment
● Custom rules, e.g.
○ delete everything without "app" label after 7 days
github.com/hjacobs/kube-janitor
74. JANITOR TTL ANNOTATION
# let's try out nginx, but only for 1 hour
kubectl run nginx --image=nginx
kubectl annotate deploy nginx janitor/ttl=1h
github.com/hjacobs/kube-janitor
75. CUSTOM JANITOR RULES

# require "app" label for new pods starting April 2019
- id: require-app-label-april-2019
  resources:
    - deployments
    - statefulsets
  jmespath: "!(spec.template.metadata.labels.app) && metadata.creationTimestamp > '2019-04-01'"
  ttl: 7d

github.com/hjacobs/kube-janitor
76. EC2 SPOT NODES
72% savings
77. SPOT ASG / LAUNCH TEMPLATE
Not upstream in cluster-autoscaler (yet)
78. CLUSTER OVERHEAD: CONTROL PLANE
● GKE cluster: free
● EKS cluster: $146/month
● Zalando prod cluster: $635/month
(etcd nodes + master nodes + ELB)
Potential: fewer etcd nodes, no HA, shared control plane.
79. WHAT WORKED FOR US
● Disable CPU CFS Quota in all clusters
● Prevent memory overcommit
● Kubernetes Resource Report
● Downscaling during off-hours
● EC2 Spot
80. STABILITY ↔ EFFICIENCY
Stability side: Slack, Autoscaling Buffer, Disable Overcommit, Cluster Overhead
Efficiency side: Resource Report, HPA, VPA, Downscaler, Janitor, EC2 Spot
81. OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws
AWS ALB Ingress controller
github.com/zalando-incubator/kube-ingress-aws-controller
External DNS
github.com/kubernetes-incubator/external-dns
Postgres Operator
github.com/zalando/postgres-operator
Kubernetes Resource Report
github.com/hjacobs/kube-resource-report
Kubernetes Downscaler
github.com/hjacobs/kube-downscaler
Kubernetes Janitor
github.com/hjacobs/kube-janitor
82. OTHER TALKS/POSTS
• Everything You Ever Wanted to Know About Resource Scheduling
• Inside Kubernetes Resource Management (QoS) - KubeCon 2018
• Setting Resource Requests and Limits in Kubernetes (Best Practices)
• Effectively Managing Kubernetes Resources with Cost Monitoring
83. QUESTIONS?
HENNING JACOBS
HEAD OF DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k