Kubernetes Failure Stories
1. Kubernetes Failure Stories
MEETUP HAMBURG, 2019-02-11
HENNING JACOBS, @try_except_
2. ZALANDO AT A GLANCE
~ 4.5 billion EUR revenue 2017
> 200 million visits per month
> 15.000 employees in Europe
> 70% of visits via mobile devices
> 24 million active customers
> 300.000 product choices
~ 2.000 brands
17 countries
3. SCALE
373 Accounts
100 Clusters
4. DEVELOPERS USING KUBERNETES
5. 46+ cluster components
6. POSTGRES OPERATOR
Application to manage PostgreSQL clusters on Kubernetes
> 500 clusters running on Kubernetes
https://github.com/zalando-incubator/postgres-operator
7. INCIDENTS ARE FINE
8. INCIDENT #1
9. #1: LESS THAN 20% OF NODES AVAILABLE
NAME                          STATUS                     AGE  VERSION
ip-172-31-10-91...internal    NotReady                   4d   v1.7.4+coreos.0
ip-172-31-11-16...internal    NotReady                   4d   v1.7.4+coreos.0
ip-172-31-11-211...internal   Ready,SchedulingDisabled   5d   v1.7.4+coreos.0
ip-172-31-15-46...internal    Ready                      4d   v1.7.4+coreos.0
ip-172-31-18-123...internal   NotReady                   4d   v1.7.4+coreos.0
ip-172-31-19-46...internal    Ready                      4d   v1.7.4+coreos.0
ip-172-31-19-75...internal    NotReady                   4d   v1.7.4+coreos.0
ip-172-31-2-124...internal    NotReady                   4d   v1.7.4+coreos.0
ip-172-31-3-58...internal     Ready                      4d   v1.7.4+coreos.0
ip-172-31-5-211...internal    Ready                      4d   v1.7.4+coreos.0
ip-172-31-7-147...internal    Ready,SchedulingDisabled   5d   v1.7.4+coreos.0
10. TRAIL OF CLUES
• Recovered automatically after 15 minutes
• Nodes unhealthy at same time, recover at same time
• API server is behind AWS ELB
• Seems to happen to others, too
• Some report it happening ~every month
11. UPSTREAM ISSUE
⇒ Fixed in 1.8 (backported to 1.7.8)
https://github.com/kubernetes/kubernetes/issues/48638
12. INCIDENT #2
13. INCIDENT #2: CUSTOMER IMPACT
14. INCIDENT #2: IAM RETURNING 404
15. INCIDENT #2: NUMBER OF PODS
16. LIFE OF A REQUEST (INGRESS)
[Diagram: TLS traffic terminates at the ALB (EC2 network); the ALB forwards HTTP to Skipper on each Node (K8s network), and Skipper routes to the MyApp pods]
17. ROUTES FROM API SERVER
[Diagram: Skipper on each Node reads Ingress routes from the API Server; the ALB forwards traffic to Skipper, which routes to the MyApp pods]
18. API SERVER DOWN
[Diagram: the API Server is OOMKilled; Skipper on each Node can no longer fetch routes while the ALB keeps forwarding traffic]
19. INCIDENT #2: INNOCENT MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          concurrencyPolicy: Forbid       # misplaced: belongs under the CronJob spec, silently ignored here
          successfulJobsHistoryLimit: 1   # misplaced: belongs under the CronJob spec
          failedJobsHistoryLimit: 1       # misplaced: belongs under the CronJob spec
          containers:
          ...
20. INCIDENT #2: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        spec:
          restartPolicy: Never
          containers:
21. INCIDENT #2: CONTRIBUTING FACTORS
• Wrong CronJob manifest and no automatic job cleanup
• Reliance on Kubernetes API server availability
• Ingress routes not kept as-is in case of outage
• No quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"
22. INCIDENT #3
23. INCIDENT #3: INGRESS ERRORS
24. INCIDENT #3: COREDNS OOMKILL
coredns invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=994
Memory cgroup out of memory: Kill process 6428 (coredns) score 2050 or sacrifice child
oom_reaper: reaped process 6428 (coredns), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Graph: CoreDNS pod restarts]
25. STOP THE BLEEDING: INCREASE MEMORY LIMIT
[Graph: CoreDNS memory limit raised from 200Mi to 2Gi, then 4Gi]
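Such a stop-gap can be sketched as a resource patch (a sketch only, assuming the stock coredns Deployment in kube-system; container name and values are illustrative):

```yaml
# patch.yaml – raise the CoreDNS memory limit as a stop-gap.
# Apply with: kubectl -n kube-system patch deployment coredns --patch "$(cat patch.yaml)"
spec:
  template:
    spec:
      containers:
        - name: coredns
          resources:
            requests:
              memory: 200Mi
            limits:
              memory: 2Gi   # raised from 200Mi; later bumped again to 4Gi
```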
26. SPIKE IN HTTP REQUESTS
27. SPIKE IN DNS QUERIES
28. INCREASE IN MEMORY USAGE
29. INCIDENT #3: CONTRIBUTING FACTORS
• HTTP retries
• No DNS caching
• Kubernetes ndots:5 problem
• Short maximum lifetime of HTTP connections
• Fixed memory limit for CoreDNS
• Monitoring affected by DNS outage
github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
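The ndots:5 problem above can be mitigated per pod via the standard dnsConfig field; a minimal sketch (pod and image names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp              # hypothetical
spec:
  containers:
    - name: myapp
      image: myapp:latest  # hypothetical
  dnsConfig:
    options:
      - name: ndots
        value: "1"         # external FQDNs resolve directly instead of walking the cluster search domains first
```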
30. INCIDENT #4
31. #4: KERNEL OOM KILLER
⇒ all containers on this node down
32. INCIDENT #4: KUBELET MEMORY
33. UPSTREAM ISSUE REPORTED
https://github.com/kubernetes/kubernetes/issues/73587
34. INCIDENT #4: THE PATCH
https://github.com/kubernetes/kubernetes/issues/73587
35. INCIDENT #5
36. INCIDENT #5: IMPACT
Error during Pod creation:
MountVolume.SetUp failed for volume
"outfit-delivery-api-credentials" :
secrets "outfit-delivery-api-credentials" not found
⇒ All new Kubernetes deployments fail
37. INCIDENT #5: CREDENTIALS QUEUE
17:30:07 | [pool-6-thread-1] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1] | Current queue size: 7886, current number of active workers: 20
..
17:37:07 | [pool-6-thread-1] | Current queue size: 9686, current number of active workers: 20
..
17:44:07 | [pool-6-thread-1] | Current queue size: 11976, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1] | Current queue size: 58381, current number of active workers: 20
38. INCIDENT #5: CPU THROTTLING
39. INCIDENT #5: WHAT HAPPENED
Scaled down IAM provider to reduce Slack
+ Number of deployments increased
⇒ Process could not process credentials fast enough
40. SLACK
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests → Slack
[Diagram: Node with CPU and Memory capacity bars; the requested-but-unused share is labeled "Slack"]
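As a worked illustration of slack (all numbers are made up), the node reserves the full request even when the process uses far less:

```yaml
# Hypothetical container: requests are "blocked" on the node at scheduling time.
resources:
  requests:
    cpu: 500m     # reserved on the node
    memory: 1Gi   # reserved on the node
# If actual usage hovers around 100m CPU / 200Mi memory,
# roughly 400m CPU and 800Mi memory per replica sit idle as "slack".
```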
41. DISABLING CPU THROTTLING
kubelet … --cpu-cfs-quota=false
[Announcement] CPU limits will be disabled
⇒ Ingress Latency Improvements
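Besides the kubelet flag, the same effect can be had per workload by keeping CPU requests but omitting the CPU limit (a sketch; values are illustrative):

```yaml
resources:
  requests:
    cpu: 100m        # still counted for scheduling
    memory: 512Mi
  limits:
    memory: 512Mi    # keep a memory limit; with no cpu limit there is no CFS throttling
```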
42. MANAGED KUBERNETES?
43. WILL MANAGED K8S SAVE US?
GKE: monthly uptime percentage at 99.95% for regional clusters
44. WILL MANAGED K8S SAVE US?
NO
(not really)
e.g. AWS EKS uptime SLA is only for API server
45. PRODUCTION PROOFING AWS EKS
List of things you might want to look at for EKS in production
https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c
46. AWS EKS IN PRODUCTION
https://kubedex.com/90-days-of-aws-eks-in-production/
47. DOCKER.. (ON GKE)
https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29
48. WELCOME TO
CLOUD NATIVE!
51. KUBERNETES FAILURE STORIES
20 failure stories so far
What about yours?
github.com/hjacobs/kubernetes-failure-stories
52. QUESTIONS?
HENNING JACOBS
HEAD OF DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k