Kubernetes Failure Stories

1. Kubernetes Failure Stories
MEETUP HAMBURG, 2019-02-11
HENNING JACOBS @try_except_
2. ZALANDO AT A GLANCE
• ~ 4.5 billion EUR revenue 2017
• > 200 million visits per month
• > 70% of visits via mobile devices
• > 15.000 employees in Europe
• > 300.000 product choices
• ~ 2.000 brands
• 17 countries
• > 24 million active customers
3. SCALE: 373 Accounts, 100 Clusters
4. DEVELOPERS USING KUBERNETES
5. 46+ cluster components
6. POSTGRES OPERATOR
Application to manage PostgreSQL clusters on Kubernetes; >500 clusters running on Kubernetes
https://github.com/zalando-incubator/postgres-operator
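For context, the operator is driven by a "postgresql" custom resource; a minimal sketch based on the project's example manifests (cluster name, team id, and sizes here are illustrative):

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster      # illustrative cluster name
spec:
  teamId: "acid"                  # illustrative team id
  numberOfInstances: 2            # one master, one replica
  volume:
    size: 1Gi
  users:
    zalando:                      # database role to create
      - superuser
      - createdb
  databases:
    foo: zalando                  # database name: owning role
  postgresql:
    version: "11"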
7. INCIDENTS ARE FINE
8. INCIDENT #1
9. #1: LESS THAN 20% OF NODES AVAILABLE
NAME                           STATUS                     AGE   VERSION
ip-172-31-10-91...internal     NotReady                   4d    v1.7.4+coreos.0
ip-172-31-11-16...internal     NotReady                   4d    v1.7.4+coreos.0
ip-172-31-11-211...internal    Ready,SchedulingDisabled   5d    v1.7.4+coreos.0
ip-172-31-15-46...internal     Ready                      4d    v1.7.4+coreos.0
ip-172-31-18-123...internal    NotReady                   4d    v1.7.4+coreos.0
ip-172-31-19-46...internal     Ready                      4d    v1.7.4+coreos.0
ip-172-31-19-75...internal     NotReady                   4d    v1.7.4+coreos.0
ip-172-31-2-124...internal     NotReady                   4d    v1.7.4+coreos.0
ip-172-31-3-58...internal      Ready                      4d    v1.7.4+coreos.0
ip-172-31-5-211...internal     Ready                      4d    v1.7.4+coreos.0
ip-172-31-7-147...internal     Ready,SchedulingDisabled   5d    v1.7.4+coreos.0
10. TRAIL OF CLUES
• Recovered automatically after 15 minutes
• Nodes unhealthy at same time, recover at same time
• API server is behind AWS ELB
• Seems to happen to others, too
• Some report it happening ~every month
11. UPSTREAM ISSUE ⇒ Fixed in 1.8 (backported to 1.7.8)
https://github.com/kubernetes/kubernetes/issues/48638
12. INCIDENT #2
13. INCIDENT #2: CUSTOMER IMPACT
14. INCIDENT #2: IAM RETURNING 404
15. INCIDENT #2: NUMBER OF PODS
16. LIFE OF A REQUEST (INGRESS)
[Diagram: a request hits the TLS-terminating ALB in the EC2 network, then goes as HTTP into the K8s network, where Skipper on a node routes it to a MyApp pod]
17. ROUTES FROM API SERVER
[Diagram: Skipper on each node builds its routes from the API server and forwards ALB traffic to MyApp pods]
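Skipper builds its routing table from Ingress objects it watches via the API server. A minimal sketch of such an object (hostname and Service name are hypothetical; shown with the current networking.k8s.io/v1 API, which postdates this talk):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  rules:
    - host: myapp.example.org        # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp          # hypothetical Service
                port:
                  number: 80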
18. API SERVER DOWN
[Diagram: the API server is OOMKilled, so Skipper can no longer fetch routes from it]
19. INCIDENT #2: INNOCENT MANIFEST

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          concurrencyPolicy: Forbid            # CronJob-level fields nested in the wrong place:
          successfulJobsHistoryLimit: 1        # here they are silently ignored,
          failedJobsHistoryLimit: 1            # so failed jobs and pods pile up
          containers:
            ...
20. INCIDENT #2: FIXED CRON JOB

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        spec:
          restartPolicy: Never
          containers:
            ...
21. INCIDENT #2: CONTRIBUTING FACTORS
• Wrong CronJob manifest and no automatic job cleanup (see the sketch after this slide)
• Reliance on Kubernetes API server availability
• Ingress routes not kept as-is in case of outage
• No quota for number of pods:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"
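One way to get automatic job cleanup (not shown in the deck; it relies on the TTL-after-finished feature, which was still alpha around the time of this talk) is a TTL on finished Jobs. A minimal sketch with a hypothetical Job:

apiVersion: batch/v1
kind: Job
metadata:
  name: foobar-cleanup-example       # hypothetical name
spec:
  ttlSecondsAfterFinished: 3600      # delete the Job and its pods one hour after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: busybox             # illustrative image
          command: ["sh", "-c", "echo done"]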
22. INCIDENT #3
23. INCIDENT #3: INGRESS ERRORS
24. INCIDENT #3: COREDNS OOMKILL
coredns invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=994
Memory cgroup out of memory: Kill process 6428 (coredns) score 2050 or sacrifice child
oom_reaper: reaped process 6428 (coredns), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Graph: CoreDNS pod restarts]
25. STOP THE BLEEDING: INCREASE MEMORY LIMIT
[Graph: CoreDNS memory limit raised from 200Mi to 2Gi, then to 4Gi]
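In deployment terms this is just the memory limit on the CoreDNS container in its Deployment; a minimal sketch of the relevant fragment (the request value is illustrative):

resources:
  requests:
    memory: 100Mi      # illustrative request
  limits:
    memory: 4Gi        # limit raised from the original 200Mi, via 2Gi, to 4Gi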
26. SPIKE IN HTTP REQUESTS
27. SPIKE IN DNS QUERIES
28. INCREASE IN MEMORY USAGE
29. INCIDENT #3: CONTRIBUTING FACTORS
• HTTP retries
• No DNS caching
• Kubernetes ndots:5 problem (see the dnsConfig sketch after this slide)
• Short maximum lifetime of HTTP connections
• Fixed memory limit for CoreDNS
• Monitoring affected by DNS outage
github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
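The ndots:5 problem means every unqualified external lookup is first tried against all cluster search domains, multiplying the query volume hitting CoreDNS. One common mitigation (an assumption here, not taken from the slides) is to lower ndots per pod via dnsConfig:

apiVersion: v1
kind: Pod
metadata:
  name: myapp                        # hypothetical pod
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "1"                   # resolve names like api.example.com directly instead of trying search domains first
  containers:
    - name: myapp
      image: myapp:latest            # hypothetical image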
30. INCIDENT #4
31. #4: KERNEL OOM KILLER ⇒ all containers on this node down
32. INCIDENT #4: KUBELET MEMORY
33. UPSTREAM ISSUE REPORTED https://github.com/kubernetes/kubernetes/issues/73587
34. INCIDENT #4: THE PATCH https://github.com/kubernetes/kubernetes/issues/73587
35. INCIDENT #5
36. INCIDENT #5: IMPACT
Error during Pod creation:
MountVolume.SetUp failed for volume "outfit-delivery-api-credentials" : secrets "outfit-delivery-api-credentials" not found
⇒ All new Kubernetes deployments fail
37. INCIDENT #5: CREDENTIALS QUEUE
17:30:07 | [pool-6-thread-1 ] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1 ] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1 ] | Current queue size: 7886, current number of active workers: 20
..
17:37:07 | [pool-6-thread-1 ] | Current queue size: 9686, current number of active workers: 20
..
17:44:07 | [pool-6-thread-1 ] | Current queue size: 11976, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1 ] | Current queue size: 58381, current number of active workers: 20
38. INCIDENT #5: CPU THROTTLING
39. INCIDENT #5: WHAT HAPPENED
Scaled down the IAM provider to reduce slack
+ number of deployments increased
⇒ the provider could not process credentials fast enough
40. SLACK
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests → slack
[Diagram: node capacity split into CPU/memory requests, actual usage, and the unused "slack" in between]
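To illustrate with made-up numbers: a container that requests far more than it actually uses reserves the difference as slack on its node:

resources:
  requests:
    cpu: 500m          # reserved on the node for scheduling
    memory: 1Gi
  limits:
    cpu: "2"
    memory: 1Gi
# If the pod actually uses ~50m CPU and ~200Mi memory,
# the remaining ~450m CPU and ~800Mi memory are "slack":
# blocked for scheduling, but sitting idle.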
41. DISABLING CPU THROTTLING
kubelet … --cpu-cfs-quota=false
[Announcement] CPU limits will be disabled ⇒ Ingress Latency Improvements
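The same switch can also be set through the kubelet configuration file instead of a command-line flag; a minimal sketch with all other fields omitted:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuCFSQuota: false     # equivalent to --cpu-cfs-quota=false: CPU limits no longer enforced via CFS quota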
42. MANAGED KUBERNETES?
43. WILL MANAGED K8S SAVE US?
GKE: monthly uptime percentage at 99.95% for regional clusters
44. WILL MANAGED K8S SAVE US? NO (not really)
e.g. AWS EKS uptime SLA is only for the API server
45. PRODUCTION PROOFING AWS EKS
List of things you might want to look at for EKS in production
https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c
46. AWS EKS IN PRODUCTION
https://kubedex.com/90-days-of-aws-eks-in-production/
47. DOCKER.. (ON GKE)
https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29
48. WELCOME TO CLOUD NATIVE!
51. KUBERNETES FAILURE STORIES
20 failure stories so far. What about yours?
github.com/hjacobs/kubernetes-failure-stories
52. QUESTIONS?
HENNING JACOBS, HEAD OF DEVELOPER PRODUCTIVITY
henning@zalando.de, @try_except_
Illustrations by @01k
