Kubernetes Failure Stories

1. Kubernetes Failure Stories HENNING JACOBS @try_except_
2.
3.
4. ZALANDO AT A GLANCE
~ 5.4 billion EUR revenue 2018
> 15.000 employees in Europe
> 79% of visits via mobile devices
> 250 million visits per month
> 300.000 product choices
~ 2.000 brands
17 countries
> 26 million active customers
5. SCALE: 380 Accounts, 118 Clusters
6. DEVELOPERS USING KUBERNETES 6
7. 47+ cluster components 7
8. INCIDENTS ARE FINE
9. INCIDENT #1
10. INCIDENT #1: CUSTOMER IMPACT 10
11. INCIDENT #1: CUSTOMER IMPACT 11
12. INCIDENT #1: INGRESS ERRORS 12
13. INCIDENT #1: AWS ALB 502 13 github.com/zalando/riptide
14. INCIDENT #1: AWS ALB 502 502 Bad Gateway Server: awselb/2.0 ... 14 github.com/zalando/riptide
15. INCIDENT #1: ALB HEALTHY HOST COUNT
[chart: healthy host count drops from 3 healthy hosts to zero; 2xx requests plummet]
16. LIFE OF A REQUEST (INGRESS)
[diagram: TLS → ALB (EC2 network) → Skipper on each Node → HTTP → MyApp (K8s network)]
17. INCIDENT #1: SKIPPER MEMORY USAGE
[chart: Skipper memory usage climbing into the memory limit]
18. INCIDENT #1: SKIPPER OOM
[diagram: same request path TLS → ALB → Skipper → HTTP → MyApp; the Skipper pods on the nodes are OOMKilled]
19. INCIDENT #1: CONTRIBUTING FACTORS
• Shared Ingress (per cluster)
• High latency of an unrelated app (Solr) caused a high number of in-flight requests
• Skipper creates a goroutine per HTTP request; each goroutine costs 2kB of memory plus the http.Request
• Memory limit was fixed at 500Mi (4x regular usage)
Fix for the memory issue in Skipper: https://opensource.zalando.com/skipper/operation/operation/#scheduler
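A back-of-envelope check of these numbers (the slide only gives the ~2kB goroutine cost; the per-request buffer overhead below is an assumption for illustration):

```python
# Rough arithmetic for the Skipper OOM: how many slow in-flight requests
# fit into the fixed 500Mi limit. The http.Request overhead is assumed.
goroutine_bytes = 2 * 1024            # ~2 kB goroutine stack (from the slide)
request_bytes = 8 * 1024              # assumed http.Request + buffer overhead
limit_bytes = 500 * 1024 * 1024       # fixed 500Mi memory limit

in_flight_to_oom = limit_bytes // (goroutine_bytes + request_bytes)
print(in_flight_to_oom)  # 51200 in-flight requests exhaust the limit
```

With a slow upstream like the Solr app holding connections open, tens of thousands of in-flight requests accumulate quickly, which is why the fixed limit was reachable at all.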
20. INCIDENT #2
21. INCIDENT #2: CUSTOMER IMPACT 21
22. INCIDENT #2: IAM RETURNING 404 22
23. INCIDENT #2: NUMBER OF PODS 23
24. LIFE OF A REQUEST (INGRESS)
[diagram: TLS → ALB (EC2 network) → Skipper on each Node → HTTP → MyApp (K8s network)]
25. ROUTES FROM API SERVER
[diagram: Skipper on each Node pulls its Ingress routes from the API Server]
26. API SERVER DOWN
[diagram: API Server OOMKilled; Skipper on the nodes loses its source of routes]
27. INCIDENT #2: INNOCENT MANIFEST

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          containers:
          ...
28. INCIDENT #2: FIXED CRON JOB

apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        spec:
          restartPolicy: Never
          containers:
          ...
29. INCIDENT #2: LESSONS LEARNED
• Fix Ingress to stay "healthy" during API server problems
• Fix Ingress to retain the last known set of routes
• Use a quota for the number of pods

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"

NOTE: we dropped quotas recently
github.com/zalando-incubator/kubernetes-on-aws/pull/2059
30. INCIDENT #3
31. INCIDENT #3: INGRESS ERRORS 31
32. INCIDENT #3: COREDNS OOMKILL

coredns invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=994
Memory cgroup out of memory: Kill process 6428 (coredns) score 2050 or sacrifice child
oom_reaper: reaped process 6428 (coredns), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

[chart: CoreDNS pod restarts]
33. STOP THE BLEEDING: INCREASE MEMORY LIMIT
[chart: CoreDNS memory limit raised from 200Mi to 2Gi, then 4Gi]
34. SPIKE IN HTTP REQUESTS 34
35. SPIKE IN DNS QUERIES 35
36. INCREASE IN MEMORY USAGE 36
37. INCIDENT #3: CONTRIBUTING FACTORS
• HTTP retries
• No DNS caching
• Kubernetes ndots:5 problem
• Short maximum lifetime of HTTP connections
• Fixed memory limit for CoreDNS
• Monitoring affected by the DNS outage
github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
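The ndots:5 problem in this list can be illustrated with a small sketch of the resolver's search-list expansion (simplified: real resolvers also issue separate A and AAAA queries, roughly doubling the counts below):

```python
def expand_queries(name, search_domains, ndots=5):
    """Simplified sketch of resolv.conf search-list expansion."""
    if name.endswith("."):
        return [name]  # absolute name: exactly one query
    if name.count(".") >= ndots:
        # enough dots: try the name as-is first, then the search list
        return [name + "."] + [f"{name}.{d}." for d in search_domains]
    # fewer dots than ndots: walk the whole search list before trying as-is
    return [f"{name}.{d}." for d in search_domains] + [name + "."]

# Typical in-pod search list (assumed defaults, namespace "default")
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(len(expand_queries("example.com", search)))  # 4 lookups per external name
```

An external name like example.com has only one dot, so every lookup first walks three cluster-internal search domains (all NXDOMAIN) before resolving, multiplying the load on CoreDNS.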
38. INCIDENT #4
39. INCIDENT #4: CLUSTER DOWN 39
40. INCIDENT #4: MANUAL OPERATION % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix 40
41. INCIDENT #4: RTFM % etcdctl del -r /registry-kube-1/certificatesigningrequest prefix help: etcdctl del [options] <key> [range_end] 41
42. Junior Engineers are Features, not Bugs
https://www.youtube.com/watch?v=cQta4G3ge44
43. https://www.outcome-eng.com/human-error-never-root-cause/
44. INCIDENT #4: LESSONS LEARNED • Disaster Recovery Plan? • Backup etcd to S3 • Monitor the snapshots 44
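The "backup etcd to S3" lesson could be automated with a CronJob along these lines (a hedged sketch: the image, bucket name, and schedule are invented for illustration and not taken from the deck):

```yaml
# Hypothetical nightly etcd snapshot shipped to S3.
# Image, bucket, and endpoint are assumptions, not Zalando's actual setup.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: etcd-backup
spec:
  schedule: "0 3 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      activeDeadlineSeconds: 300
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: backup
            image: my-registry/etcd-backup:latest   # hypothetical image with etcdctl + aws CLI
            command:
            - /bin/sh
            - -c
            - |
              ETCDCTL_API=3 etcdctl snapshot save /tmp/snap.db
              aws s3 cp /tmp/snap.db "s3://my-etcd-backups/$(date +%F).db"
```

Monitoring the snapshots (the third bullet) then reduces to alerting when the newest object in the bucket is too old.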
45. INCIDENT #5
46. INCIDENT #5: API LATENCY SPIKES 46
47. INCIDENT #5: CONNECTION ISSUES
[diagram: Master Node running API Server and etcd/etcd-member]
"... Kubernetes worker and master nodes sporadically fail to connect to etcd, causing timeouts in the API server and disconnects in the pod network. ..."
48. INCIDENT #5: STOP THE BLEEDING

#!/bin/bash
while true; do
    echo "sleep for 60 seconds"
    sleep 60
    timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
    if [ $? -eq 0 ]; then
        echo "all fine, no need to restart etcd member"
        continue
    else
        echo "restarting etcd-member"
        systemctl restart etcd-member
    fi
done
49. INCIDENT #5: CONFIRMATION FROM AWS [...] We can’t go into the details [...] that resulted the networking problems during the “non-intrusive maintenance”, as it relates to internal workings of EC2. We can confirm this only affected the T2 instance types, ... [...] We don’t explicitly recommend against running production services on T2 [...] 49
50. INCIDENT #5: LESSONS LEARNED
• It's never the AWS infrastructure, until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable
51. INCIDENT #6
52. INCIDENT #6: IMPACT Ingress 5XXs 52
53. INCIDENT #6: CLUSTER DOWN? 53
54. INCIDENT #6: THE TRIGGER 54
55. https://www.outcome-eng.com/human-error-never-root-cause/
56. CLUSTER UPGRADE FLOW 56
57. CLUSTER LIFECYCLE MANAGER (CLM) github.com/zalando-incubator/cluster-lifecycle-manager 57
58. CLUSTER CHANNELS

Channel   Clusters   Description
dev       3          Development and playground clusters.
alpha     2          Main infrastructure clusters (important to us).
beta      57+        Product clusters for the rest of the organization (non-prod).
stable    57+        Product clusters for the rest of the organization (prod).

github.com/zalando-incubator/kubernetes-on-aws
59. E2E TESTS ON EVERY PR github.com/zalando-incubator/kubernetes-on-aws 59
60. RUNNING E2E TESTS (BEFORE)
Testing dev to alpha upgrade, branch: dev
[diagram: Create Cluster → Run e2e tests → Delete Cluster]
61. RUNNING E2E TESTS (NOW)
Testing dev to alpha upgrade, branch: alpha (base), branch: dev (head)
[diagram: Create Cluster (base config) → Update Cluster (head config) → Run e2e tests → Delete Cluster]
62. INCIDENT #6: LESSONS LEARNED
• Automated e2e tests are pretty good, but not enough
• Test the diff/migration automatically:
  • Bootstrap new cluster with previous configuration
  • Apply new configuration
  • Run end-to-end & conformance tests
github.com/zalando-incubator/kubernetes-on-aws/tree/dev/test/e2e
63. INCIDENT #7
64. INCIDENT #7: KERNEL OOM KILLER ⇒ all containers on this node down 64
65. INCIDENT #7: KUBELET MEMORY 65
66. UPSTREAM ISSUE REPORTED https://github.com/kubernetes/kubernetes/issues/73587 66
67. INCIDENT #7: THE PATCH 67 https://github.com/kubernetes/kubernetes/issues/73587
68. INCIDENT #8
69. INCIDENT #8: IMPACT Error during Pod creation: MountVolume.SetUp failed for volume "outfit-delivery-api-credentials" : secrets "outfit-delivery-api-credentials" not found ⇒ All new Kubernetes deployments fail 69
70. INCIDENT #8: CREDENTIALS QUEUE

17:30:07 | [pool-6-thread-1] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1] | Current queue size: 7886, current number of active workers: 20
...
17:37:07 | [pool-6-thread-1] | Current queue size: 9686, current number of active workers: 20
...
17:44:07 | [pool-6-thread-1] | Current queue size: 11976, current number of active workers: 20
...
19:16:07 | [pool-6-thread-1] | Current queue size: 58381, current number of active workers: 20
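From the first and last entries of this log, the net backlog growth can be estimated with simple arithmetic on the timestamps and queue sizes:

```python
# Queue sizes and wall-clock minutes taken from the log above.
t0, q0 = 17 * 60 + 30, 7115      # 17:30:07
t1, q1 = 19 * 60 + 16, 58381     # 19:16:07

growth_per_min = (q1 - q0) / (t1 - t0)
print(round(growth_per_min))  # ~484 items of net backlog per minute
```

With 20 workers never catching up, the queue only grows; this is the "could not process credentials fast enough" failure mode made concrete.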
71. INCIDENT #8: CPU THROTTLING 71
72. INCIDENT #8: WHAT HAPPENED
Scaled down the IAM provider to reduce slack
+ number of deployments increased
⇒ the provider could not process credentials fast enough
73. SLACK
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests → Slack
[diagram: Node with CPU and Memory bars; the unused requested portion is labeled "Slack"]
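The slack definition above as a tiny sketch (the pod numbers are made up for illustration):

```python
# Slack = requested resources minus actual usage, summed per node.
# Toy numbers in CPU millicores; these pods are hypothetical.
pods = [
    {"cpu_request_m": 1000, "cpu_used_m": 200},  # requests 1 core, uses 0.2
    {"cpu_request_m": 500,  "cpu_used_m": 400},
]

slack_m = sum(p["cpu_request_m"] - p["cpu_used_m"] for p in pods)
print(slack_m)  # 900 millicores blocked on the node but unused
```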
74. DISABLING CPU THROTTLING kubelet … --cpu-cfs-quota=false [Announcement] CPU limits will be disabled ⇒ Ingress Latency Improvements 74
75. A MILLION WAYS TO CRASH YOUR CLUSTER?
• Switch to latest Docker to fix issues with the Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to node-local dnsmasq+CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts
• 502's during cluster updates: race condition during network setup
76. MORE TOPICS
• Graceful Pod shutdown and race conditions (endpoints, Ingress)
• Incompatible Kubernetes changes
• CoreOS ContainerLinux "stable" won't boot
• Kubernetes EBS volume handling
• Docker
77. RACE CONDITIONS..
• Switch to the latest Docker version available to fix the issues with the Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts
• 502's during cluster updates: race condition
github.com/zalando-incubator/kubernetes-on-aws
78. TIMEOUTS TO API SERVER.. github.com/zalando-incubator/kubernetes-on-aws 78
79. MANAGED KUBERNETES? 79
80. WILL MANAGED K8S SAVE US? GKE: monthly uptime percentage at 99.95% for regional clusters 80
81. WILL MANAGED K8S SAVE US? NO (not really) e.g. AWS EKS uptime SLA is only for API server 81
82. PRODUCTION PROOFING AWS EKS
List of things you might want to look at for EKS in production:
https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c
83. AWS EKS IN PRODUCTION https://kubedex.com/90-days-of-aws-eks-in-production/ 83
84. DOCKER.. (ON GKE)
https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29
85. WELCOME TO CLOUD NATIVE!
86. 86
87. KUBERNETES FAILURE STORIES A compiled list of links to public failure stories related to Kubernetes. k8s.af We need more failure talks! 87 Istio? Anyone?
88. OPEN SOURCE
Kubernetes on AWS: github.com/zalando-incubator/kubernetes-on-aws
AWS ALB Ingress controller: github.com/zalando-incubator/kube-ingress-aws-controller
Skipper HTTP Router & Ingress controller: github.com/zalando/skipper
External DNS: github.com/kubernetes-incubator/external-dns
Postgres Operator: github.com/zalando-incubator/postgres-operator
Kubernetes Resource Report: github.com/hjacobs/kube-resource-report
Kubernetes Downscaler: github.com/hjacobs/kube-downscaler
89. QUESTIONS? HENNING JACOBS HEAD OF DEVELOPER PRODUCTIVITY henning@zalando.de @try_except_ Illustrations by @01k
