Kubernetes Failure Stories
1. Kubernetes
Failure Stories
HENNING JACOBS
@try_except_
2.
3.
4. ZALANDO AT A GLANCE
~ 5.4 billion EUR revenue 2018
> 250 million visits per month
> 15.000 employees in Europe
> 79% of visits via mobile devices
> 300.000 product choices
~ 2.000 brands
17 countries
> 26 million active customers
5. SCALE
380 Accounts
118 Clusters
6. DEVELOPERS USING KUBERNETES
7. 47+ cluster components
8. INCIDENTS ARE FINE
9. INCIDENT #1
10. INCIDENT #1: CUSTOMER IMPACT
11. INCIDENT #1: CUSTOMER IMPACT
12. INCIDENT #1: INGRESS ERRORS
13. INCIDENT #1: AWS ALB 502
github.com/zalando/riptide
14. INCIDENT #1: AWS ALB 502
502 Bad Gateway
Server: awselb/2.0
...
github.com/zalando/riptide
15. INCIDENT #1: ALB HEALTHY HOST COUNT
Chart: ALB healthy host count dropping from 3 healthy hosts to zero, shown alongside the rate of 2xx requests.
16. LIFE OF A REQUEST (INGRESS)
Diagram: TLS → ALB (EC2 network) → HTTP → Skipper on each Node (K8s network) → MyApp pods.
17. INCIDENT #1: SKIPPER MEMORY USAGE
Chart: Skipper memory usage climbing towards the configured memory limit.
18. INCIDENT #1: SKIPPER OOM
Diagram: same path as above (TLS → ALB → HTTP → Skipper → MyApp), but the Skipper instances on the nodes are OOMKilled.
19. INCIDENT #1: CONTRIBUTING FACTORS
• Shared Ingress (per cluster)
• High latency of an unrelated app (Solr) caused a high number of in-flight requests
• Skipper creates a goroutine per HTTP request; each goroutine costs 2kB of memory plus the http.Request
• Memory limit was fixed at 500Mi (4x regular usage), see the sketch below
Fix for the memory issue in Skipper:
https://opensource.zalando.com/skipper/operation/operation/#scheduler
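To make the last factor concrete: a memory limit like this is just a fixed value on the Ingress proxy's container. A minimal Deployment sketch, where only the 500Mi limit and the 4x-usage ratio come from the slide; the name, image, replica count, and request value are illustrative assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: skipper-ingress          # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      application: skipper-ingress
  template:
    metadata:
      labels:
        application: skipper-ingress
    spec:
      containers:
        - name: skipper
          image: skipper:latest  # placeholder image
          resources:
            requests:
              memory: 125Mi      # "regular usage" implied by the slide (the limit was 4x this)
            limits:
              memory: 500Mi      # the fixed limit that was hit once in-flight requests piled up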
20. INCIDENT #2
21. INCIDENT #2: CUSTOMER IMPACT
22. INCIDENT #2: IAM RETURNING 404
23. INCIDENT #2: NUMBER OF PODS
24. LIFE OF A REQUEST (INGRESS)
Diagram: TLS → ALB (EC2 network) → HTTP → Skipper on each Node (K8s network) → MyApp pods.
25. ROUTES FROM API SERVER
Diagram: the Skipper instances on each node fetch their routes from the API Server; the ALB forwards traffic to Skipper, which routes to the MyApp pods.
26. API SERVER DOWN
Diagram: the API Server is OOMKilled; the Skipper instances on the nodes can no longer fetch routes.
27. INCIDENT #2: INNOCENT MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          # these three fields belong on the CronJob spec;
          # nested down here they are silently ignored
          concurrencyPolicy: Forbid
          successfulJobsHistoryLimit: 1
          failedJobsHistoryLimit: 1
          containers:
            ...
28. INCIDENT #2: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "7 8-18 * * Mon-Fri"   # once per hour instead of every 15 minutes
  concurrencyPolicy: Forbid         # now at the correct level, so they take effect
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120    # kill hanging jobs after 2 minutes
      template:
        spec:
          restartPolicy: Never
          containers:
            ...
29. INCIDENT #2: LESSONS LEARNED
• Fix Ingress to stay “healthy” during API server problems
• Fix Ingress to retain last known set of routes
• Use quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"
NOTE: we dropped quotas recently
github.com/zalando-incubator/kubernetes-on-aws/pull/2059
30. INCIDENT #3
31. INCIDENT #3: INGRESS ERRORS
32. INCIDENT #3: COREDNS OOMKILL
coredns invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=994
Memory cgroup out of memory: Kill process 6428 (coredns) score 2050 or sacrifice child
oom_reaper: reaped process 6428 (coredns), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Chart: CoreDNS container restarts.
33. STOP THE BLEEDING: INCREASE MEMORY LIMIT
Chart: memory limit raised in steps, 200Mi → 2Gi → 4Gi.
34. SPIKE IN HTTP REQUESTS
35. SPIKE IN DNS QUERIES
36. INCREASE IN MEMORY USAGE
37. INCIDENT #3: CONTRIBUTING FACTORS
• HTTP retries
• No DNS caching
• Kubernetes ndots:5 problem (see the dnsConfig sketch below)
• Short maximum lifetime of HTTP connections
• Fixed memory limit for CoreDNS
• Monitoring affected by DNS outage
github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
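The ndots:5 problem: with the default resolv.conf generated for pods, any name with fewer than five dots is tried against every search domain before the literal name, multiplying DNS query volume. One standard per-Pod mitigation is the dnsConfig field; a minimal sketch, not necessarily the fix applied in the postmortem, with placeholder pod and image names:

apiVersion: v1
kind: Pod
metadata:
  name: myapp                      # placeholder
spec:
  containers:
    - name: myapp
      image: myapp:latest          # placeholder
  dnsConfig:
    options:
      - name: ndots
        value: "1"                 # resolve external names directly instead of walking all search domains first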
38. INCIDENT #4
39. INCIDENT #4: CLUSTER DOWN
40. INCIDENT #4: MANUAL OPERATION
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
41. INCIDENT #4: RTFM
% etcdctl del -r /registry-kube-1/certificatesigningrequest prefix
help: etcdctl del [options] <key> [range_end]
42.
Junior Engineers are Features, not Bugs
https://www.youtube.com/watch?v=cQta4G3ge44
43. https://www.outcome-eng.com/human-error-never-root-cause/
44. INCIDENT #4: LESSONS LEARNED
• Disaster Recovery Plan?
• Backup etcd to S3 (see the sketch below)
• Monitor the snapshots
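One possible shape for "backup etcd to S3", sketched as an in-cluster CronJob; the name, image, schedule, etcd endpoint, and S3 bucket are all assumptions, and in practice this can just as well run as a systemd timer on the etcd nodes:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup                     # hypothetical
spec:
  schedule: "0 * * * *"                 # hourly snapshot
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      activeDeadlineSeconds: 300
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: etcd-backup:latest # placeholder: needs etcdctl and the AWS CLI
              command:
                - /bin/sh
                - -c
                - |
                  ETCDCTL_API=3 etcdctl --endpoints=https://etcd.example.internal:2379 \
                    snapshot save /tmp/etcd-snapshot.db
                  aws s3 cp /tmp/etcd-snapshot.db \
                    s3://my-etcd-backups/etcd-$(date +%Y%m%d%H%M).db
                  # alerting on the age of the newest object covers "monitor the snapshots"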
45. INCIDENT #5
46. INCIDENT #5: API LATENCY SPIKES
47. INCIDENT #5: CONNECTION ISSUES
Diagram: Master Node running the API Server, connected to etcd (etcd-member).
"... Kubernetes worker and master nodes sporadically fail to connect to etcd,
causing timeouts in the API server and disconnects in the pod network. ..."
48. INCIDENT #5: STOP THE BLEEDING
#!/bin/bash
# Workaround: periodically probe the API server via the local endpoint
# and restart the etcd-member service if the request fails.
while true; do
    echo "sleep for 60 seconds"
    sleep 60
    timeout 5 curl http://localhost:8080/api/v1/nodes > /dev/null
    if [ $? -eq 0 ]; then
        echo "all fine, no need to restart etcd member"
        continue
    else
        echo "restarting etcd-member"
        systemctl restart etcd-member
    fi
done
49. INCIDENT #5: CONFIRMATION FROM AWS
[...]
We can’t go into the details [...] that resulted the networking problems during
the “non-intrusive maintenance”, as it relates to internal workings of EC2.
We can confirm this only affected the T2 instance types, ...
[...]
We don’t explicitly recommend against running production services on T2
[...]
50. INCIDENT #5: LESSONS LEARNED
• It's never the AWS infrastructure until it is
• Treat t2 instances with care
• Kubernetes components are not necessarily "cloud native"
Cloud Native? Declarative, dynamic, resilient, and scalable
51. INCIDENT #6
52. INCIDENT #6: IMPACT
Chart: Ingress 5XXs.
53. INCIDENT #6: CLUSTER DOWN?
54. INCIDENT #6: THE TRIGGER
55. https://www.outcome-eng.com/human-error-never-root-cause/
56. CLUSTER UPGRADE FLOW
57. CLUSTER LIFECYCLE MANAGER (CLM)
github.com/zalando-incubator/cluster-lifecycle-manager
58. CLUSTER CHANNELS
Channel  Description                                                    Clusters
dev      Development and playground clusters                            3
alpha    Main infrastructure clusters (important to us)                 2
beta     Product clusters for the rest of the organization (non-prod)   57+
stable   Product clusters for the rest of the organization (prod)       57+
github.com/zalando-incubator/kubernetes-on-aws
59. E2E TESTS ON EVERY PR
github.com/zalando-incubator/kubernetes-on-aws
60. RUNNING E2E TESTS (BEFORE)
Testing dev to alpha upgrade, branch: dev
Diagram: Create Cluster (control plane + nodes) → Run e2e tests → Delete Cluster
61. RUNNING E2E TESTS (NOW)
Testing dev to alpha upgrade, branch: alpha (base), branch: dev (head)
Diagram: Create Cluster from the base branch → Update Cluster to the head branch → Run e2e tests → Delete Cluster
62. INCIDENT #6: LESSONS LEARNED
• Automated e2e tests are pretty good, but not enough
• Test the diff/migration automatically
• Bootstrap new cluster with previous configuration
• Apply new configuration
• Run end-to-end & conformance tests
github.com/zalando-incubator/kubernetes-on-aws/tree/dev/test/e2e
63. INCIDENT #7
64. INCIDENT #7: KERNEL OOM KILLER
⇒ all containers
on this node down
65. INCIDENT #7: KUBELET MEMORY
66. UPSTREAM ISSUE REPORTED
https://github.com/kubernetes/kubernetes/issues/73587
67. INCIDENT #7: THE PATCH
https://github.com/kubernetes/kubernetes/issues/73587
68. INCIDENT #8
69. INCIDENT #8: IMPACT
Error during Pod creation:
MountVolume.SetUp failed for volume "outfit-delivery-api-credentials":
secrets "outfit-delivery-api-credentials" not found
⇒ All new Kubernetes deployments fail
70. INCIDENT #8: CREDENTIALS QUEUE
17:30:07 | [pool-6-thread-1] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1] | Current queue size: 7886, current number of active workers: 20
..
17:37:07 | [pool-6-thread-1] | Current queue size: 9686, current number of active workers: 20
..
17:44:07 | [pool-6-thread-1] | Current queue size: 11976, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1] | Current queue size: 58381, current number of active workers: 20
71. INCIDENT #8: CPU THROTTLING
72. INCIDENT #8: WHAT HAPPENED
Scaled down the IAM provider to reduce Slack
+ the number of deployments increased
⇒ the provider could not process credentials fast enough
73. SLACK
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests → Slack
Diagram: node CPU and memory, with the unused reserved portion marked as "Slack".
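To make "slack" concrete: a minimal sketch of a Pod whose requests are far above its typical usage; all names and numbers are illustrative assumptions, not from the incident:

apiVersion: v1
kind: Pod
metadata:
  name: myapp                      # placeholder
spec:
  containers:
    - name: myapp
      image: myapp:latest          # placeholder
      resources:
        requests:
          cpu: "1"                 # the scheduler reserves a full core on the node...
          memory: 1Gi              # ...and 1Gi of memory
        limits:
          cpu: "2"
          memory: 1Gi
# If actual usage is around 100m CPU / 200Mi, the remaining ~900m / ~800Mi is "slack"
# that stays blocked on the node.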
74. DISABLING CPU THROTTLING
kubelet … --cpu-cfs-quota=false
[Announcement] CPU limits will be disabled
⇒ Ingress Latency Improvements
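The flag above can also be expressed in the kubelet configuration file; a minimal sketch of the equivalent KubeletConfiguration, assuming kubelet config files are used in the setup:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuCFSQuota: false   # same effect as --cpu-cfs-quota=false: CPU limits are kept but not CFS-throttled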
75. A MILLION WAYS TO CRASH YOUR CLUSTER?
• Switch to latest Docker to fix issues with Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s),
switch from kube-dns to node-local dnsmasq+CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy: client-go still seems to have issues with timeouts
• 502's during cluster updates: race condition during network setup
76. MORE TOPICS
• Graceful Pod shutdown and race conditions (endpoints, Ingress)
• Incompatible Kubernetes changes
• CoreOS ContainerLinux "stable" won't boot
• Kubernetes EBS volume handling
• Docker
77. RACE CONDITIONS..
• Switch to the latest Docker version available to fix the issues with Docker daemon freezing
• Redesign of DNS setup due to high DNS latencies (5s), switch from kube-dns to CoreDNS
• Disabling CPU throttling (CFS quota) to avoid latency issues
• Quick fix for timeouts using etcd-proxy, since client-go still seems to have issues with timeouts
• 502's during cluster updates: race condition
github.com/zalando-incubator/kubernetes-on-aws
78. TIMEOUTS TO API SERVER..
github.com/zalando-incubator/kubernetes-on-aws
79. MANAGED KUBERNETES?
80. WILL MANAGED K8S SAVE US?
GKE: monthly uptime percentage at 99.95% for regional clusters
81. WILL MANAGED K8S SAVE US?
NO
(not really)
e.g. AWS EKS uptime SLA is only for API server
82. PRODUCTION PROOFING AWS EKS
List of things you might want to look at for EKS in production:
https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c
83. AWS EKS IN PRODUCTION
https://kubedex.com/90-days-of-aws-eks-in-production/
84. DOCKER.. (ON GKE)
https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29
85. WELCOME TO
CLOUD NATIVE!
86.
87. KUBERNETES FAILURE STORIES
A compiled list of links to public failure stories related to Kubernetes.
k8s.af
We need more failure talks!
Istio? Anyone?
88. OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws
AWS ALB Ingress controller
github.com/zalando-incubator/kube-ingress-aws-controller
Skipper HTTP Router & Ingress controller
github.com/zalando/skipper
External DNS
github.com/kubernetes-incubator/external-dns
Postgres Operator
github.com/zalando-incubator/postgres-operator
Kubernetes Resource Report
github.com/hjacobs/kube-resource-report
Kubernetes Downscaler
github.com/hjacobs/kube-downscaler
89. QUESTIONS?
HENNING JACOBS
HEAD OF
DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k