Kubernetes Failure Stories
1. Kubernetes Failure Stories
MEETUP HAMBURG, 2019-02-11
HENNING JACOBS, @try_except_
2. ZALANDO AT A GLANCE
~ 4.5 billion EUR revenue 2017
> 200 million visits per month
> 15.000 employees in Europe
> 70% of visits via mobile devices
> 24 million active customers
> 300.000 product choices
~ 2.000 brands
17 countries
3. SCALE
373 Accounts
100 Clusters
4. DEVELOPERS USING KUBERNETES
5. 46+ cluster components
6. POSTGRES OPERATOR
Application to manage PostgreSQL clusters on Kubernetes
> 500 clusters running on Kubernetes
https://github.com/zalando-incubator/postgres-operator
7. INCIDENTS ARE FINE
8. INCIDENT #1
9. #1: LESS THAN 20% OF NODES AVAILABLE
NAME                          STATUS                     AGE  VERSION
ip-172-31-10-91...internal    NotReady                   4d   v1.7.4+coreos.0
ip-172-31-11-16...internal    NotReady                   4d   v1.7.4+coreos.0
ip-172-31-11-211...internal   Ready,SchedulingDisabled   5d   v1.7.4+coreos.0
ip-172-31-15-46...internal    Ready                      4d   v1.7.4+coreos.0
ip-172-31-18-123...internal   NotReady                   4d   v1.7.4+coreos.0
ip-172-31-19-46...internal    Ready                      4d   v1.7.4+coreos.0
ip-172-31-19-75...internal    NotReady                   4d   v1.7.4+coreos.0
ip-172-31-2-124...internal    NotReady                   4d   v1.7.4+coreos.0
ip-172-31-3-58...internal     Ready                      4d   v1.7.4+coreos.0
ip-172-31-5-211...internal    Ready                      4d   v1.7.4+coreos.0
ip-172-31-7-147...internal    Ready,SchedulingDisabled   5d   v1.7.4+coreos.0
10. TRAIL OF CLUES
• Recovered automatically after 15 minutes
• Nodes unhealthy at same time, recover at same time
• API server is behind AWS ELB
• Seems to happen to others, too
• Some report it happening ~every month
11. UPSTREAM ISSUE
⇒ Fixed in 1.8 (backported to 1.7.8)
https://github.com/kubernetes/kubernetes/issues/48638
12. INCIDENT #2
13. INCIDENT #2: CUSTOMER IMPACT
14. INCIDENT #2: IAM RETURNING 404
15. INCIDENT #2: NUMBER OF PODS
16. LIFE OF A REQUEST (INGRESS)
[Diagram: TLS traffic terminates at the ALB (EC2 network); the ALB forwards HTTP to Skipper on each Node (K8s network), and Skipper routes to the MyApp pods]
17. ROUTES FROM API SERVER
[Diagram: Skipper on each Node reads Ingress routes from the API Server; the ALB forwards traffic to Skipper, which routes to the MyApp pods]
18. API SERVER DOWN
[Diagram: the API Server is OOMKilled; Skipper on each Node can no longer fetch routes while the ALB keeps forwarding traffic]
19. INCIDENT #2: INNOCENT MANIFEST
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "*/15 9-19 * * Mon-Fri"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          concurrencyPolicy: Forbid       # misplaced: belongs under the CronJob spec, silently ignored here
          successfulJobsHistoryLimit: 1   # misplaced: belongs under the CronJob spec
          failedJobsHistoryLimit: 1       # misplaced: belongs under the CronJob spec
          containers:
          ...
20. INCIDENT #2: FIXED CRON JOB
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: "foobar"
spec:
  schedule: "7 8-18 * * Mon-Fri"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      activeDeadlineSeconds: 120
      template:
        spec:
          restartPolicy: Never
          containers:
21. INCIDENT #2: CONTRIBUTING FACTORS
• Wrong CronJob manifest and no automatic job cleanup
• Reliance on Kubernetes API server availability
• Ingress routes not kept as-is in case of outage
• No quota for number of pods
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "1500"
22. INCIDENT #3
23. INCIDENT #3: INGRESS ERRORS
24. INCIDENT #3: COREDNS OOMKILL
coredns invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=994
Memory cgroup out of memory: Kill process 6428 (coredns) score 2050 or sacrifice child
oom_reaper: reaped process 6428 (coredns), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[Graph: CoreDNS pod restarts]
25. STOP THE BLEEDING: INCREASE MEMORY LIMIT
[Graph: CoreDNS memory limit raised from 200Mi to 2Gi, then 4Gi]
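Such a stop-gap can be sketched as a resource patch (a sketch only, assuming the stock coredns Deployment in kube-system; container name and values are illustrative):

```yaml
# patch.yaml – raise the CoreDNS memory limit as a stop-gap.
# Apply with: kubectl -n kube-system patch deployment coredns --patch "$(cat patch.yaml)"
spec:
  template:
    spec:
      containers:
        - name: coredns
          resources:
            requests:
              memory: 200Mi
            limits:
              memory: 2Gi   # raised from 200Mi; later bumped again to 4Gi
```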
26. SPIKE IN HTTP REQUESTS
27. SPIKE IN DNS QUERIES
28. INCREASE IN MEMORY USAGE
29. INCIDENT #3: CONTRIBUTING FACTORS
• HTTP retries
• No DNS caching
• Kubernetes ndots:5 problem
• Short maximum lifetime of HTTP connections
• Fixed memory limit for CoreDNS
• Monitoring affected by DNS outage
github.com/zalando-incubator/kubernetes-on-aws/blob/dev/docs/postmortems/jan-2019-dns-outage.md
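The ndots:5 problem above can be mitigated per pod via the standard dnsConfig field; a minimal sketch (pod and image names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp              # hypothetical
spec:
  containers:
    - name: myapp
      image: myapp:latest  # hypothetical
  dnsConfig:
    options:
      - name: ndots
        value: "1"         # external FQDNs resolve directly instead of walking the cluster search domains first
```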
30. INCIDENT #4
31. #4: KERNEL OOM KILLER
⇒ all containers on this node down
32. INCIDENT #4: KUBELET MEMORY
33. UPSTREAM ISSUE REPORTED
https://github.com/kubernetes/kubernetes/issues/73587
34. INCIDENT #4: THE PATCH
https://github.com/kubernetes/kubernetes/issues/73587
35. INCIDENT #5
36. INCIDENT #5: IMPACT
Error during Pod creation:
MountVolume.SetUp failed for volume
"outfit-delivery-api-credentials" :
secrets "outfit-delivery-api-credentials" not found
⇒ All new Kubernetes deployments fail
37. INCIDENT #5: CREDENTIALS QUEUE
17:30:07 | [pool-6-thread-1] | Current queue size: 7115, current number of active workers: 20
17:31:07 | [pool-6-thread-1] | Current queue size: 7505, current number of active workers: 20
17:32:07 | [pool-6-thread-1] | Current queue size: 7886, current number of active workers: 20
..
17:37:07 | [pool-6-thread-1] | Current queue size: 9686, current number of active workers: 20
..
17:44:07 | [pool-6-thread-1] | Current queue size: 11976, current number of active workers: 20
..
19:16:07 | [pool-6-thread-1] | Current queue size: 58381, current number of active workers: 20
38. INCIDENT #5: CPU THROTTLING
39. INCIDENT #5: WHAT HAPPENED
Scaled down IAM provider to reduce Slack
+ Number of deployments increased
⇒ Process could not process credentials fast enough
40. SLACK
CPU/memory requests "block" resources on nodes.
Difference between actual usage and requests → Slack
[Diagram: Node with CPU and Memory capacity bars; the requested-but-unused share is labeled "Slack"]
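As a worked illustration of slack (all numbers are made up), the node reserves the full request even when the process uses far less:

```yaml
# Hypothetical container: requests are "blocked" on the node at scheduling time.
resources:
  requests:
    cpu: 500m     # reserved on the node
    memory: 1Gi   # reserved on the node
# If actual usage hovers around 100m CPU / 200Mi memory,
# roughly 400m CPU and 800Mi memory per replica sit idle as "slack".
```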
41. DISABLING CPU THROTTLING
kubelet … --cpu-cfs-quota=false
[Announcement] CPU limits will be disabled
⇒ Ingress Latency Improvements
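Besides the kubelet flag, the same effect can be had per workload by keeping CPU requests but omitting the CPU limit (a sketch; values are illustrative):

```yaml
resources:
  requests:
    cpu: 100m        # still counted for scheduling
    memory: 512Mi
  limits:
    memory: 512Mi    # keep a memory limit; with no cpu limit there is no CFS throttling
```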
42. MANAGED KUBERNETES?
43. WILL MANAGED K8S SAVE US?
GKE: monthly uptime percentage at 99.95% for regional clusters
44. WILL MANAGED K8S SAVE US?
NO
(not really)
e.g. AWS EKS uptime SLA is only for API server
45. PRODUCTION PROOFING AWS EKS
List of things you might want to look at for EKS in production
https://medium.com/glia-tech/productionproofing-eks-ed52951ffd6c
46. AWS EKS IN PRODUCTION
https://kubedex.com/90-days-of-aws-eks-in-production/
47. DOCKER.. (ON GKE)
https://github.com/kubernetes/kubernetes/blob/8fd414537b5143ab039cb910590237cabf4af783/cluster/gce/gci/health-monitor.sh#L29
48. WELCOME TO
CLOUD NATIVE!
51. KUBERNETES FAILURE STORIES
20 failure stories so far
What about yours?
github.com/hjacobs/kubernetes-failure-stories
52. QUESTIONS?
HENNING JACOBS
HEAD OF DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k