Developer Experience at Zalando
1. Developer Experience at Zalando
CNCF END USER SIG-DX
2019-04-18
HENNING JACOBS
@try_except_
2. EUROPE’S LEADING ONLINE FASHION PLATFORM
3. ZALANDO AT A GLANCE
~ 5.4 billion EUR revenue 2018
> 250 million visits per month
> 15.000 employees in Europe
> 79% of visits via mobile devices
> 26 million active customers
> 300.000 product choices
~ 2.000 brands
17 countries
4. > 200 development teams
> 1100 developers
Platform
5. YOU BUILD IT, YOU RUN IT
The traditional model is that you take your software to the
wall that separates development and operations, and
throw it over and then forget about it. Not at Amazon.
You build it, you run it. This brings developers into
contact with the day-to-day operation of their software. It
also brings them into day-to-day contact with the
customer.
- A Conversation with Werner Vogels, ACM Queue, 2006
6. ON-CALL: YOU OWN IT, YOU RUN IT
When things are broken,
we want people with the best
context trying to fix things.
- Blake Scrivener, Netflix SRE Manager
7. KUBERNETES @ ZALANDO
Default Deployment Target
114 clusters
~ 1400 nodes
Node Autoscaling
Since Oct 2016
From v1.4 to v1.12
8. DEVELOPERS USING KUBERNETES
9. DEVELOPER JOURNEY
Consistent story that models all aspects of SW dev
10. Developer Journey
11. Correctness
Compliance
GDPR
Security
Cost Efficiency
24x7 On Call
Governance
Resilience
Capacity
...
Developer Journey
12. DEVELOPER PRODUCTIVITY
Setup
Code
Build
Test
Deploy
Operate
Cloud Native Application Runtime
13.
14. PLAN & SETUP
15. Plan
Stories
Rules of Play
Tech Radar
16.
17. Setup
Application Bootstrapping
18.
19.
20. BUILD & TEST
21. CONTINUOUS DELIVERY PLATFORM: BUILD
push
Git
code
CDP
22.
23. DEPLOY
24. Kubernetes
Deploy
25. DEPLOYMENT CONFIGURATION
├── deploy/apply
│   ├── deployment.yaml
│   ├── credentials.yaml  # Zalando IAM
│   ├── ingress.yaml
│   └── service.yaml
└── delivery.yaml         # Zalando CI/CD
26. INGRESS.YAML
kind: Ingress
metadata:
  name: "..."
spec:
  rules:
  # DNS name your application should be exposed on
  - host: "myapp.foo.example.org"
    http:
      paths:
      - backend:
          serviceName: "myapp"
          servicePort: 80
27. TEMPLATING: MUSTACHE
kind: Ingress
metadata:
  name: "..."
spec:
  rules:
  # DNS name your application should be exposed on
  - host: "{{{APPLICATION}}}.example.org"
    http:
      paths:
      - backend:
          serviceName: "{{{APPLICATION}}}"
          servicePort: 80
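The triple-brace tags in the template above are Mustache's unescaped interpolation. The following is a minimal Python sketch of that substitution step only, not the CDP implementation: the `render` helper and the regular expression are hypothetical, and unlike real Mustache (which renders a missing variable as an empty string) this sketch raises an error so that an unset variable is noticed.

```python
import re

# Matches triple-mustache tags such as {{{APPLICATION}}}.
_TRIPLE_TAG = re.compile(r"\{\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}\}")


def render(template: str, variables: dict) -> str:
    """Replace every {{{NAME}}} tag with variables[NAME], without escaping.

    Hypothetical helper for illustration; a real Mustache library also
    handles sections, partials and escaped {{NAME}} tags.
    """
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            # Assumption of this sketch: fail loudly instead of rendering "".
            raise KeyError(f"no value for template variable {name!r}")
        return str(variables[name])

    return _TRIPLE_TAG.sub(substitute, template)


INGRESS_TEMPLATE = '''\
kind: Ingress
metadata:
  name: "..."
spec:
  rules:
  - host: "{{{APPLICATION}}}.example.org"
    http:
      paths:
      - backend:
          serviceName: "{{{APPLICATION}}}"
          servicePort: 80
'''

rendered = render(INGRESS_TEMPLATE, {"APPLICATION": "myapp"})
print(rendered)
```

With `APPLICATION` set to `myapp`, both the host name and the service name are filled in from the same variable, which is the point of templating the manifest.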
28. CONTINUOUS DELIVERY PLATFORM
29. CDP: DEPLOY
"glorified kubectl apply"
30. CDP: OPTIONAL APPROVAL
31. STACKSET: TRAFFIC SWITCHING
github.com/zalando-incubator/stackset-controller
32. STACKSET CRD
apiVersion: zalando.org/v1
kind: StackSet
...
spec:
  ingress:
    hosts: ["foo.example.org"]
    backendPort: 8080
  stackLifecycle:
    scaledownTTLSeconds: 1800
    limit: 5
  stackTemplate:
    spec:
      podTemplate:
        ...
33. TRAFFIC SWITCHING STEPS IN CDP
34. EMERGENCY ACCESS SERVICE
Get emergency access by referencing existing Incident ticket:
zkubectl cluster-access request --emergency -i INC REASON
Get privileged production access via 4-eyes:
zkubectl cluster-access request REASON
zkubectl cluster-access approve USERNAME
35. INTEGRATIONS
36. CLOUD FORMATION VIA CI/CD
"Infrastructure as Code"
├── deploy/apply
│   ├── deployment.yaml     # Kubernetes
│   ├── cf-iam-role.yaml    # AWS IAM Role
│   ├── cf-rds.yaml         # AWS RDS Database
│   ├── kube-ingress.yaml
│   ├── kube-secret.yaml
│   └── kube-service.yaml
└── delivery.yaml           # CI/CD config
37. ZALANDO IAM/OAUTH VIA CRD
apiVersion: zalando.org/v1
kind: PlatformCredentialsSet
..
spec:
  application: my-app
  tokens:
    read-only:
      privileges:
      - com.zalando::foobar.read
  clients:
    employee:
      grant: authorization-code
      realm: users
      redirectUri: https://example.org/auth/callback
38. POSTGRES OPERATOR
Application to manage PostgreSQL clusters on Kubernetes
>700 clusters running on Kubernetes
github.com/zalando/postgres-operator
39. Elasticsearch
2.500 vCPUs
1 TB RAM
Elasticsearch in Kubernetes
github.com/zalando-incubator/es-operator/
40. SUMMARY
• Application Bootstrapping
• Git as source of truth and UI
• 4-eyes principle for master/production
• Extensible Kubernetes API as primary interface
• OAuth/IAM credentials
• PostgreSQL
• CloudFormation for proprietary AWS services
41. DELIVERY PERFORMANCE METRICS
• Lead Time
• Release Frequency
• Time to Restore Service
• Change Fail Rate
https://srcco.de/posts/accelerate-software-delivery-performance.html
42. CONTAINERS
From "Accelerate: The Science of Lean Software and DevOps"
43. DELIVERY PERFORMANCE METRICS
• Lead Time ≙ Commit to Prod
• Release Frequency ≙ Deploys/week/dev
• Time to Restore Service ≙ MTRS from incidents
• Change Fail Rate ≙ n/a
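The three measurable mappings above can be sketched in a few lines of Python. This is an illustration under assumptions, not how Zalando computes its numbers: the function names, the use of the median for lead time, and all sample data are made up for the example.

```python
from datetime import datetime, timedelta
from statistics import median


def lead_time(commit_time: datetime, prod_deploy_time: datetime) -> timedelta:
    """Lead time of one change: commit to production deployment."""
    return prod_deploy_time - commit_time


def release_frequency(deploys: int, weeks: float, developers: int) -> float:
    """Release frequency as deploys per week per developer."""
    return deploys / weeks / developers


def mean_time_to_restore(incidents) -> timedelta:
    """MTRS: average of (restored - started) over all incidents."""
    durations = [restored - started for started, restored in incidents]
    return sum(durations, timedelta()) / len(durations)


# Made-up sample data: (commit time, production deploy time) per change.
changes = [
    (datetime(2019, 4, 1, 10, 0), datetime(2019, 4, 1, 12, 0)),  # 2 h
    (datetime(2019, 4, 2, 9, 0), datetime(2019, 4, 2, 15, 0)),   # 6 h
    (datetime(2019, 4, 3, 9, 0), datetime(2019, 4, 3, 10, 0)),   # 1 h
]
# Made-up sample data: (incident start, service restored) per incident.
incidents = [
    (datetime(2019, 4, 4, 8, 0), datetime(2019, 4, 4, 8, 30)),    # 30 min
    (datetime(2019, 4, 5, 14, 0), datetime(2019, 4, 5, 15, 30)),  # 90 min
]

median_lead_time = median(lead_time(c, d) for c, d in changes)
frequency = release_frequency(deploys=30, weeks=2, developers=5)
mtrs = mean_time_to_restore(incidents)

print(median_lead_time, frequency, mtrs)
```

Change Fail Rate is left out on purpose, since the slide marks it as n/a.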
44. “.. means establishing empathy with internal
consumers (read: developers) and collaborating
with them on the design. Platform product managers
establish roadmaps and ensure the platform delivers
value to the business and enhances the developer
experience.”
- ThoughtWorks Technology Radar
45.
46. DEVELOPER SATISFACTION
47. DOCUMENTATION
"Documentation is hard to find"
"Documentation is not comprehensive enough"
"Remove unnecessary complexity and obstacles."
"Get the documentation up to date and prepare
use cases"
"More and more clear documentation"
"More detailed docs, example repos with more
complicated deployments."
48. DOCUMENTATION
• Restructure following https://www.divio.com/en/blog/documentation/
  • Concepts
  • How Tos
  • Tutorials
  • Reference
• Global Search
• Weekly Health Check: Support → Documentation
49.
50. NEWSLETTER
"You can now.."
• You can now benefit from the most recent
Kubernetes 1.12 features, e.g. ..
• You can now analyse your Kotlin project with
SonarQube and upload your Scala code coverage
report to SonarQube
51. SIGNAL: ISSUE UPVOTES
52. TESTIMONIALS
“So, thank you, Team Automata, for listening to our
community, taking our upvotes in consideration when
developing new solutions and building every day
'the first CI that doesn't suck'.”
- a user, October 2018
53. MONITORING
54. ZMON DASHBOARD
github.com/zalando/zmon
55. GRAFANA APPLICATION DASHBOARD
56. KUBERNETES RESOURCE REPORT
github.com/hjacobs/kube-resource-report
57. RESOURCE REPORT: TEAMS
Sorting teams by Slack Costs
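As a rough illustration of what sorting teams by slack costs involves, the Python sketch below treats slack as the gap between requested and actually used resources and prices that gap. The exact formula used by kube-resource-report may differ; the field names, prices and team data here are hypothetical.

```python
def slack(requested: float, used: float) -> float:
    """Requested-but-unused amount of a resource (never negative)."""
    return max(requested - used, 0.0)


def slack_cost(usage: dict, cpu_price: float, memory_price: float) -> float:
    """Price the CPU and memory slack of one team.

    `usage` holds hypothetical keys: cpu_requested, cpu_used (cores) and
    memory_requested_gib, memory_used_gib (GiB).
    """
    cpu = slack(usage["cpu_requested"], usage["cpu_used"])
    memory = slack(usage["memory_requested_gib"], usage["memory_used_gib"])
    return cpu * cpu_price + memory * memory_price


# Assumed monthly prices per core and per GiB, for illustration only.
CPU_PRICE = 10.0
MEMORY_PRICE = 2.0

teams = {
    "team-b": {"cpu_requested": 2.0, "cpu_used": 1.5,
               "memory_requested_gib": 4.0, "memory_used_gib": 3.0},
    "team-a": {"cpu_requested": 4.0, "cpu_used": 1.0,
               "memory_requested_gib": 8.0, "memory_used_gib": 2.0},
}

# Teams with the highest slack cost first.
ranking = sorted(
    teams,
    key=lambda team: slack_cost(teams[team], CPU_PRICE, MEMORY_PRICE),
    reverse=True,
)
print(ranking)
```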
58. RESOURCE REPORT: APPLICATIONS
"Slack"
59. RESOURCE REPORT: CLUSTERS
"Slack"
60. UNDER THE HOOD
61. ZALANDO: DECISION
1. Forbid Memory Overcommit
  • Implement mutating admission webhook
  • Set requests = limits
2. Disable CPU CFS Quota in all clusters
  • --cpu-cfs-quota=false
62. KUBERNETES CLUSTER SETUP
[Diagram labels: Master, Master Config, Worker, EC2 Instances, CloudFormation Stacks]
github.com/zalando-incubator/kubernetes-on-aws
63. CLUSTER PROVISIONING
CLUSTER LIFECYCLE MANAGER (CLM)
[Diagram labels: Admin, Cluster Registry, CLM, create CF stack (CloudFormation), provision resources, apply manifests (API)]
github.com/zalando-incubator/cluster-lifecycle-manager
64. INGRESS
https://github.com/zalando-incubator/kube-ingress-aws-controller
65. VPA FOR PROMETHEUS
apiVersion: poc.autoscaling.k8s.io/v1alpha1
kind: VerticalPodAutoscaler
metadata:
  name: prometheus-vpa
  namespace: kube-system
spec:
  selector:
    matchLabels:
      application: prometheus
  updatePolicy:
    updateMode: Auto
66. VERTICAL POD AUTOSCALER
limit/requests adapted by VPA
67. HORIZONTAL POD AUTOSCALING (CUSTOM METRICS)
Queue Length
Ingress Req/s
Prometheus Query
ZMON Check
github.com/zalando-incubator/kube-metrics-adapter
68. DOWNSCALING DURING OFF-HOURS
Weekend
github.com/hjacobs/kube-downscaler
69. DOWNSCALING DURING OFF-HOURS
DEFAULT_UPTIME="Mon-Fri 07:30-20:30 CET"
annotations:
  downscaler/exclude: "true"
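To make the uptime setting concrete, here is a minimal Python sketch of how a time can be checked against a spec such as "Mon-Fri 07:30-20:30 CET". It is not kube-downscaler's code. The `is_uptime` helper is hypothetical, it assumes the given time is already expressed in the spec's time zone (the zone token is parsed but not used), it treats the end time as exclusive, and it does not handle ranges that wrap around the week or past midnight.

```python
import re
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# "<day>-<day> <HH:MM>-<HH:MM> <timezone>", e.g. "Mon-Fri 07:30-20:30 CET"
_SPEC = re.compile(
    r"^([A-Z][a-z]{2})-([A-Z][a-z]{2}) (\d{2}):(\d{2})-(\d{2}):(\d{2}) (\S+)$"
)


def is_uptime(local_time: datetime, spec: str) -> bool:
    """Return True if local_time falls inside the uptime window of spec.

    Assumption: local_time is already in the time zone named by spec.
    """
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"unsupported uptime spec: {spec!r}")
    first_day = WEEKDAYS.index(match.group(1))
    last_day = WEEKDAYS.index(match.group(2))
    start = int(match.group(3)) * 60 + int(match.group(4))
    end = int(match.group(5)) * 60 + int(match.group(6))
    minute_of_day = local_time.hour * 60 + local_time.minute
    in_days = first_day <= local_time.weekday() <= last_day
    return in_days and start <= minute_of_day < end


DEFAULT_UPTIME = "Mon-Fri 07:30-20:30 CET"

# 2019-04-18 is a Thursday, 2019-04-20 a Saturday.
print(is_uptime(datetime(2019, 4, 18, 12, 0), DEFAULT_UPTIME))
print(is_uptime(datetime(2019, 4, 20, 12, 0), DEFAULT_UPTIME))
```

Outside the window, workloads would be candidates for downscaling unless they carry the exclude annotation.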
70. KUBERNETES JANITOR
● TTL and expiry date annotations, e.g.
  ○ set time-to-live for your test deployment
● Custom rules, e.g.
  ○ delete everything without "app" label after 7 days
github.com/hjacobs/kube-janitor
71. JANITOR TTL ANNOTATION
# let's try out nginx, but only for 1 hour
kubectl run nginx --image=nginx
kubectl annotate deploy nginx janitor/ttl=1h
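The TTL annotation boils down to comparing a resource's age with a duration. The Python sketch below illustrates that check; it is not kube-janitor's implementation. The helper names are hypothetical, and the accepted units (s, m, h, d, w) are an assumption based on the `1h` and `7d` values shown on the slides.

```python
import re
from datetime import datetime, timedelta

# Assumed unit suffixes and their length in seconds.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
_TTL = re.compile(r"^(\d+)([smhdw])$")


def parse_ttl(ttl: str) -> timedelta:
    """Parse a TTL such as '1h' or '7d' into a timedelta."""
    match = _TTL.match(ttl)
    if match is None:
        raise ValueError(f"unsupported TTL: {ttl!r}")
    value, unit = int(match.group(1)), match.group(2)
    return timedelta(seconds=value * _UNIT_SECONDS[unit])


def is_expired(created: datetime, ttl: str, now: datetime) -> bool:
    """True once the resource is older than its TTL."""
    return now > created + parse_ttl(ttl)


created = datetime(2019, 4, 18, 10, 0)
print(is_expired(created, "1h", now=datetime(2019, 4, 18, 10, 30)))
print(is_expired(created, "1h", now=datetime(2019, 4, 18, 11, 30)))
```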
72. CUSTOM JANITOR RULES
# require "app" label for new pods starting April 2019
- id: require-app-label-april-2019
  resources:
  - deployments
  - statefulsets
  jmespath: "!(spec.template.metadata.labels.app) && metadata.creationTimestamp > '2019-04-01'"
  ttl: 7d
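One way to read the rule's expression is as a plain predicate over the resource: it matches when the pod template has no "app" label and the resource was created after 2019-04-01. The Python sketch below spells out that reading without a JMESPath library. The `matches_rule` function is hypothetical, and it relies on ISO 8601 timestamps sorting correctly as strings.

```python
def matches_rule(resource: dict) -> bool:
    """Plain-Python reading of the rule's expression.

    Matches resources whose pod template lacks an "app" label and whose
    creationTimestamp sorts after '2019-04-01'.
    """
    template = resource.get("spec", {}).get("template", {})
    labels = template.get("metadata", {}).get("labels") or {}
    has_app_label = bool(labels.get("app"))
    created = resource.get("metadata", {}).get("creationTimestamp", "")
    return (not has_app_label) and created > "2019-04-01"


unlabeled_new = {
    "metadata": {"creationTimestamp": "2019-04-10T09:00:00Z"},
    "spec": {"template": {"metadata": {"labels": {"team": "foo"}}}},
}
labeled_new = {
    "metadata": {"creationTimestamp": "2019-04-10T09:00:00Z"},
    "spec": {"template": {"metadata": {"labels": {"app": "myapp"}}}},
}
unlabeled_old = {
    "metadata": {"creationTimestamp": "2019-03-15T09:00:00Z"},
    "spec": {"template": {"metadata": {"labels": {}}}},
}

for resource in (unlabeled_new, labeled_new, unlabeled_old):
    print(matches_rule(resource))
```

Only the first resource would receive the 7 day TTL: the second carries the label, and the third predates the cutoff.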
73. EC2 SPOT NODES
72% savings
74. SPOT ASG / LAUNCH TEMPLATE
Not upstream in cluster-autoscaler (yet)
75. OPEN SOURCE
Kubernetes on AWS
github.com/zalando-incubator/kubernetes-on-aws
AWS ALB Ingress controller
github.com/zalando-incubator/kube-ingress-aws-controller
External DNS
github.com/kubernetes-incubator/external-dns
Postgres Operator
github.com/zalando/postgres-operator
Kubernetes Resource Report
github.com/hjacobs/kube-resource-report
Kubernetes Downscaler
github.com/hjacobs/kube-downscaler
Kubernetes Janitor
github.com/hjacobs/kube-janitor
76. MORE INFO
● DevOps Gathering 2019: Ensuring Kubernetes Cost Efficiency across (many) Clusters (slides)
● DevOpsCon Munich 2018: Running Kubernetes in Production: A Million Ways to Crash Your Cluster
● HighLoad++ Moscow 2018: Optimizing Kubernetes Resource Requests/Limits for Cost-Efficiency and Latency (slides)
● DevOps Lisbon Meetup 2018: Kubernetes at Zalando
kubernetes-on-aws.readthedocs.io/en/latest/admin-guide/public-presentations.html
77. QUESTIONS?
HENNING JACOBS
HEAD OF DEVELOPER PRODUCTIVITY
henning@zalando.de
@try_except_
Illustrations by @01k