CONTINUOUS DELIVERY AT ZALANDO
1. CONTINUOUS DELIVERY AT ZALANDO
HU BERLIN GUEST LECTURE
2020-01-29
HENNING JACOBS
@try_except_
2. ABOUT ME
Henning Jacobs
Senior Principal
- Head of Developer Productivity
- Zalando since 2010
- University Karlsruhe (TH) - now KIT
Blog: srcco.de
Twitter: @try_except_
3.
4. ZALANDO AT A GLANCE
~ 5.4 billion EUR revenue 2018
> 300 million visits per month
> 28 million active customers
~ 14,000 employees in Europe
> 80% of visits via mobile devices
> 400,000 product choices
> 2,000 brands
17 countries
as of June 2019
5. SOFTWARE PRODUCT DEVELOPMENT AT ZALANDO
5 tech hubs across Europe
> 2,000 employees in Tech
From > 100 nations
> 200 cross-functional, agile delivery teams
6. HOME-BREWED SOFTWARE
>1100 developers
>200 development teams
>2000 applications
7. OSS WE BUILD ON
• Java (OpenJDK)
• Apache Tomcat
• PostgreSQL
• Python
• JS, Scala, Go, ..
• Kubernetes
8. YOU BUILD IT, YOU RUN IT
The traditional model is that you take your software to the
wall that separates development and operations, and
throw it over and then forget about it. Not at Amazon.
You build it, you run it. This brings developers into
contact with the day-to-day operation of their software. It
also brings them into day-to-day contact with the
customer.
- A Conversation with Werner Vogels, ACM Queue, 2006
9. ON-CALL: YOU OWN IT, YOU RUN IT
When things are broken,
we want people with the best
context trying to fix things.
- Blake Scrivener, Netflix SRE Manager
10. CI/CD
“Continuous Integration is a software development practice where
members of a team integrate their work frequently, usually each person
integrates at least daily - leading to multiple integrations per day.”
“Continuous Delivery is the ability to get changes of all types—
including new features, configuration changes, bug fixes and
experiments—into production, or into the hands of users, safely and
quickly in a sustainable way.”
- Martin Fowler (Continuous Integration), Jez Humble (Continuous Delivery)
11. CONTINUOUS DELIVERY
12. CONTINUOUS DELIVERY
13. WHY CONTINUOUS DELIVERY?
• Lower risks: smaller batches, automation
• Accelerate Time to Market
• Build the Right Product
• Increase Productivity and Efficiency
• Improve Customer Satisfaction
14. CONTINUOUS DELIVERY
If it hurts, do it more
frequently, and bring the
pain forward.
- Jez Humble
15. 2010
"Sysop-Test"
"QA-Test"
16. UNDERSTANDING CYCLE TIME
17. SHORTEN CYCLE TIME
docs.microsoft.com/en-us/azure/devops/learn/what-is-devops
18. THE PHOENIX PROJECT - 2013
"The Three Ways"
1. Systems Thinking
2. Amplify Feedback Loops
3. Culture of Continual
Experimentation And Learning
19. DEVELOPER JOURNEY
Consistent story
that models
all aspects of SW dev
20. Developer Journey
21. Developer Journey
Correctness
Compliance
GDPR
Security
Cost Efficiency
24x7 On Call
Governance
Resilience
Capacity
...
22. CLOUD NATIVE
.. uses an open source software stack to deploy
applications as microservices, packaging each part into
its own container, and dynamically orchestrating those
containers to optimize resource utilization.
Cloud native technologies enable software developers to
build great products faster.
- https://www.cncf.io/
23. CONTAINERS END-TO-END
Setup
Code
Build
Test
Deploy
Operate
Cloud Native Application Runtime
24. CONTAINERS
25. CONTAINERS
26. CONTAINERS VS VIRTUAL MACHINES
docker.com/resources/what-container
27. PLAN & SETUP
28. Plan
Stories
Rules of Play
Tech Radar
29.
30. Setup
Application Bootstrapping
31.
32.
33. BUILD & TEST
34. CONTINUOUS DELIVERY PLATFORM: BUILD
push code → Git → CDP
35. A typical pipeline:
36. A typical pipeline:
Build some artifacts: Success
Deploy some artifacts: Success
Deploy some end-to-end tests (artifacts): Failure
Deploy some artifacts: Skipped
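The status row on this slide (Success, Success, Failure, Skipped) is what a simple rule would produce: steps run in order, and everything after the first failing step is skipped. A minimal Python sketch of that rule follows. The step names and the `run_pipeline` helper are hypothetical and are not CDP's actual implementation.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order; once one fails, mark all later steps as Skipped."""
    results = []
    failed = False
    for name, action in steps:
        if failed:
            results.append((name, "Skipped"))
            continue
        ok = action()
        results.append((name, "Success" if ok else "Failure"))
        failed = not ok
    return results


# Hypothetical steps mirroring the slide: the end-to-end tests fail.
steps = [
    ("build", lambda: True),
    ("deploy-test", lambda: True),
    ("end-to-end-tests", lambda: False),
    ("deploy-production", lambda: True),
]
statuses = [status for _, status in run_pipeline(steps)]
```

The production deployment never runs here because the failing test step stops the pipeline before it.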
37.
38. PULL REQUESTS
❏ PRs should be open for at least 3 days to review
❏ PRs should be quick to understand and review
❏ PRs should ideally have <300 lines of changes
39. PULL REQUESTS
❏ PRs should be open for at least 3 days to review
❏ PRs should be quick to understand and review
❏ PRs should ideally have <300 lines of changes
40. DEPLOY
41. Kubernetes
Deploy
42. DEPLOYMENT CONFIGURATION
├── deploy/apply
│   ├── deployment.yaml
│   ├── credentials.yaml    # Zalando IAM
│   ├── ingress.yaml
│   └── service.yaml
└── delivery.yaml           # Zalando CI/CD
43. INGRESS.YAML
kind: Ingress
metadata:
  name: "..."
spec:
  rules:
    # DNS name your application should be exposed on
    - host: "myapp.foo.example.org"
      http:
        paths:
          - backend:
              serviceName: "myapp"
              servicePort: 80
44. CDP: DEPLOY
"glorified kubectl apply"
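To make "glorified kubectl apply" concrete, here is a minimal Python sketch that turns a manifest directory such as deploy/apply into one `kubectl apply -f` command per YAML file. The helper name `kubectl_apply_commands` is hypothetical. The real CDP adds more on top (for example templating, approval and audit), none of which is modeled here.

```python
from pathlib import Path
from typing import List


def kubectl_apply_commands(directory: str) -> List[List[str]]:
    """Return one `kubectl apply -f <file>` command per *.yaml manifest.

    Files are sorted by name so the order is deterministic.
    The commands are only built here, not executed.
    """
    manifests = sorted(Path(directory).glob("*.yaml"))
    return [["kubectl", "apply", "-f", str(manifest)] for manifest in manifests]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a machine where `kubectl` is configured for the target cluster.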
45. CDP: OPTIONAL APPROVAL
46. CONTINUOUS DELIVERY PLATFORM
47. CLOUD FORMATION VIA CI/CD
"Infrastructure as Code"
├── deploy/apply
│   ├── deployment.yaml     # Kubernetes
│   ├── cf-iam-role.yaml    # AWS IAM Role
│   ├── cf-rds.yaml         # AWS RDS Database
│   ├── kube-ingress.yaml
│   ├── kube-secret.yaml
│   └── kube-service.yaml
└── delivery.yaml           # CI/CD config
48. You build it, you run it!
Deploy
49. EMERGENCY ACCESS SERVICE
Emergency access by referencing Incident
zkubectl cluster-access request \
--emergency -i INC REASON
Privileged production access via 4-eyes
zkubectl cluster-access request REASON
zkubectl cluster-access approve USERNAME
50. KUBERNETES WEB VIEW
kubectl get pods,stacks,deploys,..
51. SEARCHING ACROSS 140+ CLUSTERS
codeberg.org/hjacobs/kube-web-view
52. codeberg.org/hjacobs/kube-web-view
53. SUMMARY
• Application Bootstrapping
• Git as source of truth and UI
• 4-eyes principle for master/production
• Declarative Kubernetes API configuration
and AWS CloudFormation
54. DEPLOYING CHANGES
❏ Every deployment must be approved by
dedicated QA testers
❏ Every master commit should go to production
❏ Multiple changes should be bundled into one deployment
55. DEPLOYING CHANGES
❏ Every deployment must be approved by
dedicated QA testers
❏ Every master commit should go to production
❏ Multiple changes should be bundled into one deployment
56. DELIVERY PERFORMANCE METRICS
• Lead Time
• Release Frequency
• Time to Restore Service
• Change Fail Rate
srcco.de/posts/accelerate-software-delivery-performance.html
57. CONTAINERS
From "Accelerate: The Science of Lean Software and DevOps"
58. DELIVERY PERFORMANCE METRICS
• Lead Time ≙ Commit to Prod
• Release Frequency ≙ Deploys/week/dev
• Time to Restore Service ≙ MTRS from incidents
• Change Fail Rate ≙ n/a
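The first three mappings above can be computed from plain timestamps. The following Python sketch shows one way to do that. The function names, the choice of median for lead time and mean for restore time, and the sample data are all assumptions made for illustration, not how Zalando actually measures these.

```python
from datetime import datetime, timedelta
from statistics import mean, median
from typing import List, Tuple


def lead_time(changes: List[Tuple[datetime, datetime]]) -> timedelta:
    """Median commit-to-production time over (committed_at, deployed_at) pairs."""
    seconds = [(deployed - committed).total_seconds() for committed, deployed in changes]
    return timedelta(seconds=median(seconds))


def release_frequency(deploys: int, weeks: float, developers: int) -> float:
    """Deployments per week per developer."""
    return deploys / weeks / developers


def time_to_restore(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Mean time from incident start to restored service (MTRS)."""
    seconds = [(restored - started).total_seconds() for started, restored in incidents]
    return timedelta(seconds=mean(seconds))


# Made-up sample data, for illustration only.
t0 = datetime(2020, 1, 27, 9, 0)
changes = [
    (t0, t0 + timedelta(hours=1)),
    (t0, t0 + timedelta(hours=2)),
    (t0, t0 + timedelta(hours=6)),
]
incidents = [(t0, t0 + timedelta(minutes=30)), (t0, t0 + timedelta(minutes=90))]
```

With this sample data the lead time is 2 hours, three deployments in one week by two developers give 1.5 deploys/week/dev, and the mean time to restore is 1 hour.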
59. TRUNK-BASED DEVELOPMENT
We .. found that teams using branches
that live a short amount of time
(integration times less than a day)
combined with short merging and
integration periods (less than a day) do
better in terms of software delivery
performance than teams using
longer-lived branches
60. SHORT-LIVED GIT BRANCHES
61. SQUASHING
62.
63. PRE-COMMIT HOOKS
pre-commit.com
64. UNIT TESTS
65. STATIC ANALYSIS WITH SONARQUBE
66. VULNERABILITY SCANNING IN GITHUB
67. FEATURE TOGGLES
68. STACKSET: TRAFFIC SWITCHING
github.com/zalando-incubator/stackset-controller
69. TRAFFIC SWITCHING STEPS IN CDP
github.com/zalando-incubator/stackset-controller
70. DEPLOYING TO PRODUCTION
https://my-application.io
My Application v1
71. DEPLOYING TO PRODUCTION
https://my-application.io https://preview-v2.my-application.io
My Application v1 My Application v2
72. DEPLOYING TO PRODUCTION (BLUE/GREEN)
https://my-application.io https://preview-v2.my-application.io
My Application v1 My Application v2
73. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING)
https://my-application.io
https://preview-v2.my-application.io
My Application v1: 99%
My Application v2: 1%
74. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING)
https://my-application.io
https://preview-v2.my-application.io
My Application v1: 90%
My Application v2: 10%
75. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING)
https://my-application.io
https://preview-v2.my-application.io
My Application v1: 50%
My Application v2: 50%
76. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING)
https://my-application.io
https://preview-v2.my-application.io
My Application v1: 0%
My Application v2: 100%
77. APPLICATION STACK IN KUBERNETES
Ingress resource
name: app-1
Service resource
Deployment resource
labels:
application: app-1
78. TRAFFIC SWITCHING
ROLLING UPDATE DEPLOYMENT
100%
Blue: 75%
Green: 25%
deployment
79. TRAFFIC SWITCHING
ROLLING UPDATE DEPLOYMENT
100%
Blue: 99%
Green: 1%
deployment
80. SKIPPER TRAFFIC SUPPORT
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-app
  annotations:
    zalando.org/backend-weights: |
      {"my-app-v1": 90, "my-app-v2": 10}
spec:
  rules:
    - host: my-app.io
      ...
github.com/zalando/skipper
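The annotation value on this slide is a JSON object mapping backend names to weights. As an illustration, the Python sketch below generates that value for each traffic switching step (1%, 10%, 50%, 100%) shown on the earlier slides. Only the annotation key comes from the slide. The helper names and the validation rules are assumptions.

```python
import json
from typing import Dict, List, Sequence

ANNOTATION = "zalando.org/backend-weights"  # annotation key as shown on the slide


def backend_weights_annotation(weights: Dict[str, int]) -> Dict[str, str]:
    """Build the annotation dict for an Ingress from backend name -> weight."""
    if any(weight < 0 for weight in weights.values()):
        raise ValueError("weights must not be negative")
    if sum(weights.values()) <= 0:
        raise ValueError("at least one backend needs a positive weight")
    return {ANNOTATION: json.dumps(weights)}


def switch_steps(old: str, new: str, steps: Sequence[int] = (1, 10, 50, 100)) -> List[Dict[str, str]]:
    """One annotation per step, moving `step` percent of traffic to `new`."""
    return [backend_weights_annotation({old: 100 - step, new: step}) for step in steps]


annotations = switch_steps("my-app-v1", "my-app-v2")
```

Applying the resulting annotations one after another to the Ingress would walk traffic from v1 to v2 in the same increments as the slides.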
81. MANUAL TRAFFIC SWITCHING
BLUE/GREEN DEPLOYMENT
IngressTemplate
80%: service-v1, deployment-v1
20%: service-v2, deployment-v2
82. MANUAL TRAFFIC SWITCHING
BLUE/GREEN DEPLOYMENT
IngressTemplate
0%: service-v1, deployment-v1
100%: service-v2, deployment-v2
Need to clean up resources manually :’(
83. TRAFFIC SWITCHING
BLUE/GREEN DEPLOYMENT (StackSet)
80%: stack v1
20%: stack v2
84. STACKSET DEFINITION
apiVersion: zalando.org/v1
kind: StackSet
metadata:
  name: my-app
spec:
  ingress:
    host: [my-application.io]
  stackTemplate:
    spec:
      version: v1
      podTemplate:
        spec:
          containers:
            ...
github.com/zalando-incubator/stackset-controller
85. ADDITIONAL FEATURES
86. PRESCALE STACKS
HPA:
  minReplicas: 3
  maxReplicas: 30
stack: v1 (100%)
stack: v2 (0%)
87. PRESCALE STACKS
HPA:
  minReplicas: 3
  maxReplicas: 30
stack: v1 (0%)
stack: v2 (100%)
88. PRESCALE STACKS
HPA:
  minReplicas: 3
  maxReplicas: 30
stack: v1 (Desired: 0%, Actual: 100%)
stack: v2 (Desired: 100%, Actual: 0%)
89. PRESCALE STACKS
HPA:
  minReplicas: 3
  maxReplicas: 30
stack: v1 (Desired: 0%, Actual: 100%)
stack: v2 (Desired: 100%, Actual: 0%)
90. PRESCALE STACKS
stack: v1 (Desired: 0%, Actual: 0%)
stack: v2 (Desired: 100%, Actual: 100%)
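The "Desired" versus "Actual" split on these slides suggests that traffic only moves once the new stack has enough capacity. The Python sketch below illustrates that idea under stated assumptions: the new stack is scaled to the replica count currently serving traffic (clamped to the HPA bounds), and actual traffic follows the desired weight only once those replicas are ready. The function names are hypothetical and the exact rules of stackset-controller may differ.

```python
def prescale_replicas(serving_replicas: int, min_replicas: int, max_replicas: int) -> int:
    """Replica target for the new stack before it receives traffic.

    Assumption: match the replicas currently serving traffic, within HPA bounds.
    """
    return max(min_replicas, min(serving_replicas, max_replicas))


def actual_weight(desired_weight: int, ready_replicas: int, target_replicas: int) -> int:
    """Traffic the new stack actually gets: nothing until it is fully prescaled."""
    return desired_weight if ready_replicas >= target_replicas else 0


# Example: v1 currently runs 12 replicas, HPA allows 3..30.
target = prescale_replicas(12, min_replicas=3, max_replicas=30)
```

Without prescaling, v2 would start at minReplicas (3) and receive the full load at once. With it, v2 first reaches the same 12 replicas and only then takes over the traffic.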
91. GRADUAL DEPLOYMENTS
(AUTOMATIC TRAFFIC SWITCHING)
92. INTEGRATE WITH APPLICATION METRICS
93. GRADUAL DEPLOYMENTS
Desired: 100%
Actual: 100%
stack: v1
Desired: 0%
Actual: 0%
stack: v2
94. GRADUAL DEPLOYMENTS
Desired: 0%
Actual: 0%
Desired: 100%
Actual: 100%
stack: v1
stack: v1’
(baseline stack)
stack: v2
95. GRADUAL DEPLOYMENTS
Desired: 1%
Actual: 1%
Desired: 98%
Actual: 98%
stack: v1
stack: v1’
(baseline stack)
stack: v2
Observe metrics for the two stacks
96. GRADUAL DEPLOYMENTS
Desired: 10%
Actual: 10%
Desired: 80%
Actual: 80%
stack: v1
stack: v1’
(baseline stack)
stack: v2
Observe metrics for the two stacks
97. GRADUAL DEPLOYMENTS
Desired: 10%
Actual: 10%
Desired: 80%
Actual: 80%
stack: v1
stack: v1’
(baseline stack)
stack: v2
Observe metrics for the two stacks
98. GRADUAL DEPLOYMENTS
Desired: 0%
Actual: 0%
Desired: 100%
Actual: 100%
stack: v1
stack: v1’
(baseline stack)
stack: v2
!DEPLOYMENT FAILED!
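The gradual deployment slides can be read as a loop: give the new stack and a baseline copy of the old stack the same small share of traffic, compare their metrics, and either continue or abort. The Python sketch below follows that reading. The 98%/1% and 80%/10% splits match the slides, but the stack names, the error-rate comparison and the tolerance value are assumptions for illustration.

```python
from typing import Callable, Dict, Sequence, Tuple


def step_weights(new_share: int) -> Dict[str, int]:
    """Traffic split for one step: baseline and new stack get the same share."""
    return {"v1": 100 - 2 * new_share, "v1-baseline": new_share, "v2": new_share}


def gradual_rollout(
    steps: Sequence[int],
    error_rates: Callable[[int], Tuple[float, float]],
    tolerance: float = 0.01,
) -> Tuple[bool, int]:
    """Walk through the steps and abort if the new stack looks worse than baseline.

    error_rates(share) returns (baseline_error_rate, new_error_rate) at that share.
    Returns (succeeded, final share of traffic on the new stack).
    """
    for share in steps:
        baseline, candidate = error_rates(share)
        if candidate > baseline + tolerance:
            return False, 0  # deployment failed: v2 gets no traffic
    return True, 100  # every step looked healthy: v2 takes all traffic
```

Comparing against a freshly started baseline stack rather than the long-running v1 keeps the comparison fair, since both stacks see the same share of traffic and start from the same cold state.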
99. OPENTRACING
100. APPLICATION LOGS
101. ALERT ON CUSTOMER IMPACT
102. CHALLENGES
• Test coverage vs pipeline speed
• Maintaining test/staging environments
• Detecting (customer) impact quickly
• Database schema changes (DB evolutions)
• Keeping an overview of the microservice landscape
103. MICROSERVICES
104. MICROSERVICES
monzo.com/blog/we-built-network-isolation-for-1-500-services
105. ZALANDO: DEPLOYMENTS/WEEK
❏ ~300
❏ ~3000
❏ ~7000
❏ ~10000
106. ZALANDO: DEPLOYMENTS/WEEK
❏ ~300
❏ ~3000
❏ ~7000
❏ ~10000
107. CHALLENGES
Database schema changes (DB evolutions)
"The outage was caused by a change in the database schema
of the service which wasn't detected earlier in the testing and
deployment process. It was resolved by switching to the
software version that was compatible with the changed
database schema."
108. ADD COLUMN: SAFE WITHOUT DOWNTIME
stack: v1
ALTER TABLE … ADD COLUMN ...
stack: v2
109. DROP COLUMN: OUTAGE!
stack: v1
ALTER TABLE … DROP COLUMN ...
stack: v2
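The difference between the two slides comes down to whether the version that is still serving traffic (v1) can keep working against the changed schema. A minimal Python sketch of that check, with hypothetical column names and a hypothetical helper `is_safe_for_running_version`:

```python
from typing import Set


def is_safe_for_running_version(columns_used: Set[str], columns_after_change: Set[str]) -> bool:
    """A schema change is safe if every column the running version uses still exists."""
    return columns_used <= columns_after_change


# Hypothetical table: v1 reads and writes these columns.
v1_columns = {"id", "name"}

after_add = {"id", "name", "email"}  # ALTER TABLE ... ADD COLUMN email
after_drop = {"id"}                  # ALTER TABLE ... DROP COLUMN name

add_is_safe = is_safe_for_running_version(v1_columns, after_add)
drop_is_safe = is_safe_for_running_version(v1_columns, after_drop)
```

Adding a column leaves v1 untouched, while dropping a column that v1 still uses breaks it as soon as the migration runs. The sketch ignores other incompatible changes such as renames, type changes or new NOT NULL constraints.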
110. OUTAGES CAUSED BY DEPLOYMENTS
❏ Systematic issues (e.g. lack of test coverage)
should be addressed
❏ It's most important to find the accountable person
and teach him/her how to do releases properly
❏ There is no single root cause for incidents, multiple
contributing factors must be considered
❏ Outages should make the team do fewer deployments
111. OUTAGES CAUSED BY DEPLOYMENTS
❏ Systematic issues (e.g. lack of test coverage)
should be addressed
❏ It's most important to find the accountable person
and teach him/her how to do releases properly
❏ There is no single root cause for incidents, multiple
contributing factors must be considered
❏ Outages should make the team do fewer deployments
112. SUMMARY
• You build it, you run it!
• Git with short-lived branches
• Continuous Deployment as default
• Confidence via automated tests, gradual rollouts
• No trade-off between speed and stability/quality!
• 4 key metrics: Deployment Frequency, Lead Time,
Change Fail Rate, MTTR
113. OPEN SOURCE & MORE
Open Source at Zalando
opensource.zalando.com
More Zalando Tech Talks
github.com/zalando/public-presentations
114. QUESTIONS?
HENNING JACOBS
SENIOR PRINCIPAL
henning@zalando.de
@try_except_
Illustrations by @01k