CONTINUOUS DELIVERY AT ZALANDO

1. CONTINUOUS DELIVERY AT ZALANDO HU BERLIN GUEST LECTURE 2020-01-29 HENNING JACOBS @try_except_
2. ABOUT ME Henning Jacobs, Senior Principal • Head of Developer Productivity • Zalando since 2010 • University Karlsruhe (TH), now KIT • Blog: srcco.de • Twitter: @try_except_
3.
4. ZALANDO AT A GLANCE • ~ 5.4 billion EUR revenue 2018 • > 300 million visits per month • ~ 14,000 employees in Europe • > 80% of visits via mobile devices • > 28 million active customers • > 400,000 product choices • > 2,000 brands • 17 countries (as of June 2019)
5. SOFTWARE PRODUCT DEVELOPMENT AT ZALANDO • Tech hubs across Europe • > 2,000 employees in tech • From > 100 nations • > 200 cross-functional, agile delivery teams
6. HOME-BREWED SOFTWARE • >1100 developers • >200 development teams • >2000 applications
7. OSS WE BUILD ON • Java (OpenJDK) • Apache Tomcat • PostgreSQL • Python • JS, Scala, Go, .. • Kubernetes
8. YOU BUILD IT, YOU RUN IT "The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer." - A Conversation with Werner Vogels, ACM Queue, 2006
9. ON-CALL: YOU OWN IT, YOU RUN IT "When things are broken, we want people with the best context trying to fix things." - Blake Scrivener, Netflix SRE Manager
10. CI/CD "Continuous Integration is a software development practice where members of a team integrate their work frequently, usually each person integrates at least daily - leading to multiple integrations per day." "Continuous Delivery is the ability to get changes of all types - including new features, configuration changes, bug fixes and experiments - into production, or into the hands of users, safely and quickly in a sustainable way." - Martin Fowler
11. CONTINUOUS DELIVERY
12. CONTINUOUS DELIVERY
13. WHY CONTINUOUS DELIVERY? • Lower risks: smaller batches, automation • Accelerate Time to Market • Build the Right Product • Increase Productivity and Efficiency • Improve Customer Satisfaction
14. CONTINUOUS DELIVERY "If it hurts, do it more frequently, and bring the pain forward." - Jez Humble
15. 2010 "Sysop-Test" "QA-Test"
16. UNDERSTANDING CYCLE TIME
17. SHORTEN CYCLE TIME docs.microsoft.com/en-us/azure/devops/learn/what-is-devops
18. THE PHOENIX PROJECT - 2013 "The Three Ways" 1. Systems Thinking 2. Amplify Feedback Loops 3. Culture of Continual Experimentation And Learning
19. DEVELOPER JOURNEY Consistent story that models all aspects of SW dev
20. Developer Journey
21. Developer Journey • Correctness • Compliance • GDPR • Security • Cost Efficiency • 24x7 On Call • Governance • Resilience • Capacity • ...
22. CLOUD NATIVE ".. uses an open source software stack to deploy applications as microservices, packaging each part into its own container, and dynamically orchestrating those containers to optimize resource utilization. Cloud native technologies enable software developers to build great products faster." - https://www.cncf.io/
23. CONTAINERS END-TO-END Setup • Code • Build • Test • Deploy • Operate • Cloud Native Application Runtime
24. CONTAINERS
25. CONTAINERS
26. CONTAINERS VS VIRTUAL MACHINES docker.com/resources/what-container
27. PLAN & SETUP
28. Plan: Stories • Rules of Play • Tech Radar
29.
30. Setup: Application Bootstrapping
31.
32.
33. BUILD & TEST
34. CONTINUOUS DELIVERY PLATFORM: BUILD code → git push → CDP
35. A typical pipeline:
36. A typical pipeline: Steps: Build some artifacts • Deploy some artifacts • Deploy some end-to-end tests (artifacts) • Deploy some artifacts. Step statuses shown: Success • Success • Failure • Skipped
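The statuses on the pipeline slide suggest that a failing step stops the run and that later steps are skipped. A minimal Python sketch of that behaviour follows. It assumes the four statuses map onto the four steps in order, which the transcript does not confirm. `run_pipeline` and the step callables are hypothetical names for illustration and are not part of CDP.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order; after the first failure, mark the rest as skipped."""
    results = []
    failed = False
    for name, action in steps:
        if failed:
            results.append((name, "Skipped"))
            continue
        ok = action()
        results.append((name, "Success" if ok else "Failure"))
        failed = not ok
    return results


# Hypothetical step outcomes, chosen to reproduce the statuses on the slide.
pipeline_results = run_pipeline([
    ("Build some artifacts", lambda: True),
    ("Deploy some artifacts", lambda: True),
    ("Deploy some end-to-end tests (artifacts)", lambda: False),
    ("Deploy some artifacts", lambda: True),
])

for name, status in pipeline_results:
    print(f"{status:8} {name}")
```

The last step never runs because the end-to-end tests failed, so a broken build cannot reach the next environment.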
37.
38. PULL REQUESTS ❏ PRs should be open for at least 3 days to review ❏ PRs should be quick to understand and review ❏ PRs should ideally have <300 lines of changes
40. DEPLOY
41. Kubernetes Deploy
42. DEPLOYMENT CONFIGURATION
├── deploy/apply
│   ├── deployment.yaml
│   ├── credentials.yaml  # Zalando IAM
│   ├── ingress.yaml
│   └── service.yaml
└── delivery.yaml         # Zalando CI/CD
43. INGRESS.YAML
kind: Ingress
metadata:
  name: "..."
spec:
  rules:
    # DNS name your application should be exposed on
    - host: "myapp.foo.example.org"
      http:
        paths:
          - backend:
              serviceName: "myapp"
              servicePort: 80
44. CDP: DEPLOY "glorified kubectl apply"
45. CDP: OPTIONAL APPROVAL
46. CONTINUOUS DELIVERY PLATFORM
47. CLOUD FORMATION VIA CI/CD "Infrastructure as Code"
├── deploy/apply
│   ├── deployment.yaml    # Kubernetes
│   ├── cf-iam-role.yaml   # AWS IAM Role
│   ├── cf-rds.yaml        # AWS RDS Database
│   ├── kube-ingress.yaml
│   ├── kube-secret.yaml
│   └── kube-service.yaml
└── delivery.yaml          # CI/CD config
48. You build it, you run it! Deploy
49. EMERGENCY ACCESS SERVICE
Emergency access by referencing Incident:
  zkubectl cluster-access request \
    --emergency -i INC REASON
Privileged production access via 4-eyes:
  zkubectl cluster-access request REASON
  zkubectl cluster-access approve USERNAME
50. KUBERNETES WEB VIEW kubectl get pods,stacks,deploys,..
51. SEARCHING ACROSS 140+ CLUSTERS codeberg.org/hjacobs/kube-web-view
52. codeberg.org/hjacobs/kube-web-view
53. SUMMARY • Application Bootstrapping • Git as source of truth and UI • 4-eyes principle for master/production • Declarative Kubernetes API configuration and AWS CloudFormation
54. DEPLOYING CHANGES ❏ Every deployment must be approved by dedicated QA testers ❏ Every master commit should go to production ❏ Multiple changes should be bundled into one deployment
56. DELIVERY PERFORMANCE METRICS • Lead Time • Release Frequency • Time to Restore Service • Change Fail Rate srcco.de/posts/accelerate-software-delivery-performance.html
57. CONTAINERS From "Accelerate: The Science of Lean Software and DevOps"
58. DELIVERY PERFORMANCE METRICS • Lead Time ≙ Commit to Prod • Release Frequency ≙ Deploys/week/dev • Time to Restore Service ≙ MTRS from incidents • Change Fail Rate ≙ n/a
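The four metrics can be computed from simple deployment records. The following Python sketch is illustrative only: the `Deployment` record, its fields and the sample data are assumptions, not Zalando's data model. The slide derives time to restore from incidents and lists change fail rate as "n/a", whereas this sketch attaches both to deployment records to stay self-contained.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import List, Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    commit_time: datetime
    deploy_time: datetime
    failed: bool = False
    restored_time: Optional[datetime] = None


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def delivery_metrics(deployments: List[Deployment], developers: int, weeks: float) -> dict:
    # Lead time: commit to production.
    lead_times = [hours(d.deploy_time - d.commit_time) for d in deployments]
    # Time to restore: only for failed deployments that were restored.
    restores = [
        hours(d.restored_time - d.deploy_time)
        for d in deployments
        if d.failed and d.restored_time is not None
    ]
    return {
        "lead_time_hours_median": median(lead_times),
        "deploys_per_week_per_dev": len(deployments) / (weeks * developers),
        "time_to_restore_hours_mean": mean(restores) if restores else None,
        "change_fail_rate": sum(d.failed for d in deployments) / len(deployments),
    }


# Made-up sample data: four deployments by two developers in one week.
t0 = datetime(2020, 1, 27, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=1)),
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0, t0 + timedelta(hours=3), failed=True,
               restored_time=t0 + timedelta(hours=3, minutes=30)),
    Deployment(t0, t0 + timedelta(hours=6)),
]
metrics = delivery_metrics(sample, developers=2, weeks=1)
print(metrics)
```

The median is used for lead time so that a single slow change does not dominate the figure. That choice is a design decision of this sketch and is not taken from the slides.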
59. TRUNK-BASED DEVELOPMENT "We .. found that teams using branches that live a short amount of time (integration times less than a day) combined with short merging and integration periods (less than a day) do better in terms of software delivery performance than teams using longer-lived branches."
60. SHORT-LIVED GIT BRANCHES
61. SQUASHING
62.
63. PRE-COMMIT HOOKS pre-commit.com
64. UNIT TESTS
65. STATIC ANALYSIS WITH SONARQUBE
66. VULNERABILITY SCANNING IN GITHUB
67. FEATURE TOGGLES
68. STACKSET: TRAFFIC SWITCHING github.com/zalando-incubator/stackset-controller
69. TRAFFIC SWITCHING STEPS IN CDP github.com/zalando-incubator/stackset-controller
70. DEPLOYING TO PRODUCTION https://my-application.io → My Application v1
71. DEPLOYING TO PRODUCTION https://my-application.io → My Application v1 • https://preview-v2.my-application.io → My Application v2
72. DEPLOYING TO PRODUCTION (BLUE/GREEN) https://my-application.io → My Application v1 • https://preview-v2.my-application.io → My Application v2
73. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING) https://my-application.io → 99% My Application v1, 1% My Application v2 • https://preview-v2.my-application.io → My Application v2
74. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING) https://my-application.io → 90% My Application v1, 10% My Application v2 • https://preview-v2.my-application.io → My Application v2
75. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING) https://my-application.io → 50% My Application v1, 50% My Application v2 • https://preview-v2.my-application.io → My Application v2
76. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING) https://my-application.io → 0% My Application v1, 100% My Application v2 • https://preview-v2.my-application.io → My Application v2
77. APPLICATION STACK IN KUBERNETES Ingress resource • name: app-1 • Service resource • Deployment resource • labels: application: app-1
78. TRAFFIC SWITCHING: ROLLING UPDATE DEPLOYMENT 100% → deployment (Blue: 75%, Green: 25%)
79. TRAFFIC SWITCHING: ROLLING UPDATE DEPLOYMENT 100% → deployment (Blue: 99%, Green: 1%)
80. SKIPPER TRAFFIC SUPPORT
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-app
  annotations:
    zalando.org/backend-weights: |
      {"my-app-v1": 90, "my-app-v2": 10}
spec:
  rules:
    - host: my-app.io
      ...
github.com/zalando/skipper
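The value of the `zalando.org/backend-weights` annotation is a JSON object mapping backend names to weights. A small Python sketch that builds this value and shows the resulting traffic split is given below. The helper names are hypothetical, and treating the weights as relative shares of their sum is an assumption of this sketch rather than a statement about Skipper's implementation.

```python
import json
from typing import Dict


def backend_weights_annotation(weights: Dict[str, int]) -> Dict[str, str]:
    """Build the annotation dict for an Ingress (hypothetical helper)."""
    if not weights:
        raise ValueError("at least one backend is required")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must not be negative")
    if sum(weights.values()) == 0:
        raise ValueError("at least one backend needs a positive weight")
    return {"zalando.org/backend-weights": json.dumps(weights)}


def traffic_share(weights: Dict[str, int]) -> Dict[str, float]:
    """Assumption: each backend receives weight / sum(weights) of the traffic."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


weights = {"my-app-v1": 90, "my-app-v2": 10}
annotation = backend_weights_annotation(weights)
shares = traffic_share(weights)
print(annotation["zalando.org/backend-weights"])
print(shares)
```

Generating the annotation from a dict keeps the JSON well-formed, which is easy to get wrong when editing it by hand inside YAML.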
81. MANUAL TRAFFIC SWITCHING: BLUE/GREEN DEPLOYMENT IngressTemplate • 80% → service-v1 → deployment-v1 • 20% → service-v2 → deployment-v2
82. MANUAL TRAFFIC SWITCHING: BLUE/GREEN DEPLOYMENT IngressTemplate • 0% → service-v1 → deployment-v1 • 100% → service-v2 → deployment-v2 • Need to clean up resources manually :’(
83. TRAFFIC SWITCHING: BLUE/GREEN DEPLOYMENT (StackSet) 80% → stack: v1 • 20% → stack: v2
84. STACKSET DEFINITION
apiVersion: zalando.org/v1
kind: StackSet
metadata:
  name: my-app
spec:
  ingress:
    host: [my-application.io]
  stackTemplate:
    spec:
      version: v1
      podTemplate:
        spec:
          containers: ...
github.com/zalando-incubator/stackset-controller
85. ADDITIONAL FEATURES
86. PRESCALE STACKS stack: v1 (100%) • stack: v2 (0%) • HPA: minReplicas: 3, maxReplicas: 30
87. PRESCALE STACKS stack: v1 (0%) • stack: v2 (100%) • HPA: minReplicas: 3, maxReplicas: 30
88. PRESCALE STACKS stack: v1 (Desired: 0%, Actual: 100%) • stack: v2 (Desired: 100%, Actual: 0%) • HPA: minReplicas: 3, maxReplicas: 30
89. PRESCALE STACKS stack: v1 (Desired: 0%, Actual: 100%) • stack: v2 (Desired: 100%, Actual: 0%) • HPA: minReplicas: 3, maxReplicas: 30
90. PRESCALE STACKS stack: v1 (Desired: 0%, Actual: 0%) • stack: v2 (Desired: 100%, Actual: 100%)
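The prescale slides distinguish desired from actual traffic: the new stack is scaled up first, and traffic only moves once it has enough replicas. The Python sketch below is a simplified model of that idea and not the stackset-controller's actual algorithm. The `Stack` class, the `reconcile` function and the replica numbers are hypothetical, and the sketch assumes scaled-up pods are ready by the next reconcile call.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Stack:
    replicas: int  # replicas currently running
    desired: int   # desired traffic share in percent
    actual: int    # traffic share actually routed, in percent


def reconcile(stacks: Dict[str, Stack], min_replicas: int = 3, max_replicas: int = 30) -> None:
    """One reconcile step: prescale first, switch traffic only when ready."""
    # Capacity currently serving traffic.
    serving = sum(s.replicas for s in stacks.values() if s.actual > 0)
    ready = True
    for stack in stacks.values():
        if stack.desired > stack.actual:
            needed = math.ceil(serving * stack.desired / 100)
            needed = max(min_replicas, min(max_replicas, needed))
            if stack.replicas < needed:
                stack.replicas = needed  # prescale, keep traffic where it is
                ready = False
    if ready:
        for stack in stacks.values():
            stack.actual = stack.desired


stacks = {
    "v1": Stack(replicas=12, desired=0, actual=100),
    "v2": Stack(replicas=3, desired=100, actual=0),
}
reconcile(stacks)  # v2 is scaled to 12 replicas, traffic still on v1
after_first = (stacks["v2"].replicas, stacks["v1"].actual, stacks["v2"].actual)
reconcile(stacks)  # v2 has enough capacity, traffic is switched
after_second = (stacks["v1"].actual, stacks["v2"].actual)
print(after_first, after_second)
```

Without the prescale step, v2 would receive all traffic while still running at `minReplicas`, and the autoscaler would have to catch up under load.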
91. GRADUAL DEPLOYMENTS (AUTOMATIC TRAFFIC SWITCHING)
92. INTEGRATE WITH APPLICATION METRICS
93. GRADUAL DEPLOYMENTS stack: v1 (Desired: 100%, Actual: 100%) • stack: v2 (Desired: 0%, Actual: 0%)
94. GRADUAL DEPLOYMENTS stack: v1 (Desired: 100%, Actual: 100%) • stack: v1’ (baseline stack) and stack: v2 (Desired: 0%, Actual: 0%)
95. GRADUAL DEPLOYMENTS stack: v1 (Desired: 98%, Actual: 98%) • stack: v1’ (baseline stack) and stack: v2 (Desired: 1%, Actual: 1% each) • Observe metrics for the two stacks
96. GRADUAL DEPLOYMENTS stack: v1 (Desired: 80%, Actual: 80%) • stack: v1’ (baseline stack) and stack: v2 (Desired: 10%, Actual: 10% each) • Observe metrics for the two stacks
97. GRADUAL DEPLOYMENTS stack: v1 (Desired: 80%, Actual: 80%) • stack: v1’ (baseline stack) and stack: v2 (Desired: 10%, Actual: 10% each) • Observe metrics for the two stacks
98. GRADUAL DEPLOYMENTS stack: v1 (Desired: 100%, Actual: 100%) • stack: v1’ (baseline stack) and stack: v2 (Desired: 0%, Actual: 0%) • !DEPLOYMENT FAILED!
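The gradual deployment slides can be read as a loop: give the baseline stack and the new stack the same small traffic share, compare their metrics, then either continue or send everything back to v1. A minimal Python sketch of that loop follows. The function name, the error-rate callback and the tolerance are assumptions made for illustration; the slides do not say which metrics or thresholds are used.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Weights = Dict[str, int]


def gradual_rollout(
    steps: Sequence[int],
    error_rate: Callable[[str, int], float],
    tolerance: float = 0.01,
) -> Tuple[bool, List[Weights]]:
    """Shift traffic step by step; roll back if v2 does worse than the baseline."""
    history: List[Weights] = []
    for percent in steps:
        if not 0 < percent <= 50:
            raise ValueError("each step must be in the range 1..50 percent")
        # Baseline and new stack receive the same share so metrics are comparable.
        history.append({"v1": 100 - 2 * percent, "v1-baseline": percent, "v2": percent})
        if error_rate("v2", percent) > error_rate("v1-baseline", percent) + tolerance:
            history.append({"v1": 100, "v1-baseline": 0, "v2": 0})  # deployment failed
            return False, history
    history.append({"v1": 0, "v1-baseline": 0, "v2": 100})
    return True, history


def healthy(stack: str, percent: int) -> float:
    return 0.01  # hypothetical: both stacks show a 1% error rate


def degrading(stack: str, percent: int) -> float:
    # Hypothetical: v2 starts failing once it receives 10% of the traffic.
    return 0.2 if stack == "v2" and percent >= 10 else 0.01


ok, ok_history = gradual_rollout([1, 10], healthy)
failed_ok, failed_history = gradual_rollout([1, 10], degrading)
print(ok, ok_history[-1])
print(failed_ok, failed_history[-1])
```

Comparing v2 against a freshly started baseline instead of the long-running v1 avoids blaming the new version for effects such as cold caches, since both stacks start under the same conditions.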
99. OPENTRACING
100. APPLICATION LOGS
101. ALERT ON CUSTOMER IMPACT
102. CHALLENGES • Test coverage vs pipeline speed • Maintaining test/staging environments • Detecting (customer) impact quickly • Database schema changes (DB evolutions) • Keeping an overview of the microservice landscape
103. MICROSERVICES
104. MICROSERVICES monzo.com/blog/we-built-network-isolation-for-1-500-services
105. ZALANDO: DEPLOYMENTS/WEEK ❏ ~300 ❏ ~3000 ❏ ~7000 ❏ ~10000
107. CHALLENGES: Database schema changes (DB evolutions) "The outage was caused by a change in the database schema of the service which wasn't detected earlier in the testing and deployment process. It was resolved by switching to the software version that was compatible with the changed database schema."
108. ADD COLUMN: SAFE WITHOUT DOWNTIME stack: v1 • stack: v2: ALTER TABLE … ADD COLUMN ...
109. DROP COLUMN: OUTAGE! stack: v1 • stack: v2: ALTER TABLE … DROP COLUMN ...
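Why adding a column is safe while dropping one breaks the old version can be shown with an in-memory SQLite database. This is a sketch under stated assumptions: the `orders` table, its columns and the `v1_read` function are hypothetical, and the column drop is emulated by rebuilding the table so the example also runs on SQLite versions without `ALTER TABLE … DROP COLUMN`.

```python
import sqlite3


def v1_read(conn: sqlite3.Connection) -> list:
    """Stack v1 reads the columns it knows about by name."""
    return conn.execute("SELECT id, status FROM orders").fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders (id, status) VALUES (1, 'paid')")

# Migration shipped with stack v2: add a nullable column.
conn.execute("ALTER TABLE orders ADD COLUMN tracking_code TEXT")
rows_after_add = v1_read(conn)  # v1 keeps working while both stacks run

# Later migration: drop a column that v1 still reads (emulated by a rebuild).
conn.execute("CREATE TABLE orders_new (id INTEGER PRIMARY KEY, tracking_code TEXT)")
conn.execute("INSERT INTO orders_new SELECT id, tracking_code FROM orders")
conn.execute("DROP TABLE orders")
conn.execute("ALTER TABLE orders_new RENAME TO orders")

try:
    v1_read(conn)
    v1_error = None
except sqlite3.OperationalError as exc:
    v1_error = str(exc)  # v1 fails because the column is gone

print(rows_after_add)
print(v1_error)
```

During traffic switching both stacks talk to the same database, so a schema change has to stay compatible with every version that still receives traffic. Dropping a column is therefore only safe once no running version reads it.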
110. OUTAGES CAUSED BY DEPLOYMENTS ❏ Systematic issues (e.g. lack of test coverage) should be addressed ❏ It's most important to find the accountable person and teach him/her how to do releases properly ❏ There is no single root cause for incidents, multiple contributing factors must be considered ❏ Outages should make the team do fewer deployments
112. SUMMARY • You build it, you run it! • Git with short-lived branches • Continuous Deployment as default • Confidence via automated tests, gradual rollouts • No trade-off between speed and stability/quality! • 4 key metrics: Deployment Frequency, Lead Time, Change Fail Rate, MTTR
113. OPEN SOURCE & MORE • Open Source at Zalando: opensource.zalando.com • More Zalando Tech Talks: github.com/zalando/public-presentations
114. QUESTIONS? Henning Jacobs, Senior Principal • henning@zalando.de • @try_except_ • Illustrations by @01k
