CONTINUOUS DELIVERY AT ZALANDO

1. CONTINUOUS DELIVERY AT ZALANDO HU BERLIN GUEST LECTURE 2020-01-29 HENNING JACOBS @try_except_

2. ABOUT ME Henning Jacobs, Senior Principal, Head of Developer Productivity • Zalando since 2010 • University Karlsruhe (TH), now KIT • Blog: srcco.de • Twitter: @try_except_

3.

4. ZALANDO AT A GLANCE • ~5.4 billion EUR revenue 2018 • >300 million visits per month • ~14,000 employees in Europe • >80% of visits via mobile devices • >400,000 product choices • >28 million active customers • >2,000 brands • 17 countries • as of June 2019

5. SOFTWARE PRODUCT DEVELOPMENT AT ZALANDO • 5 tech hubs across Europe • >2,000 employees in tech • from >100 nations • >200 cross-functional, agile delivery teams

6. HOME-BREWED SOFTWARE • >1100 developers • >200 development teams • >2000 applications

7. OSS WE BUILD ON • Java (OpenJDK) • Apache Tomcat • PostgreSQL • Python • JS, Scala, Go, .. • Kubernetes

8. YOU BUILD IT, YOU RUN IT "The traditional model is that you take your software to the wall that separates development and operations, and throw it over and then forget about it. Not at Amazon. You build it, you run it. This brings developers into contact with the day-to-day operation of their software. It also brings them into day-to-day contact with the customer." - A Conversation with Werner Vogels, ACM Queue, 2006

9. ON-CALL: YOU OWN IT, YOU RUN IT "When things are broken, we want people with the best context trying to fix things." - Blake Scrivener, Netflix SRE Manager

10. CI/CD “Continuous Integration is a software development practice where members of a team integrate their work frequently, usually each person integrates at least daily - leading to multiple integrations per day. ” “Continuous Delivery is the ability to get changes of all types— including new features, configuration changes, bug fixes and experiments—into production, or into the hands of users, safely and quickly in a sustainable way.” - Martin Fowler 10

11. CONTINUOUS DELIVERY

12. CONTINUOUS DELIVERY

13. WHY CONTINUOUS DELIVERY? • Lower risks: smaller batches, automation • Accelerate Time to Market • Build the Right Product • Increase Productivity and Efficiency • Improve Customer Satisfaction

14. CONTINUOUS DELIVERY "If it hurts, do it more frequently, and bring the pain forward." - Jez Humble

15. 2010 "Sysop-Test" "QA-Test"

16. UNDERSTANDING CYCLE TIME

17. SHORTEN CYCLE TIME docs.microsoft.com/en-us/azure/devops/learn/what-is-devops

18. THE PHOENIX PROJECT - 2013 "The Three Ways" 1. Systems Thinking 2. Amplify Feedback Loops 3. Culture of Continual Experimentation and Learning

19. DEVELOPER JOURNEY Consistent story that models all aspects of software development

20. Developer Journey

21. Developer Journey: Correctness, Compliance, GDPR, Security, Cost Efficiency, 24x7 On Call, Governance, Resilience, Capacity, ...

22. CLOUD NATIVE ".. uses an open source software stack to deploy applications as microservices, packaging each part into its own container, and dynamically orchestrating those containers to optimize resource utilization. Cloud native technologies enable software developers to build great products faster." - https://www.cncf.io/

23. CONTAINERS END-TO-END Setup, Code, Build, Test, Deploy, Operate • Cloud Native Application Runtime

24. CONTAINERS

25. CONTAINERS

26. CONTAINERS VS VIRTUAL MACHINES docker.com/resources/what-container

27. PLAN & SETUP

28. Plan: Stories, Rules of Play, Tech Radar

29.

30. Setup: Application Bootstrapping

31.

32.

33. BUILD & TEST

34. CONTINUOUS DELIVERY PLATFORM: BUILD Code, Git push, CDP

35. A typical pipeline:

36. A typical pipeline: Build some artifacts (Success) • Deploy some artifacts (Success) • Deploy some end-to-end tests (artifacts) (Failure) • Deploy some artifacts (Skipped)
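The Success / Success / Failure / Skipped pattern on this slide can be modelled in a few lines. This is only a sketch: the step names and the `run_pipeline` helper are made up for illustration and say nothing about how CDP is actually implemented.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Status(Enum):
    SUCCESS = "Success"
    FAILURE = "Failure"
    SKIPPED = "Skipped"


def run_pipeline(steps: List[Tuple[str, Callable[[], bool]]]) -> List[Tuple[str, Status]]:
    """Run steps in order; once one step fails, all later steps are skipped."""
    results = []
    failed = False
    for name, action in steps:
        if failed:
            results.append((name, Status.SKIPPED))
            continue
        ok = action()
        results.append((name, Status.SUCCESS if ok else Status.FAILURE))
        failed = not ok
    return results


# Hypothetical steps: the end-to-end tests fail, so the last deploy never runs.
example = run_pipeline([
    ("build", lambda: True),
    ("deploy-test", lambda: True),
    ("end-to-end-tests", lambda: False),
    ("deploy-production", lambda: True),
])
```

The point of the design is that a failing test step protects production: the final deploy action is never invoked.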

37.

38. PULL REQUESTS ❏ PRs should be open for at least 3 days to review ❏ PRs should be quick to understand and review ❏ PRs should ideally have <300 lines of changes

40. DEPLOY

41. Kubernetes Deploy

42. DEPLOYMENT CONFIGURATION
├── deploy/apply
│   ├── deployment.yaml
│   ├── credentials.yaml   # Zalando IAM
│   ├── ingress.yaml
│   └── service.yaml
└── delivery.yaml          # Zalando CI/CD

43. INGRESS.YAML
kind: Ingress
metadata:
  name: "..."
spec:
  rules:
    # DNS name your application should be exposed on
    - host: "myapp.foo.example.org"
      http:
        paths:
          - backend:
              serviceName: "myapp"
              servicePort: 80

44. CDP: DEPLOY "glorified kubectl apply"

45. CDP: OPTIONAL APPROVAL

46. CONTINUOUS DELIVERY PLATFORM

47. CLOUD FORMATION VIA CI/CD "Infrastructure as Code"
├── deploy/apply
│   ├── deployment.yaml    # Kubernetes
│   ├── cf-iam-role.yaml   # AWS IAM Role
│   ├── cf-rds.yaml        # AWS RDS Database
│   ├── kube-ingress.yaml
│   ├── kube-secret.yaml
│   └── kube-service.yaml
└── delivery.yaml          # CI/CD config

48. You build it, you run it! Deploy

49. EMERGENCY ACCESS SERVICE
Emergency access by referencing Incident:
  zkubectl cluster-access request \
    --emergency -i INC REASON
Privileged production access via 4-eyes:
  zkubectl cluster-access request REASON
  zkubectl cluster-access approve USERNAME
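The two access paths on this slide boil down to one rule: either an incident is referenced, or a second person approves. A minimal sketch of that rule follows; the `AccessRequest` class, its fields and the incident ID are hypothetical and do not describe how the emergency access service or `zkubectl` work internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessRequest:
    requester: str
    reason: str
    incident: Optional[str] = None      # set only for emergency access
    approved_by: Optional[str] = None

    def approve(self, approver: str) -> None:
        # 4-eyes principle: the second pair of eyes must belong to someone else.
        if approver == self.requester:
            raise PermissionError("requester cannot approve their own request")
        self.approved_by = approver

    def granted(self) -> bool:
        # Emergency access is granted by referencing an incident;
        # regular privileged access needs an approval from a colleague.
        return self.incident is not None or self.approved_by is not None


# Hypothetical usage
regular = AccessRequest("alice", "debug checkout latency")
emergency = AccessRequest("alice", "site down", incident="INC-123")
```

In this model `regular.granted()` stays false until someone other than `alice` calls `approve`, while `emergency.granted()` is true immediately and the incident reference serves as the audit trail.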

50. KUBERNETES WEB VIEW kubectl get pods,stacks,deploys,..

51. SEARCHING ACROSS 140+ CLUSTERS codeberg.org/hjacobs/kube-web-view

52. codeberg.org/hjacobs/kube-web-view

53. SUMMARY • Application Bootstrapping • Git as source of truth and UI • 4-eyes principle for master/production • Declarative Kubernetes API configuration and AWS CloudFormation

54. DEPLOYING CHANGES ❏ Every deployment must be approved by dedicated QA testers ❏ Every master commit should go to production ❏ Multiple changes should be bundled into one deployment

56. DELIVERY PERFORMANCE METRICS • Lead Time • Release Frequency • Time to Restore Service • Change Fail Rate srcco.de/posts/accelerate-software-delivery-performance.html

57. CONTAINERS From "Accelerate: The Science of Lean Software and DevOps"

58. DELIVERY PERFORMANCE METRICS • Lead Time ≙ Commit to Prod • Release Frequency ≙ Deploys/week/dev • Time to Restore Service ≙ MTRS from incidents • Change Fail Rate ≙ n/a
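The mappings above can be turned into simple calculations over deployment and incident records. The sketch below is illustrative only: the `Deployment` and `Incident` records, the `caused_failure` flag, and the choice of median and mean as aggregates are assumptions, not how Zalando measures these numbers (the slide itself lists Change Fail Rate as n/a).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    commit_time: datetime          # when the change was committed
    deploy_time: datetime          # when it reached production
    caused_failure: bool = False   # hypothetical flag


@dataclass
class Incident:
    started: datetime
    restored: datetime


def lead_time(deployments):
    """Lead Time: median commit-to-production duration."""
    return median(d.deploy_time - d.commit_time for d in deployments)


def release_frequency(deployments, weeks, developers):
    """Release Frequency: deploys per week per developer."""
    return len(deployments) / weeks / developers


def time_to_restore(incidents):
    """Time to Restore Service: mean incident duration."""
    total = sum((i.restored - i.started for i in incidents), timedelta())
    return total / len(incidents)


def change_fail_rate(deployments):
    """Change Fail Rate: share of deployments that caused a failure."""
    return sum(d.caused_failure for d in deployments) / len(deployments)


# Made-up sample data
t0 = datetime(2020, 1, 27, 9, 0)
deployments = [
    Deployment(t0, t0 + timedelta(hours=1)),
    Deployment(t0, t0 + timedelta(hours=2), caused_failure=True),
    Deployment(t0, t0 + timedelta(hours=6)),
    Deployment(t0, t0 + timedelta(hours=3)),
]
incidents = [
    Incident(t0, t0 + timedelta(minutes=30)),
    Incident(t0, t0 + timedelta(minutes=90)),
]
```

A median is used for lead time so that a single slow change does not dominate the figure; that is a design choice of this sketch, not something the talk prescribes.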

59. TRUNK-BASED DEVELOPMENT "We .. found that teams using branches that live a short amount of time (integration times less than a day) combined with short merging and integration periods (less than a day) do better in terms of software delivery performance than teams using longer-lived branches."

60. SHORT-LIVED GIT BRANCHES

61. SQUASHING

62.

63. PRE-COMMIT HOOKS pre-commit.com

64. UNIT TESTS

65. STATIC ANALYSIS WITH SONARQUBE

66. VULNERABILITY SCANNING IN GITHUB

67. FEATURE TOGGLES

68. STACKSET: TRAFFIC SWITCHING github.com/zalando-incubator/stackset-controller

69. TRAFFIC SWITCHING STEPS IN CDP github.com/zalando-incubator/stackset-controller

70. DEPLOYING TO PRODUCTION • https://my-application.io: My Application v1

71. DEPLOYING TO PRODUCTION • https://my-application.io: My Application v1 • https://preview-v2.my-application.io: My Application v2

72. DEPLOYING TO PRODUCTION (BLUE/GREEN) • https://my-application.io: My Application v1 • https://preview-v2.my-application.io: My Application v2

73. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING) • https://my-application.io: 99% My Application v1, 1% My Application v2 • https://preview-v2.my-application.io: My Application v2

74. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING) • https://my-application.io: 90% My Application v1, 10% My Application v2 • https://preview-v2.my-application.io: My Application v2

75. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING) • https://my-application.io: 50% My Application v1, 50% My Application v2 • https://preview-v2.my-application.io: My Application v2

76. DEPLOYING TO PRODUCTION (TRAFFIC SWITCHING) • https://my-application.io: 0% My Application v1, 100% My Application v2 • https://preview-v2.my-application.io: My Application v2

77. APPLICATION STACK IN KUBERNETES • Ingress resource, name: app-1 • Service resource • Deployment resource • labels: application: app-1

78. TRAFFIC SWITCHING: ROLLING UPDATE DEPLOYMENT • 100% deployment • Blue: 75%, Green: 25%

79. TRAFFIC SWITCHING: ROLLING UPDATE DEPLOYMENT • 100% deployment • Blue: 99%, Green: 1%

80. SKIPPER TRAFFIC SUPPORT
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-app
  annotations:
    zalando.org/backend-weights: |
      {"my-app-v1": 90, "my-app-v2": 10}
spec:
  rules:
    - host: my-app.io
      ...
github.com/zalando/skipper
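To get a feel for what the `zalando.org/backend-weights` annotation expresses, the weights can be read as relative probabilities for choosing a backend per request. The sketch below only models that idea with a weighted random choice; it is not Skipper's routing algorithm, and `parse_backend_weights` and `pick_backend` are hypothetical helper names.

```python
import json
import random
from collections import Counter


def parse_backend_weights(annotation: str) -> dict:
    """Parse the JSON value of the annotation into {backend: weight}."""
    weights = json.loads(annotation)
    if not weights or any(w < 0 for w in weights.values()) or sum(weights.values()) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    return weights


def pick_backend(weights: dict, rng: random.Random) -> str:
    """Choose one backend with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Simulate 10,000 requests against the weights from the slide.
rng = random.Random(42)
weights = parse_backend_weights('{"my-app-v1": 90, "my-app-v2": 10}')
counts = Counter(pick_backend(weights, rng) for _ in range(10_000))
```

Over many requests the share of `my-app-v2` approaches 10%, which is the property the traffic-switching slides rely on.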

81. MANUAL TRAFFIC SWITCHING: BLUE/GREEN DEPLOYMENT • IngressTemplate • 80%: service-v1, deployment-v1 • 20%: service-v2, deployment-v2

82. MANUAL TRAFFIC SWITCHING: BLUE/GREEN DEPLOYMENT • IngressTemplate • 0%: service-v1, deployment-v1 • 100%: service-v2, deployment-v2 • Need to clean up resources manually :’(

83. TRAFFIC SWITCHING: BLUE/GREEN DEPLOYMENT (StackSet) • 80%: stack: v1 • 20%: stack: v2

84. STACKSET DEFINITION
apiVersion: zalando.org/v1
kind: StackSet
metadata:
  name: my-app
spec:
  ingress:
    host: [my-application.io]
  stackTemplate:
    spec:
      version: v1
      podTemplate:
        spec:
          containers:
            ...
github.com/zalando-incubator/stackset-controller

85. ADDITIONAL FEATURES

86. PRESCALE STACKS • stack: v1: 100% • stack: v2: 0% • HPA: minReplicas: 3, maxReplicas: 30

87. PRESCALE STACKS • stack: v1: 0% • stack: v2: 100% • HPA: minReplicas: 3, maxReplicas: 30

88. PRESCALE STACKS • stack: v1: Desired: 0%, Actual: 100% • stack: v2: Desired: 100%, Actual: 0% • HPA: minReplicas: 3, maxReplicas: 30

89. PRESCALE STACKS • stack: v1: Desired: 0%, Actual: 100% • stack: v2: Desired: 100%, Actual: 0% • HPA: minReplicas: 3, maxReplicas: 30

90. PRESCALE STACKS • stack: v1: Desired: 0%, Actual: 0% • stack: v2: Desired: 100%, Actual: 100%
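The Desired/Actual distinction on the prescale slides can be read as a small reconcile loop: the desired weights change at once, but actual traffic only moves after the new stack has enough ready replicas. The sketch below assumes a simple sizing rule (the new stack needs its traffic share of the replicas currently serving). That rule, the `Stack` fields and the replica counts are illustrative assumptions, not the stackset-controller's actual logic.

```python
import math
from dataclasses import dataclass


@dataclass
class Stack:
    name: str
    ready_replicas: int
    desired_traffic: int      # percent requested by the user
    actual_traffic: int       # percent currently routed
    target_replicas: int = 0  # scale-up requested by the reconcile step


def reconcile(old: Stack, new: Stack) -> None:
    """Switch actual traffic only once the new stack is scaled up enough."""
    needed = math.ceil(old.ready_replicas * new.desired_traffic / 100)
    if new.ready_replicas >= needed:
        new.actual_traffic = new.desired_traffic
        old.actual_traffic = old.desired_traffic
    else:
        new.target_replicas = needed  # prescale first, leave traffic untouched


# v1 serves all traffic with 12 replicas; v2 starts at a minimum of 3.
v1 = Stack("v1", ready_replicas=12, desired_traffic=0, actual_traffic=100)
v2 = Stack("v2", ready_replicas=3, desired_traffic=100, actual_traffic=0)

reconcile(v1, v2)
first_pass = (v2.target_replicas, v1.actual_traffic, v2.actual_traffic)

v2.ready_replicas = v2.target_replicas  # pretend the new pods became ready
reconcile(v1, v2)
```

After the first pass traffic is unchanged and only a scale-up is requested; after the second pass the actual weights match the desired ones.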

91. GRADUAL DEPLOYMENTS (AUTOMATIC TRAFFIC SWITCHING)

92. INTEGRATE WITH APPLICATION METRICS

93. GRADUAL DEPLOYMENTS • stack: v1: Desired: 100%, Actual: 100% • stack: v2: Desired: 0%, Actual: 0%

94. GRADUAL DEPLOYMENTS • stack: v1: Desired: 100%, Actual: 100% • stack: v1’ (baseline stack) and stack: v2: Desired: 0%, Actual: 0% each

95. GRADUAL DEPLOYMENTS • stack: v1: Desired: 98%, Actual: 98% • stack: v1’ (baseline stack) and stack: v2: Desired: 1%, Actual: 1% each • Observe metrics for the two stacks

96. GRADUAL DEPLOYMENTS • stack: v1: Desired: 80%, Actual: 80% • stack: v1’ (baseline stack) and stack: v2: Desired: 10%, Actual: 10% each • Observe metrics for the two stacks

97. GRADUAL DEPLOYMENTS • stack: v1: Desired: 80%, Actual: 80% • stack: v1’ (baseline stack) and stack: v2: Desired: 10%, Actual: 10% each • Observe metrics for the two stacks

98. GRADUAL DEPLOYMENTS • stack: v1: Desired: 100%, Actual: 100% • stack: v1’ (baseline stack) and stack: v2: Desired: 0%, Actual: 0% each • !DEPLOYMENT FAILED!
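The baseline idea from these slides is that the new stack is compared against a fresh copy of the old version receiving the same share of traffic, and all traffic returns to v1 if the comparison fails. A rough sketch of that loop is below. The `gradual_rollout` function, the error-rate metric, the tolerance and the 1%/10% steps are assumptions for illustration and not the actual CDP implementation.

```python
from typing import Callable, Dict, List, Tuple


def gradual_rollout(
    error_rate: Callable[[str], float],
    steps: Tuple[int, ...] = (1, 10),
    tolerance: float = 0.01,
) -> Tuple[bool, List[Dict[str, int]]]:
    """Step traffic up while v2 is not worse than the baseline stack."""
    history = []
    for percent in steps:
        # Baseline and canary get the same share so their metrics are comparable.
        history.append({"v1": 100 - 2 * percent, "v1-baseline": percent, "v2": percent})
        if error_rate("v2") > error_rate("v1-baseline") + tolerance:
            # Deployment failed: all traffic goes back to v1.
            history.append({"v1": 100, "v1-baseline": 0, "v2": 0})
            return False, history
    history.append({"v1": 0, "v1-baseline": 0, "v2": 100})
    return True, history


# Hypothetical metric sources
healthy = lambda stack: 0.01
broken = lambda stack: 0.20 if stack == "v2" else 0.01

ok, ok_history = gradual_rollout(healthy)
failed, failed_history = gradual_rollout(broken)
```

Comparing against a freshly started baseline rather than the long-running v1 stack avoids blaming v2 for effects such as cold caches, which is the usual argument for a baseline stack.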

99. OPENTRACING

100. APPLICATION LOGS

101. ALERT ON CUSTOMER IMPACT

102. CHALLENGES • Test coverage vs pipeline speed • Maintaining test/staging environments • Detecting (customer) impact quickly • Database schema changes (DB evolutions) • Keeping an overview of the microservice landscape

103. MICROSERVICES

104. MICROSERVICES monzo.com/blog/we-built-network-isolation-for-1-500-services

105. ZALANDO: DEPLOYMENTS/WEEK ❏ ~300 ❏ ~3000 ❏ ~7000 ❏ ~10000

107. CHALLENGES: Database schema changes (DB evolutions) "The outage was caused by a change in the database schema of the service which wasn't detected earlier in the testing and deployment process. It was resolved by switching to the software version that was compatible with the changed database schema."

108. ADD COLUMN: SAFE WITHOUT DOWNTIME • stack: v1 and stack: v2 share the database • ALTER TABLE … ADD COLUMN ...

109. DROP COLUMN: OUTAGE! • stack: v1 and stack: v2 share the database • ALTER TABLE … DROP COLUMN ...
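Why ADD COLUMN is safe while DROP COLUMN breaks the old stack can be shown with an in-memory SQLite database. This is a sketch under assumptions: the `customer` table, its columns and both queries are hypothetical, and the column drop is emulated by rebuilding the table so the example does not depend on the SQLite version.

```python
import sqlite3

# Queries the two application versions are assumed to run.
V1_QUERY = "SELECT id, name, legacy_code FROM customer"
V2_QUERY = "SELECT id, name FROM customer"


def fresh_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, legacy_code TEXT)")
    db.execute("INSERT INTO customer (name, legacy_code) VALUES ('alice', 'A1')")
    return db


def works(db: sqlite3.Connection, query: str) -> bool:
    try:
        db.execute(query).fetchall()
        return True
    except sqlite3.OperationalError:
        return False


# ADD COLUMN: stack v1 simply ignores the new column.
db = fresh_db()
db.execute("ALTER TABLE customer ADD COLUMN email TEXT")
v1_survives_add = works(db, V1_QUERY)

# DROP COLUMN (emulated by a table rebuild): stack v1 still selects the column.
db = fresh_db()
db.executescript("""
    CREATE TABLE customer_new (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer_new (id, name) SELECT id, name FROM customer;
    DROP TABLE customer;
    ALTER TABLE customer_new RENAME TO customer;
""")
v1_survives_drop = works(db, V1_QUERY)
v2_survives_drop = works(db, V2_QUERY)
```

While both stacks receive traffic during a switch, the schema has to stay compatible with the oldest version still serving requests, so a column can only be dropped after v1 is gone.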

110. OUTAGES CAUSED BY DEPLOYMENTS ❏ Systematic issues (e.g. lack of test coverage) should be addressed ❏ It's most important to find the accountable person and teach him/her how to do releases properly ❏ There is no single root cause for incidents, multiple contributing factors must be considered ❏ Outages should make the team do fewer deployments

112. SUMMARY • You build it, you run it! • Git with short-lived branches • Continuous Deployment as default • Confidence via automated tests, gradual rollouts • No trade-off between speed and stability/quality! • 4 key metrics: Deployment Frequency, Lead Time, Change Fail Rate, MTTR

113. OPEN SOURCE & MORE • Open Source at Zalando: opensource.zalando.com • More Zalando Tech Talks: github.com/zalando/public-presentations

114. QUESTIONS? HENNING JACOBS SENIOR PRINCIPAL henning@zalando.de @try_except_ Illustrations by @01k