Building and running applications at scale in Zalando

1. Building and running applications at scale in Zalando Online fashion store Checkout case By Pamela Canchanya
2. About Zalando
3.
4. About Zalando: ~ 5.4 billion EUR revenue 2018; > 250 million visits per month; > 70% of visits via mobile devices; > 15,500 employees in Europe; > 300,000 product choices; ~ 2,000 brands; 17 countries; > 26 million active customers
5. Black Friday at a glance
6. Zalando Tech
7. From monolith to microservice architecture Reorganization > 1000 microservices
8. Tech organization: > 200 development teams, > 1100 developers, built on a shared platform
9. End to end responsibility
10. Checkout Goal “Allow customers to buy seamlessly and conveniently”
11. Checkout landscape: communication: REST & messaging; programming languages: Java, Scala, Node JS; infrastructure: AWS; data storage: Cassandra; configurations: ETCD; client side: React; containers: Docker & Kubernetes; and many more
12. Checkout architecture: Skipper (HTTP router), Tailor composing frontend fragments, a backend for frontend, the Checkout service backed by Cassandra, and external dependencies at every layer
13. Checkout is a critical component in the shopping journey - Direct impact in business revenue - Direct impact in customer experience
14. Checkout challenges in a microservice ecosystem - Increased points of failure - Multiple dependencies evolving independently
15. Lessons learnt building Checkout: reliability patterns, scalability, and monitoring
16. Building microservices with reliability patterns
17. Checkout confirmation page: delivery destination, delivery service, payments service, cart
18. Checkout confirmation page Delivery Service
19. Unwanted error
20. Doing retries (note: the original slide's loop condition `i < numRetries` made the final throw unreachable and skipped the last attempt; fixed to `<=`):
for (var i = 1; i <= numRetries; i++) {
  try {
    return getDeliveryOptionsForCheckout(cart)
  } catch (error) {
    if (i >= numRetries) {
      throw error
    }
  }
}
21. Retry for transient errors like a network error or service overload
22. Retries only for some errors:
getDeliveryOptionsForCheckout(cart) match {
  case Success(result)     => // return the result
  case TransientFailure(_) => // retry the operation
  case Error(e)            => // do not retry; log "Delivery options exception" and propagate
}
23. Retries with exponential backoff: attempt 1, wait 100 ms, attempt 2, wait an exponentially longer backoff time, attempt 3, and so on
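As a rough sketch (hypothetical helper names, not Zalando's actual code), the retry loop above combined with exponential backoff might look like this, using the slide's 100 ms base delay:

```javascript
// Illustrative retry-with-exponential-backoff helper (assumed names).
function backoffDelay(attempt, baseMs) {
  // attempt 1 -> baseMs, attempt 2 -> 2 * baseMs, attempt 3 -> 4 * baseMs ...
  return baseMs * Math.pow(2, attempt - 1);
}

async function retryWithBackoff(operation, numRetries, baseMs) {
  for (let attempt = 1; attempt <= numRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= numRetries) throw error; // retries exhausted: fail permanently
      // wait before the next attempt, doubling the delay each time
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt, baseMs)));
    }
  }
}
```

In production, a jitter term is usually added to the delay so that many clients retrying at once do not synchronize their attempts.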
24. When retries are exhausted, failures become permanent
25. Prevent execution of operations that are likely to fail
26. Circuit breaker pattern Circuit breaker pattern - Martin Fowler blog post
27. Open circuit: when the error rate of the target operation (getDeliveryOptionsForCheckout) exceeds a threshold, e.g. 50%, further calls fail immediately
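A minimal circuit breaker along these lines can be sketched as follows (illustrative only; a real service would use a battle-tested library such as Hystrix or resilience4j rather than this toy class):

```javascript
// Toy circuit breaker: opens when the failure rate over a sliding window
// of recent calls exceeds the threshold (e.g. 0.5 for the slide's 50%).
// While open, calls fail immediately instead of executing the operation.
class CircuitBreaker {
  constructor(threshold, windowSize) {
    this.threshold = threshold;   // allowed failure rate, e.g. 0.5
    this.windowSize = windowSize; // number of recent calls to consider
    this.results = [];            // true = success, false = failure
    this.open = false;
  }

  record(success) {
    this.results.push(success);
    if (this.results.length > this.windowSize) this.results.shift();
    const failures = this.results.filter((ok) => !ok).length;
    this.open = failures / this.results.length > this.threshold;
  }

  call(operation) {
    if (this.open) throw new Error("circuit open: failing fast");
    try {
      const result = operation();
      this.record(true);
      return result;
    } catch (error) {
      this.record(false);
      throw error;
    }
  }
}
```

A production breaker would also move to a half-open state after a timeout, letting a probe request through to test whether the target has recovered.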
28. Fallback as an alternative to failure. Unwanted failure: no Checkout. Fallback: only the standard delivery service with a default delivery promise
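The fallback idea can be sketched like this (function names and the default delivery promise are illustrative, not Zalando's real values): if fetching delivery options fails, return a conservative standard-delivery default so the customer can still check out.

```javascript
// Illustrative fallback: degrade gracefully instead of failing checkout.
function getDeliveryOptionsWithFallback(fetchOptions) {
  try {
    return fetchOptions();
  } catch (error) {
    // Fallback: only standard delivery with a default delivery promise
    return [{ service: "STANDARD", promise: "3-5 business days" }];
  }
}
```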
29. Putting it all together: do retries of operations with exponential backoff; wrap operations with a circuit breaker; handle failures with fallbacks when possible; otherwise make sure to handle the exceptions.
circuitCommand(
  getDeliveryOptionsForCheckout(cart).retry(2)
)
.onSuccess(// do something with the result)
.onError(getDeliveryOptionsForCheckoutFallback)
30. Scaling microservices
31. Traffic pattern
32. Traffic pattern
33. Microservice infrastructure: a load balancer distributes incoming requests across instances; each instance runs a container built from the Zalando base image with its runtime environment (JVM or Node)
34. Scaling horizontally: instances behind a load balancer
35. Scaling horizontally: handle more traffic by adding instances behind the load balancer
36. Scaling vertically: instances behind a load balancer
37. Scaling vertically: handle more traffic by giving each instance more resources
38. Scaling consequences: more instances mean more Cassandra connections, leading to saturation and a risk of an unhealthy database
39. Microservices cannot scale if their downstream microservices cannot scale
40. Low traffic rollouts: service v2 (traffic 0%) is deployed with 4 instances, the same as service v1 (traffic 100%)
41. High traffic rollouts: service v1 (traffic 100%) runs 6 instances, but service v2 (traffic 0%) starts with only 4
42. Rollout with not enough capacity
43. Rollouts should allocate the same capacity as the version serving 100% of the traffic
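This capacity rule can be made concrete with a small pre-rollout check (a hypothetical helper, assuming per-instance capacity of the new version equals the old version's): before shifting traffic, the new version must be provisioned for the load the old version currently serves.

```javascript
// Hypothetical pre-rollout capacity check (illustrative, not a Zalando tool).
function instancesNeeded(v1Instances, trafficShare) {
  // trafficShare is the fraction of traffic v2 will take over (1.0 = all of it)
  return Math.ceil(v1Instances * trafficShare);
}

function rolloutHasEnoughCapacity(v1Instances, v2Instances, trafficShare) {
  return v2Instances >= instancesNeeded(v1Instances, trafficShare);
}
```

With the slide's high-traffic example, 4 v2 instances against 6 v1 instances would fail this check for a full traffic switch.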
44. Monitor microservices
45. Monitoring the microservice ecosystem: the four-layer model (microservice, application platform, communication, hardware)
46. Four-layer model: infrastructure metrics (hardware, communication, application platform layers)
47. Four-layer model: microservice metrics (the microservice layer)
48. First example
49. Hardware metrics
50. Communication metrics
51. Rate and responses of API endpoints
52. Dependencies metrics
53. Language specific metrics
54. Second Example
55. Infrastructure metrics
56. Node JS metrics
57. Frontend microservice metrics
58. Anti-pattern: using dashboards for outage detection
59. Alerting: "Something is broken, and somebody needs to fix it right now! Or, something might break soon, so somebody should look soon." (Practical Alerting, Monitoring Distributed Systems - Google SRE Book)
60. Alert: 1 of 5 instances unhealthy. Cause: out of memory, the JVM was misconfigured
61. Alert: service checkout is returning 4XX responses above a 25% threshold. Cause: a recent change broke the API contract for an unconsidered business rule
62. Alert: no orders in the last 5 minutes. Cause: a downstream dependency is experiencing connectivity issues
63. Alert: Checkout database disk utilization is at 80%. Cause: saturation of data storage by an increase in traffic
64. Alerts notify about symptoms
65. Alerts should be actionable
66. Incident response: the five stages of incident response (figure from Production-Ready Microservices)
67. Example of a postmortem:
Summary of incident: no orders in the last 5 minutes, 13.05.2019 between 16:00 and 16:45
Impact on customers: 2K customers could not complete checkout
Impact on business: 50K euros lost in orders that could not be completed
Root cause analysis: why were there no orders?
Action items: ...
68. Every incident should have a postmortem
69.
70. Preparing for Black Friday - Business forecast - Load testing of real customer journey - Capacity planning
71. Checklist for every microservice involved in Black Friday:
- Are the architecture and dependencies reviewed?
- Are the possible points of failure identified and mitigated?
- Are reliability patterns implemented?
- Are the configurations adjustable without need of a deployment?
- Do we have a scaling strategy?
- Is monitoring in place?
- Are all alerts actionable?
- Is our team prepared for 24x7 incident management?
72. Situation room
73. Black Friday pattern of requests: > 4,200 orders per minute
74. My summary of learnings - Think outside the happy path and mitigate failures with reliability patterns - Services are only as scalable as their dependencies - Monitor the microservice ecosystem
75. Resources:
- Site Reliability Engineering
- Production-Ready Microservices
- Monitoring and alerting
- Tools used by Zalando: Tailor, Skipper
- Load testing in Zalando
- Kubernetes in Zalando
76. Obrigada Thank you Danke Contact Pamela Canchanya pam.cdm@posteo.net @pamcdm
