Building and running applications at scale in Zalando

1. Building and running applications at scale in Zalando Online fashion store Checkout case By Pamela Canchanya
2. About Zalando
3.
4. About Zalando: ~ 5.4 billion EUR revenue 2018; > 250 million visits per month; > 70% of visits via mobile devices; > 15,500 employees in Europe; > 300,000 product choices; ~ 2,000 brands; 17 countries; > 26 million active customers
5. Black Friday at a glance
6. Zalando Tech
7. From monolith to microservice architecture Reorganization > 1000 microservices
8. Tech organization: > 200 development teams, > 1100 developers, built on a shared platform
9. End to end responsibility
10. Checkout Goal “Allow customers to buy seamlessly and conveniently”
11. Checkout landscape: communication: REST & messaging; programming languages: Java, Scala, Node JS; infrastructure: AWS; data storage: Cassandra; configurations: ETCD; client side: React; containers: Docker & Kubernetes; and many more
12. Checkout architecture: Skipper (HTTP router), Tailor composing frontend fragments, a backend for frontend, the Checkout service backed by Cassandra, and external dependencies at every layer
13. Checkout is a critical component in the shopping journey - Direct impact in business revenue - Direct impact in customer experience
14. Checkout challenges in a microservice ecosystem - Increased points of failure - Multiple dependencies evolving independently
15. Lessons learnt building Checkout: reliability patterns, scalability, and monitoring
16. Building microservices with reliability patterns
17. Checkout confirmation page: delivery destination, delivery service, payments service, cart
18. Checkout confirmation page Delivery Service
19. Unwanted error
20. Doing retries (note: the original slide's loop condition `i < numRetries` made the final throw unreachable and skipped the last attempt; fixed to `<=`):
for (var i = 1; i <= numRetries; i++) {
  try {
    return getDeliveryOptionsForCheckout(cart)
  } catch (error) {
    if (i >= numRetries) {
      throw error
    }
  }
}
21. Retry for transient errors like a network error or service overload
22. Retries only for some errors:
getDeliveryOptionsForCheckout(cart) match {
  case Success(result)     => // return the result
  case TransientFailure(_) => // retry the operation
  case Error(e)            => // do not retry; log "Delivery options exception" and propagate
}
23. Retries with exponential backoff: attempt 1, wait 100 ms, attempt 2, wait an exponentially longer backoff time, attempt 3, and so on
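As a rough sketch (hypothetical helper names, not Zalando's actual code), the retry loop above combined with exponential backoff might look like this, using the slide's 100 ms base delay:

```javascript
// Illustrative retry-with-exponential-backoff helper (assumed names).
function backoffDelay(attempt, baseMs) {
  // attempt 1 -> baseMs, attempt 2 -> 2 * baseMs, attempt 3 -> 4 * baseMs ...
  return baseMs * Math.pow(2, attempt - 1);
}

async function retryWithBackoff(operation, numRetries, baseMs) {
  for (let attempt = 1; attempt <= numRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= numRetries) throw error; // retries exhausted: fail permanently
      // wait before the next attempt, doubling the delay each time
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt, baseMs)));
    }
  }
}
```

In production, a jitter term is usually added to the delay so that many clients retrying at once do not synchronize their attempts.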
24. When retries are exhausted, failures become permanent
25. Prevent execution of operations that are likely to fail
26. Circuit breaker pattern Circuit breaker pattern - Martin Fowler blog post
27. Open circuit: when the error rate of the target operation (getDeliveryOptionsForCheckout) exceeds a threshold, e.g. 50%, further calls fail immediately
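A minimal circuit breaker along these lines can be sketched as follows (illustrative only; a real service would use a battle-tested library such as Hystrix or resilience4j rather than this toy class):

```javascript
// Toy circuit breaker: opens when the failure rate over a sliding window
// of recent calls exceeds the threshold (e.g. 0.5 for the slide's 50%).
// While open, calls fail immediately instead of executing the operation.
class CircuitBreaker {
  constructor(threshold, windowSize) {
    this.threshold = threshold;   // allowed failure rate, e.g. 0.5
    this.windowSize = windowSize; // number of recent calls to consider
    this.results = [];            // true = success, false = failure
    this.open = false;
  }

  record(success) {
    this.results.push(success);
    if (this.results.length > this.windowSize) this.results.shift();
    const failures = this.results.filter((ok) => !ok).length;
    this.open = failures / this.results.length > this.threshold;
  }

  call(operation) {
    if (this.open) throw new Error("circuit open: failing fast");
    try {
      const result = operation();
      this.record(true);
      return result;
    } catch (error) {
      this.record(false);
      throw error;
    }
  }
}
```

A production breaker would also move to a half-open state after a timeout, letting a probe request through to test whether the target has recovered.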
28. Fallback as an alternative to failure. Unwanted failure: no Checkout. Fallback: only the standard delivery service with a default delivery promise
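The fallback idea can be sketched like this (function names and the default delivery promise are illustrative, not Zalando's real values): if fetching delivery options fails, return a conservative standard-delivery default so the customer can still check out.

```javascript
// Illustrative fallback: degrade gracefully instead of failing checkout.
function getDeliveryOptionsWithFallback(fetchOptions) {
  try {
    return fetchOptions();
  } catch (error) {
    // Fallback: only standard delivery with a default delivery promise
    return [{ service: "STANDARD", promise: "3-5 business days" }];
  }
}
```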
29. Putting it all together: do retries of operations with exponential backoff; wrap operations with a circuit breaker; handle failures with fallbacks when possible; otherwise make sure to handle the exceptions.
circuitCommand(
  getDeliveryOptionsForCheckout(cart).retry(2)
)
.onSuccess(// do something with the result)
.onError(getDeliveryOptionsForCheckoutFallback)
30. Scaling microservices
31. Traffic pattern
32. Traffic pattern
33. Microservice infrastructure: a load balancer distributes incoming requests across instances; each instance runs a container built from the Zalando base image with its runtime environment (JVM or Node)
34. Scaling horizontally: instances behind a load balancer
35. Scaling horizontally: handle more traffic by adding instances behind the load balancer
36. Scaling vertically: instances behind a load balancer
37. Scaling vertically: handle more traffic by giving each instance more resources
38. Scaling consequences: more instances mean more Cassandra connections, leading to saturation and a risk of an unhealthy database
39. Microservices cannot scale if their downstream microservices cannot scale
40. Low traffic rollouts: service v2 (traffic 0%) is deployed with 4 instances, the same as service v1 (traffic 100%)
41. High traffic rollouts: service v1 (traffic 100%) runs 6 instances, but service v2 (traffic 0%) starts with only 4
42. Rollout with not enough capacity
43. Rollouts should allocate the same capacity as the version serving 100% of the traffic
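This capacity rule can be made concrete with a small pre-rollout check (a hypothetical helper, assuming per-instance capacity of the new version equals the old version's): before shifting traffic, the new version must be provisioned for the load the old version currently serves.

```javascript
// Hypothetical pre-rollout capacity check (illustrative, not a Zalando tool).
function instancesNeeded(v1Instances, trafficShare) {
  // trafficShare is the fraction of traffic v2 will take over (1.0 = all of it)
  return Math.ceil(v1Instances * trafficShare);
}

function rolloutHasEnoughCapacity(v1Instances, v2Instances, trafficShare) {
  return v2Instances >= instancesNeeded(v1Instances, trafficShare);
}
```

With the slide's high-traffic example, 4 v2 instances against 6 v1 instances would fail this check for a full traffic switch.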
44. Monitor microservices
45. Monitoring the microservice ecosystem: the four-layer model (microservice, application platform, communication, hardware)
46. Four-layer model: infrastructure metrics (hardware, communication, application platform layers)
47. Four-layer model: microservice metrics (the microservice layer)
48. First example
49. Hardware metrics
50. Communication metrics
51. Rate and responses of API endpoints
52. Dependencies metrics
53. Language specific metrics
54. Second Example
55. Infrastructure metrics
56. Node JS metrics
57. Frontend microservice metrics
58. Anti-pattern: using dashboards for outage detection
59. Alerting: "Something is broken, and somebody needs to fix it right now! Or, something might break soon, so somebody should look soon." (Practical Alerting, Monitoring Distributed Systems - Google SRE Book)
60. Alert: 1 of 5 instances unhealthy. Cause: out of memory, the JVM was misconfigured
61. Alert: service checkout is returning 4XX responses above a 25% threshold. Cause: a recent change broke the API contract for an unconsidered business rule
62. Alert: no orders in the last 5 minutes. Cause: a downstream dependency is experiencing connectivity issues
63. Alert: Checkout database disk utilization is at 80%. Cause: saturation of data storage by an increase in traffic
64. Alerts notify about symptoms
65. Alerts should be actionable
66. Incident response: the five stages of incident response (figure from Production-Ready Microservices)
67. Example of a postmortem:
Summary of incident: no orders in the last 5 minutes, 13.05.2019 between 16:00 and 16:45
Impact on customers: 2K customers could not complete checkout
Impact on business: 50K euros lost in orders that could not be completed
Root cause analysis: why were there no orders?
Action items: ...
68. Every incident should have a postmortem
69.
70. Preparing for Black Friday - Business forecast - Load testing of real customer journey - Capacity planning
71. Checklist for every microservice involved in Black Friday:
- Are the architecture and dependencies reviewed?
- Are the possible points of failure identified and mitigated?
- Are reliability patterns implemented?
- Are the configurations adjustable without need of a deployment?
- Do we have a scaling strategy?
- Is monitoring in place?
- Are all alerts actionable?
- Is our team prepared for 24x7 incident management?
72. Situation room
73. Black Friday pattern of requests: > 4,200 orders per minute
74. My summary of learnings - Think outside the happy path and mitigate failures with reliability patterns - Services are only as scalable as their dependencies - Monitor the microservice ecosystem
75. Resources:
- Site Reliability Engineering
- Production-Ready Microservices
- Monitoring and alerting
- Tools used by Zalando: Tailor, Skipper
- Load testing in Zalando
- Kubernetes in Zalando
76. Obrigada Thank you Danke Contact Pamela Canchanya pam.cdm@posteo.net @pamcdm
