Building and running applications at scale in Zalando
1. Building and running
applications at scale
in Zalando
Online fashion store Checkout case
By Pamela Canchanya
2. About Zalando
3.
4. About Zalando
- ~ 5.4 billion EUR revenue in 2018
- > 250 million visits per month
- > 15,500 employees in Europe
- > 70% of visits via mobile devices
- > 300,000 product choices
- > 26 million active customers
- ~ 2,000 brands
- 17 countries
5. Black Friday
at a glance
6. Zalando Tech
7. From monolith to microservice architecture
Reorganization
> 1000
microservices
8. Tech organization
- > 200 development teams
- > 1100 developers
- Platform
9. End to end responsibility
10. Checkout
Goal
“Allow customers to
buy seamlessly and
conveniently”
11. Checkout landscape
- Communication: REST & messaging
- Programming languages: Java, Scala, Node JS
- Data storage: Cassandra
- Configurations: ETCD
- Client side: React
- Infrastructure: AWS & Kubernetes
- Container: Docker
- Many more
12. Checkout architecture
[Diagram: Skipper routes requests to Tailor, which composes frontend fragments; a backend for frontend calls the Checkout service, which stores data in Cassandra; each component has its own dependencies.]
13. Checkout is a critical
component in the shopping journey
- Direct impact on business revenue
- Direct impact on customer experience
14. Checkout challenges
in a microservice ecosystem
- Increased points of failure
- Multiple dependencies evolving independently
15. Lessons learnt building
Checkout with
- Reliability patterns
- Scalability
- Monitoring
16. Building microservices
with reliability patterns
17. Checkout confirmation page
[Diagram: the confirmation page composed from the Cart, Delivery Destination, Delivery Service, and Payments Service.]
18. Checkout confirmation page
[Diagram: the same page with the Delivery Service failing.]
19. Unwanted error
20. Doing retries
for (var i = 1; i <= numRetries; i++) {
  try {
    return getDeliveryOptionsForCheckout(cart);
  } catch (error) {
    if (i === numRetries) {
      throw error; // retries exhausted, give up
    }
  }
}
21. Retry for transient errors
like a network error
or service overload
22. Retries for some errors
try {
  getDeliveryOptionsForCheckout(cart) match {
    case Success(result)  => // return the result
    case TransientFailure => // retry the operation
    case Error            => // throw the error
  }
} catch {
  case e: Exception => println("Delivery options exception")
}
23. Retries with exponential backoff
[Diagram: Attempt 1 fails, wait 100 ms; Attempt 2 fails, wait an exponentially increased backoff time; Attempt 3.]
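The schedule above can be sketched as a small helper. This is a sketch, not code from the talk: the function names, the 100 ms base delay, and the 2 s cap are illustrative assumptions.

```javascript
// Exponential backoff: double the wait after each failed attempt,
// capped at maxMs. Base delay and cap are illustrative values.
function backoffDelays(attempts, baseMs = 100, maxMs = 2000) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(baseMs * 2 ** i, maxMs));
  }
  return delays; // e.g. attempts = 3 -> [100, 200, 400]
}

// Retry an async operation, waiting between attempts; rethrow the last
// error once retries are exhausted (the failure becomes permanent).
async function retryWithBackoff(operation, attempts = 3) {
  const delays = backoffDelays(attempts);
  for (let i = 0; i < attempts; i++) {
    try {
      return await operation();
    } catch (error) {
      if (i === attempts - 1) throw error; // retries exhausted
      await new Promise((resolve) => setTimeout(resolve, delays[i]));
    }
  }
}
```

The doubling keeps a struggling downstream service from being hammered at a fixed rate, which is the point of backoff over plain retries.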
24. When retries are exhausted, the failure becomes permanent
25. Prevent execution of
operations that are
likely to fail
26. Circuit breaker pattern
Circuit breaker pattern - Martin
Fowler blog post
27. Open circuit: operations fail immediately
error rate > threshold (50%)
Target: getDeliveryOptionsForCheckout = failure
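A minimal version of this behavior can be sketched as follows. This is my own illustration, not Zalando's implementation: the class and method names are invented, the sliding-window size is an assumption, and a production breaker (as in Martin Fowler's post) would also have a half-open state to probe for recovery.

```javascript
// Minimal circuit breaker sketch. Tracks the outcome of the last
// `windowSize` calls; once the error rate exceeds `threshold`, the
// circuit opens and further calls fail fast without hitting the target.
class CircuitBreaker {
  constructor(operation, { threshold = 0.5, windowSize = 10 } = {}) {
    this.operation = operation;
    this.threshold = threshold;
    this.windowSize = windowSize;
    this.results = []; // sliding window: true = success, false = failure
  }

  get errorRate() {
    if (this.results.length === 0) return 0;
    const failures = this.results.filter((ok) => !ok).length;
    return failures / this.results.length;
  }

  get isOpen() {
    return this.errorRate > this.threshold;
  }

  record(ok) {
    this.results.push(ok);
    if (this.results.length > this.windowSize) this.results.shift();
  }

  async call(...args) {
    if (this.isOpen) {
      throw new Error('circuit open: failing fast'); // no call to the target
    }
    try {
      const result = await this.operation(...args);
      this.record(true);
      return result;
    } catch (error) {
      this.record(false);
      throw error;
    }
  }
}
```

Failing fast while the circuit is open is what prevents the execution of operations that are likely to fail, and gives the overloaded dependency room to recover.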
28. Fallback as an alternative to failure
Unwanted failure: no Checkout
Fallback: only the Standard delivery service with a default delivery promise
29. Putting it all together
- Do retries of operations with exponential backoff
- Wrap operations with a circuit breaker
- Handle failures with fallbacks when possible
- Otherwise, make sure to handle the exceptions
circuitCommand(
  getDeliveryOptionsForCheckout(cart)
    .retry(2)
)
.onSuccess(/* do something with the result */)
.onError(getDeliveryOptionsForCheckoutFallback)
30. Scaling microservices
31. Traffic pattern
32. Traffic pattern
33. Microservice infrastructure
[Diagram: incoming requests hit a load balancer and are distributed by instance; each instance runs a container built from the Zalando base image with a Node or JVM environment.]
34. Scaling horizontally
[Diagram: load balancer distributing traffic across three instances.]
35. Scaling horizontally
[Diagram: a fourth instance added behind the load balancer.]
36. Scaling vertically
[Diagram: load balancer with instances of a given size.]
37. Scaling vertically
[Diagram: the same number of instances, each container given more resources.]
38. Scaling consequences
Cassandra: more service connections, higher saturation, and the risk of an unhealthy database
39. Microservices cannot scale if their downstream microservices cannot scale
40. Low traffic rollouts
[Diagram: Service v1 (instances 1-4) serves 100% of traffic; Service v2 (instances 1-4) is rolled out with 0% traffic.]
41. High traffic rollouts
[Diagram: Service v1 (instances 1-4) serves 100% of traffic; Service v2 (instances 1-6) is rolled out with 0% traffic.]
42. Rollout with not enough capacity
43. Rollouts should allocate the same capacity as the version serving 100% of traffic
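That rule can be expressed as a tiny capacity helper. A sketch with invented names: the point is only that the new version must match the live version's instance count before traffic switches, so the fleet temporarily doubles.

```javascript
// Sketch: before switching traffic, v2 must have the same capacity as the
// version currently serving 100% of traffic. Function and field names are
// illustrative, not from the talk.
function rolloutCapacity(v1Instances) {
  const v2Instances = v1Instances; // match the live version's capacity
  return {
    v1: v1Instances,
    v2: v2Instances,
    totalDuringRollout: v1Instances + v2Instances, // fleet doubles briefly
  };
}
```

Under-provisioning v2 (the "rollout with not enough capacity" case above) means the moment traffic shifts, the new version is immediately saturated.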
44. Monitor microservices
45. Monitoring microservice ecosystem
Microservice
Application platform
Communication
Hardware
Four layer model of microservice
ecosystem
46. Monitoring microservice ecosystem
Microservice
Application platform
Communication
Hardware
Four layer model of microservice ecosystem
Infrastructure metrics
47. Monitoring microservice ecosystem
Microservice
Application platform
Communication
Hardware
Four layer model of microservice ecosystem
Microservice
metrics
48. First example
49. Hardware metrics
50. Communication metrics
51. Rate and responses of API endpoints
52. Dependencies metrics
53. Language specific metrics
54. Second Example
55. Infrastructure metrics
56. Node JS metrics
57. Frontend microservice metrics
58. Anti-pattern: using dashboards for outage detection
59. Alerting
“Something is broken, and somebody needs to fix it right now! Or,
something might break soon, so somebody should look soon.”
Practical Alerting - Monitoring distributed systems
Google SRE Book
60. Alert
Unhealthy instances: 1 of 5
Out of memory; the JVM is misconfigured
61. Alert
Service checkout is returning 4XX responses above the 25% threshold
A recent change broke the API contract for an unconsidered business rule
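The alert condition above can be sketched as a simple threshold check. The function and parameter names are my own; only the 25% threshold comes from the slide, and a real system would evaluate this over a time window in the monitoring tool rather than in application code.

```javascript
// Sketch of the 4XX alert condition: fire when the share of client-error
// responses in a recent window exceeds the threshold (25% on the slide).
function shouldAlertOn4xx(statusCodes, threshold = 0.25) {
  if (statusCodes.length === 0) return false; // no traffic, nothing to judge
  const clientErrors = statusCodes.filter((s) => s >= 400 && s < 500).length;
  return clientErrors / statusCodes.length > threshold;
}
```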
62. Alert
No orders in last 5 minutes
A downstream dependency is experiencing connectivity issues
63. Alert
Checkout database disk utilization is
80%
Saturation of data storage by an increase
in traffic
64. Alerts notify about
symptoms
65. Alerts should be
actionable
66. Incident response
Figure: the five stages of incident response, from Production-Ready Microservices
67. Example of a postmortem
Summary of incident:
No orders in the last 5 minutes, on 13.05.2019 between 16:00 and 16:45
Impact on customers:
2K customers could not complete checkout
Impact on business:
50K EUR lost from orders that could not be completed
Analysis of root cause:
Why were there no orders?
Action items:
...
68. Every incident should have a postmortem
69.
70. Preparing for Black Friday
- Business forecast
- Load testing of real customer journey
- Capacity planning
71. Checklist for every microservice involved in Black Friday
- Are the architecture and dependencies reviewed?
- Are the possible points of failure identified and mitigated?
- Are reliability patterns implemented?
- Are the configurations adjustable without the need of a deployment?
- Do we have a scaling strategy?
- Is monitoring in place?
- Are all alerts actionable?
- Is our team prepared for 24x7 incident management?
72. Situation room
73. Black Friday pattern of requests
> 4,200 orders per minute
74. My summary of learnings
- Think outside the happy path and
mitigate failures with reliability patterns
- Services are scalable proportionally
with their dependencies
- Monitor the microservice ecosystem
75. Resources
- Site reliability engineering
- Production-ready microservices
- Monitoring and alerting tool used by Zalando
- Tailor
- Skipper
- Load testing in Zalando
- Kubernetes in Zalando
76. Obrigada
Thank you
Danke
Contact
Pamela Canchanya
pam.cdm@posteo.net
@pamcdm