A Field Guide to Reliability Engineering at Zalando
1. A Field Guide to Reliability Engineering at Zalando
goto; Amsterdam 2024 - Heinrich Hartmann
2. 👋 I'm Heinrich - Reliability Engineer
Experience
Senior Principal SRE (2021)
Chief Data Scientist (2015)
PhD in Mathematics (2011)
Talking Reliability since 2015
• SRECon - Statistics for Engineers
• DevOps Berlin - Zalando's quest to Operate 10K…
• SLOConf - The State of the Histogram
• P99 Conf - How to measure Latency
• FOSDEM - Latency SLOs Done Right
• Circllhist - A Histogram Data Structure… (arxiv)
Find me on
heinrichhartmann.com
LinkedIn, X
goto; Amsterdam 2024 - Heinrich Hartmann @ Zalando
3. Menu
1. Principles
2. Context
3. Operations at Zalando
a. Alerting
b. Dashboards
c. Observability
d. Incident Process
e. SLOs
f. WORMs
4. Principles
5. Mission
Protect the User Experience from operational
failures while keeping an eye on (1) Developer
Productivity and (2) On-Call Health.
6. #1 Rule of Operations
Obsess about User Experience.
8. #2 Rule of Operations
Engineering for Reliability
involves people as much as it
involves technology.
9. Engineering Reliability at Scale
Small Company (~10 FTE) Medium Company (~100 FTE) Large Company (>1k FTE)
- Alerts & Dashboards
- Logging - Incident Management
- Observability
- On-call rotations
- Playbooks
- WORM Meeting - WORM Cascades
- Risk Management
- SRE Community & Guilds
People Problems
Technical Problems
10. Engineering Socio-Technological Systems
with "Systems Theory"
Example: Causal Loop Diagram - source: wikipedia
11. c/o Martin Thwaites @ Honeycomb, GOTO 2024
12. Reliability "Flywheel" at Zalando
13. Context
14. • One of the leading fashion platforms in the EU
• Founded in 2008
• 14.6 bn EUR Revenue / 50M+ active Customers
• 25 Countries
• 3K Tech Employees
• 3K+ Micro Services
15. Zalando Service Graph
16. Don't separate People and Technology
Conway's law: Technology Structures mirror People Structures.
Law of DevOps: You build it, you run it!
17. Systems Model of Zalando
Management: ~25 Directors
Engineering: 250 teams, 3.5k Applications
Platform: ~20 teams
• k8s, Postgres, Kafka, ...
• CI/CD, GHE, …
• Telemetry Backends
18. Where do we stand?
+ Operating "transactional" Microservices
+ Protecting the Business
+ Preparing for High-Load Events
- Understanding User Experience
- Reliability of Data Systems / Business Processes
19. Operations at Zalando
20. Alerting
21. Why Alerting?
Reduce Time to Detect user-facing issues.
22. Alerting as Feedback Loop
[Diagram: feedback loop between the states Normal Operation ⚙ and Faulty Operation 🔥, with the labels: Problem Occurred, Anomaly!, ALERTING! 📟🧐, Incident, Mitigation 🔨🧐, and self healing]
23. #3 Rule of Operations
Alert on User Experience ("Symptoms")
not on Server Experience ("Causes").
- Alert on error rates of user-facing "operations"
- Leverage SLO-based Alerting (if available)
- Don't alert on CPU Utilization
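As a minimal sketch of the first bullet, an alert can be evaluated on the error rate of a user-facing operation instead of on host metrics. The `OperationWindow` type, the 5% threshold, and the minimum-traffic guard are illustrative assumptions, not Zalando's actual alerting rules.

```python
from dataclasses import dataclass


@dataclass
class OperationWindow:
    """Request counts for one user-facing operation over an evaluation window."""
    operation: str
    requests: int
    errors: int


def error_rate(window: OperationWindow) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    if window.requests == 0:
        return 0.0
    return window.errors / window.requests


def should_alert(
    window: OperationWindow,
    threshold: float = 0.05,   # hypothetical: alert above 5% errors
    min_requests: int = 100,   # hypothetical: ignore windows with too little traffic
) -> bool:
    """Alert on the symptom (users seeing errors), not on a cause such as CPU load."""
    return window.requests >= min_requests and error_rate(window) > threshold
```

For example, `should_alert(OperationWindow("POST /carts", requests=1000, errors=80))` fires, while the same operation with 10 errors does not.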
24. This is fine.
26. Adding alerts trades off Reliability against On-Call Health
27. Review On-Call Health Weekly!
28. Dashboards
29. Why Dashboards?
• Reduce Time to Repair
• Look at them when you get alerted. Don't monitor dashboards.
• Starting point for understanding Service Health
• Every Application MUST have an Application Dashboard.
• Managed Services come with Managed Dashboards.
30. Managed Kubernetes Dashboard
31. Managed REDIS Dashboard
32. Managed JVM Internals Dashboard
33. Zalando Application Dashboard Guidelines
1. Golden Signals
2. Entry Points
3. Dependencies
4. Saturation
5. Operational Insights
6. Storage
courtesy of Evgeni Sokolov & Miha Lunar
34. Golden Signals Row - RED(S)
Duration
Requests
Saturation
Errors
w/ Evgeni Sokolov & Miha Lunar
35. Entry Points Row
Golden Signals, again! - RED
Duration
Requests
Errors
POST /carts
POST /card-details
w/ Evgeni Sokolov & Miha Lunar
36. Saturation Row
… everything that can get saturated.
w/ Evgeni Sokolov & Miha Lunar
37. Observability
38. Why Observability?
• Reduce Time to Repair
• Debug failures across team boundaries
• Understand User-Experience
• Basis for Alerting, Dashboards, Reporting, …
39. Traditional Monitoring
[Diagram: four teams, each with its own Logs and Metrics]
Is my application healthy? Which errors does it throw?
40. Observability
[Diagram: four teams connected through shared Traces]
Is the user happy? Which operation is failing?
41. Example Trace from Zalando Front Page
[Trace screenshot, annotated with Team "CIA" and Application "CuCo"]
42. Zalando Developer Observability Guidelines
1. Use OpenTelemetry to instrument Applications.
2. Use Distributed Tracing to understand system behavior in the context of
transactions (e.g. HTTP requests).
3. Metrics for precise counts & global resource statistics
4. Structured Logging for Lifecycle events
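Guideline 4 can be illustrated with a small sketch that uses only the Python standard library. It emits lifecycle events as one JSON object per line. The `JsonFormatter` class, the `fields` key, and the event names are hypothetical choices for illustration and are not part of the Zalando guidelines.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Structured fields passed via extra={"fields": {...}} are merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = make_logger("cart-service")  # hypothetical service name
    log.info("application_started", extra={"fields": {"version": "1.2.3"}})
    log.info("application_stopping", extra={"fields": {"reason": "SIGTERM"}})
```

Keeping the event name fixed and putting the variable parts into fields makes lifecycle events easy to filter and count in a log backend.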
43. Monitor Reliability of Operations with "RED" Metrics
Operation: Reset Password
Requests
Errors
Duration
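To make the three RED signals concrete, here is a rough sketch that derives them from raw request records of a single operation. The `Request` type and the choice of a nearest-rank p99 as the duration summary are assumptions for illustration; a production setup would more likely aggregate histograms in the telemetry backend.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    duration_ms: float
    ok: bool


def red_metrics(requests: Sequence[Request], window_seconds: float) -> dict:
    """Summarize Requests, Errors and Duration for one operation over one window."""
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)

    p99: Optional[float] = None
    if durations:
        # Nearest-rank percentile: rank = ceil(0.99 * n), in integer arithmetic.
        rank = (99 * total + 99) // 100
        p99 = durations[rank - 1]

    return {
        "request_rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p99_duration_ms": p99,
    }
```

Tracking these per operation (for example "Reset Password") rather than per host keeps the numbers tied to what the user experiences.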
44. Observability SDKs
based on OpenTelemetry
#!/usr/bin/env python3
# Note: observability_sdk is the SDK shown on the slide, not a public package;
# "..." marks values that were elided on the slide.
import observability_sdk as obs

# Hook up Zalando backends
obs.initialize()

# Custom span
@obs.trace(name=..., attributes={...})
def add_to_cart():
    ...

# Custom metric
req_counter = obs.create_counter(
    name="total_requests",
    description="Total number of requests served",
    attributes={...},
    unit="1",
    value_type=int,
)

def handle_request():
    req_counter.inc()
45. SLOs
46. Why SLOs?
• Provide Top-Down understanding of Reliability provided to the user
• Steer engineering investments into Reliability
• Quantify impact of incidents
• … also derive high-quality alerting rules
47. #4 Rule of Operations
SLIs quantify the reliability of a User Experience.
SLOs are Reliability targets for managerial steering.
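The relation between an SLI and an SLO can be sketched in a few lines. This assumes an availability-style SLI (good events divided by total events). The function names and the example target are illustrative and do not come from the talk.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that were good for the user (1.0 with no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget left.

    1.0 means untouched, 0.0 means fully spent, negative means the SLO was missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    observed_failure = 1.0 - sli
    return 1.0 - observed_failure / allowed_failure


if __name__ == "__main__":
    sli = availability_sli(good_events=99_950, total_events=100_000)
    # With a hypothetical 99.9% target, about half of the budget is left.
    print(f"SLI={sli:.4%}, budget left={error_budget_remaining(sli, 0.999):.0%}")
```

The SLI is a measurement; the SLO target is the number that management reviews and steers against.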
48. Zalando SLOs on Business Operations
49. SLO Table Reviewed by Management
50. SLOs are used to Prioritize Engineering Investments
51. SLOs are also used to tune Alerting Sensitivity
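One common way to derive alert sensitivity from an SLO is a multi-window burn-rate check. The slide does not say which method Zalando uses, so the sketch below is an assumption. The default threshold of 14.4 is the widely used fast-burn value for a 30-day SLO (2% of the budget consumed in one hour).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is consumed; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumption: fast-burn threshold for a 30-day SLO
) -> bool:
    """Page only if the budget burns fast in both windows.

    The long window (e.g. 1h) shows the problem is significant; the short
    window (e.g. 5m) shows it is still ongoing.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

A stricter SLO target lowers the error ratio at which the same threshold fires, which is how the SLO tunes alert sensitivity.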
52. Decouple Alerting/Reporting SLOs to get more value!
53. Incident Process
54. #5 Rule of Operations
Past Failures lead the way towards future Reliability.
55. Incident Process as Feedback Loop
[Diagram: feedback loop linking Incident 🔥📟, Post Mortem 📄🧐, Review, Action Items ✅❌, and Improvements ⚙🔨, which reduce the risk of future incidents]
56. Zalando Incident Process
1. Impact
2. Root Cause
3. Action Items
57. Zalando Severity Definitions
SEV1 - Ownership: Vice President
Example Incidents
● Order Drop
● AWS Zone Outage
SEV2 - Ownership: Director
Example Incidents
● Payments processor degraded
● Order confirmation emails delayed
SEV3 - Ownership: Head of Engineering
Example Incidents
● Users don't receive voucher
● Lounge users see non-personalised articles
58. Incident Insights every Quarter
GMV Loss Distribution by Root Cause in Q?/20??
59. Weekly Operational Review Meeting (WORM)
60. #6 Rule of Operations
You get what you inspect.
61. Reliability Reports
Supporting WORM Meetings on all Levels
Auto Generated Google Doc
WORM Agenda
• Incident Review -> Patterns?
• SLO Review
• Open Post Mortems
• On-Call Health
62. Zalando WORM Cascade
[Diagram: the incident loop (Incident 🔥📟, Post Mortem 📄🧐, Action Items ✅❌, Improvements ⚙🔨, which reduce the risk of incidents) is reviewed in the WORM Meeting 🐛, which feeds into the WORM of WORM 🐛🐛, which feeds into the Global WORM 🐛🐛🐛]
63. Rules of Operations
1. Obsess about User Experience.
2. Engineering for Reliability involves People & Technology.
3. Alert on User Pain ("Symptoms") not Server Pain ("Causes").
4. SLIs quantify the reliability of a User Experience.
5. Past Failures lead the way towards future Reliability.
6. You get what you inspect.
Thank you!
> Heinrich@HeinrichHartmann.com
#Let's talk Reliability! 💚