A Field Guide to Reliability Engineering at Zalando

1. A Field Guide to Reliability Engineering at Zalando goto; Amsterdam 2024 - Heinrich Hartmann
2. 👋 I'm Heinrich - Reliability Engineer. Experience: Senior Principal SRE (2021), Chief Data Scientist (2015), PhD in Mathematics (2011). Talking Reliability since 2015: • SRECon - Statistics for Engineers • DevOps Berlin - Zalando's quest to Operate 10K… • SLOConf - The State of the Histogram • P99 Conf - How to measure Latency • FOSDEM - Latency SLOs Done Right • Circllhist - A Histogram Data Structure… (arxiv). Find me on heinrichhartmann.com, LinkedIn, X. goto; Amsterdam 2024. Heinrich Hartmann @ Zalando
3. Menu: (1) Principles (2) Context (3) Operations at Zalando: (a) Alerting (b) Dashboards (c) Observability (d) Incident Process (e) SLOs (f) WORMs
4. Principles
5. Mission: Protect the User Experience from operational failures while keeping an eye on (1) Developer Productivity and (2) On-Call Health.
6. #1 Rule of Operations: Obsess about User Experience.
8. #2 Rule of Operations: Engineering for Reliability involves people as much as it involves technology.
9. Engineering Reliability at Scale. Company sizes: Small (~10 FTE), Medium (~100 FTE), Large (>1k FTE). Practices, in slide order: Alerts & Dashboards; Logging; Incident Management; Observability; On-call rotations; Playbooks; WORM Meeting; WORM Cascades; Risk Management; SRE Community & Guilds. Axis labels: People Problems, Technical Problems.
10. Engineering Socio-Technological Systems with "Systems Theory". Example: Causal Loop Diagram (source: Wikipedia)
11. ℅ Martin Thwaites @ Honeycomb, GOTO 2024
12. Reliability "Flywheel" at Zalando
13. Context
14. • One of the leading fashion platforms in EU • Founded in 2008 • 14.6 bn EUR Revenue / 50M+ active Customers • 25 Countries • 3K Tech Employees • 3K+ Micro Services
15. Zalando Service Graph
16. Don't separate People and Technology. Conway's law: Technology Structures mirror People Structures. Law of DevOps: You build it, you run it!
17. Systems Model of Zalando. Management: ~25 Directors. Engineering: 250 teams, 3.5k Applications. Platform: ~20 teams; k8s, Postgres, Kafka, ...; CI/CD, GHE, …; Telemetry Backends.
18. Where do we stand? (+) Operating "transactional" Microservices (+) Protecting the Business (+) Preparing for High-Load Events (-) Understanding User Experience (-) Reliability of Data Systems / Business Processes
19. Operations at Zalando
20. Alerting
21. Why Alerting? Reduce Time to Detect user-facing issues.
22. Alerting as Feedback Loop. Diagram labels: Normal Operation ⚙, Problem Occurred, Faulty Operation 🔥, self healing, ALERTING!, Anomaly! 📟🧐, Incident 🔨🧐, Mitigation.
23. #3 Rule of Operations: Alert on User Experience ("Symptoms") not on Server Experience ("Causes"). • Alert on error rates of user-facing "operations" • Leverage SLO-based Alerting (if available) • Don't alert on CPU Utilization
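Rule #3 can be illustrated with a small sketch. The `OperationWindow` type, the 1% threshold and the minimum-traffic guard are assumptions made for this example; they are not Zalando's actual alerting rules.

```python
from dataclasses import dataclass


@dataclass
class OperationWindow:
    """Request counts for one user-facing operation over an evaluation window."""
    operation: str  # e.g. "POST /carts"
    requests: int   # total requests seen in the window
    errors: int     # requests that failed from the user's point of view


def error_rate(window: OperationWindow) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    if window.requests == 0:
        return 0.0
    return window.errors / window.requests


def should_alert(window: OperationWindow,
                 threshold: float = 0.01,     # hypothetical: alert above 1% errors
                 min_requests: int = 100) -> bool:  # hypothetical traffic guard
    """Alert on the symptom (users see errors), never on a cause such as CPU load."""
    if window.requests < min_requests:
        return False  # too little traffic to judge the error rate
    return error_rate(window) > threshold
```

The check takes no CPU or memory input at all: a busy server that still answers every request correctly does not page anyone.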
24. This is fine.
26. Adding alerts trades off Reliability (+) against On-Call Health (-).
27. Review On-Call Health Weekly!
28. Dashboards
29. Why Dashboards? • Reduce Time to Repair • Look at them when you get alerted. Don't monitor dashboards. • Starting point for understanding Service Health • Every Application MUST have an Application Dashboard. • Managed Services come with Managed Dashboards.
30. Managed Kubernetes Dashboard
31. Managed REDIS Dashboard
32. Managed JVM Internals Dashboard
33. Zalando Application Dashboard Guidelines: (1) Golden Signals (2) Entry Points (3) Dependencies (4) Saturation (5) Operational Insights (6) Storage. Courtesy of Evgeni Sokolov & Miha Lunar
34. Golden Signals Row - RED(S): Requests, Errors, Duration, Saturation. w/ Evgeni Sokolov & Miha Lunar
35. Entry Points Row: Golden Signals, again! - RED: Requests, Errors, Duration for each entry point, e.g. POST /carts, POST /card-details. w/ Evgeni Sokolov & Miha Lunar
36. Saturation Row: … everything that can get saturated. w/ Evgeni Sokolov & Miha Lunar
37. Observability
38. Why Observability? • Reduce Time to Repair • Debug failures across team boundaries • Understand User-Experience • Basis for Alerting, Dashboards, Reporting, …
39. Traditional Monitoring: every Team has its own Logs and Metrics. Questions: Is my application healthy? Which errors does it throw?
40. Observability: Teams, linked by Traces. Questions: Is the user happy? Which operation is failing?
41. Example Trace from Zalando Front Page. Team "CIA", Application "CuCo"
42. Zalando Developer Observability Guidelines: (1) Use OpenTelemetry to instrument Applications. (2) Use Distributed Tracing to understand system behavior in the context of transactions (e.g. HTTP requests). (3) Use Metrics for precise counts & global resource statistics. (4) Use Structured Logging for Lifecycle events.
43. Monitor Reliability of Operations with "RED" Metrics. Operation: Reset Password. Requests, Errors, Duration.
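As a rough illustration of RED metrics for a single operation, the sketch below derives Requests, Errors and Duration from a list of request records. The `RequestRecord` type, the `red_metrics` helper and the nearest-rank percentile are assumptions made for this example, not part of Zalando's tooling.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    operation: str      # name of the user-facing operation
    duration_ms: float  # how long the request took
    ok: bool            # False if the request failed


def percentile(sorted_values, p):
    """Nearest-rank percentile for 0 < p <= 100; None for empty input."""
    if not sorted_values:
        return None
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def red_metrics(records, operation, window_seconds):
    """Compute Requests, Errors and Duration for one operation in one window."""
    selected = [r for r in records if r.operation == operation]
    durations = sorted(r.duration_ms for r in selected)
    errors = sum(1 for r in selected if not r.ok)
    return {
        "requests_per_second": len(selected) / window_seconds,
        "error_ratio": errors / len(selected) if selected else 0.0,
        "duration_p50_ms": percentile(durations, 50),
        "duration_p99_ms": percentile(durations, 99),
    }
```

Duration is reported as percentiles rather than a mean, so that a few very slow requests remain visible instead of being averaged away.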
44. Observability SDKs based on OpenTelemetry

    #!/usr/bin/env python3
    import observability_sdk as obs

    # Hook-up Zalando Backends
    obs.initialize()

    # Custom span
    @obs.trace(name=..., attributes={...})
    def add_to_cart(): ...

    # Custom metric
    req_counter = obs.create_counter(
        name="total_requests",
        description="Total number of requests served",
        attributes={...},
        unit="1",
        value_type=int,
    )

    def handle_request():
        req_counter.inc()
45. SLOs
46. Why SLOs? • Provide Top-Down understanding of Reliability provided to the user • Steer engineering investments into Reliability • Quantify impact of incidents • … also derive high-quality alerting rules
47. #4 Rule of Operations: SLIs quantify the reliability of a User Experience. SLOs are Reliability targets for managerial steering.
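The relation between an SLI and an SLO can be made concrete with a small worked example. The function names and the sample numbers (a 99.9% target, one million requests) are assumptions chosen for illustration, not Zalando figures.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events that gave the user a good experience."""
    if total_events == 0:
        return 1.0  # no traffic means nothing failed
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent.

    1.0 means untouched, 0.0 means exhausted, negative means the SLO was missed.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target   # failure fraction the SLO tolerates
    observed_failure = 1.0 - sli         # failure fraction actually measured
    return 1.0 - observed_failure / allowed_failure


# Hypothetical numbers: 999,500 good requests out of 1,000,000, target 99.9%.
sli = availability_sli(999_500, 1_000_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
```

With these sample numbers, half of the allowed failures have been used up, which is the kind of figure a management review can act on.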
48. Zalando SLOs on Business Operations
49. SLO Table Reviewed by Management
50. SLOs are used to Prioritize Engineering Investments
51. SLOs are also used to tune Alerting Sensitivity
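One common way to tie alerting sensitivity to an SLO is burn-rate alerting, sketched below. This is the general technique described in Google's SRE Workbook, not necessarily what Zalando runs; the function names are hypothetical, and the default threshold of 14.4 is the Workbook's example value (roughly 2% of a 30-day budget spent within one hour).

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:  # assumed default, see lead-in
    """Page only if both windows burn budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening right now.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

Lowering `threshold` or tightening `slo_target` makes the alert more sensitive; raising either makes it quieter.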
52. Decouple Alerting/Reporting SLOs to get more value!
53. Incident Process
54. #5 Rule of Operations: Past Failures lead the way towards future Reliability.
55. Incident Process as Feedback Loop: Incident 🔥📟 → Post Mortem 📄🧐 → Review → Action Items ✅❌ → Improvements ⚙🔨, which reduce risk of the next Incident.
56. Zalando Incident Process: (1) Impact (2) Root Cause (3) Action Items
57. Zalando Severity Definitions. SEV1 (Ownership: Vice President), example incidents: ● Order Drop ● AWS Zone Outage. SEV2 (Ownership: Director), example incidents: ● Payments processor degraded ● Order confirmation emails delayed. SEV3 (Ownership: Head of Engineering), example incidents: ● Users don't receive voucher ● Lounge users see non-personalised articles
58. Incident Insights every Quarter: GMV Loss Distribution by Root Cause in Q?/20??
59. Weekly Operational Review Meeting
60. #6 Rule of Operations: You get what you inspect.
61. Reliability Reports Supporting WORM Meetings on all Levels. Auto Generated Google Doc. WORM Agenda: • Incident Review -> Patterns? • SLO Review • Open Post Mortems • On-Call Health
62. Zalando WORM Cascade. Diagram: the incident feedback loop (Incident 🔥📟, Post Mortem 📄🧐, Action Items ✅❌, Improvements ⚙🔨, "reduce risk of") extended by three review levels, WORM Meeting 🐛, WORM of WORM 🐛🐛 and Global WORM 🐛🐛🐛, connected by "reviews" and "feeds into" arrows.
63. Rules of Operations: (1) Obsess about User Experience. (2) Engineering for Reliability involves People & Technology. (3) Alert on User Pain ("Symptoms") not Server Pain ("Causes"). (4) SLIs quantify the reliability of a User Experience. (5) Past Failures lead the way towards future Reliability. (6) You get what you inspect. Thank you! > Heinrich@HeinrichHartmann.com #Let's talk Reliability! 💚
