A Field Guide to Reliability Engineering at Zalando

1. A Field Guide to Reliability Engineering at Zalando goto; Amsterdam 2024 - Heinrich Hartmann

2. 👋 I'm Heinrich - Reliability Engineer
Experience
• Senior Principal SRE (2021)
• Chief Data Scientist (2015)
• PhD in Mathematics (2011)
Talking Reliability since 2015
• SRECon - Statistics for Engineers
• DevOps Berlin - Zalando's quest to Operate 10K…
• SLOConf - The State of the Histogram
• P99 Conf - How to measure Latency
• FOSDEM - Latency SLOs Done Right
• Circllhist - A Histogram Data Structure… (arxiv)
Find me on heinrichhartmann.com, LinkedIn, X
goto; Amsterdam 2024. Heinrich Hartmann @ Zalando

3. Menu
   1. Principles
   2. Context
   3. Operations at Zalando
      a. Alerting
      b. Dashboards
      c. Observability
      d. Incident Process
      e. SLOs
      f. WORMs

4. Principles

5. Mission: Protect the User Experience from operational failures while keeping an eye on (1) Developer Productivity and (2) On-Call Health.

6. #1 Rule of Operations: Obsess about User Experience.

8. #2 Rule of Operations: Engineering for Reliability involves people as much as it involves technology.

9. Engineering Reliability at Scale
Small Company (~10 FTE) / Medium Company (~100 FTE) / Large Company (>1k FTE)
- Alerts & Dashboards
- Logging
- Incident Management
- Observability
- On-call rotations
- Playbooks
- WORM Meeting
- WORM Cascades
- Risk Management
- SRE Community & Guilds
People Problems / Technical Problems

10. Engineering Socio-Technological Systems with "Systems Theory" Example: Causal Loop Diagram - source: Wikipedia

11. ℅ Martin Thwaites @ Honeycomb GOTO 2024

12. Reliability "Flywheel" at Zalando

13. Context

14. Zalando in numbers
• One of the leading fashion platforms in the EU
• Founded in 2008
• 14.6 bn EUR Revenue / 50M+ active Customers
• 25 Countries
• 3K Tech Employees
• 3K+ Micro Services

15. Zalando Service Graph

16. Don't separate People and Technology
Conway's law: Technology Structures mirror People Structures.
Law of DevOps: You build it, you run it!

17. Systems Model of Zalando
Management: ~25 Directors
Engineering: 250 teams, 3.5k Applications
Platform: ~20 teams; k8s, Postgres, Kafka, ...; CI/CD, GHE, …; Telemetry Backends

18. Where do we stand?
+ Operating "transactional" Microservices
+ Protecting the Business
+ Preparing for High-Load Events
- Understanding User Experience
- Reliability of Data Systems / Business Processes

19. Operations at Zalando

20. Alerting

21. Why Alerting? Reduce Time to Detect user-facing issues.

22. Alerting as Feedback Loop
[Diagram labels: Normal Operation ⚙, Problem Occurred, Faulty Operation 🔥, self healing, ALERTING!, Anomaly! 📟🧐, Incident 🔨🧐, Mitigation]

23. #3 Rule of Operations: Alert on User Experience ("Symptoms") not on Server Experience ("Causes").
- Alert on error rates of user-facing "operations"
- Leverage SLO-based Alerting (if available)
- Don't alert on CPU Utilization
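Rule #3 can be illustrated with a small sketch. It assumes a hypothetical `Request` record and an arbitrary 5% threshold; neither comes from the talk, and the real Zalando alerting rules are not shown on the slide. The point is only that the alert condition is computed from what users see (failed requests on a named operation), not from host metrics.

```python
from dataclasses import dataclass


@dataclass
class Request:
    operation: str  # user-facing operation, e.g. "checkout" (hypothetical name)
    ok: bool        # True if the user received a successful response


def error_rate(requests, operation):
    """Fraction of failed requests for one user-facing operation."""
    relevant = [r for r in requests if r.operation == operation]
    if not relevant:
        return 0.0
    return sum(not r.ok for r in relevant) / len(relevant)


def should_alert(requests, operation, threshold=0.05):
    """Alert on a symptom (user-visible errors), not on a cause such as CPU."""
    # threshold is an illustrative assumption, not a recommended value
    return error_rate(requests, operation) > threshold


# 100 requests in the evaluation window, 10 of them failed
window = [Request("checkout", True)] * 90 + [Request("checkout", False)] * 10
print(should_alert(window, "checkout"))
```

Nothing in this check looks at CPU, memory, or any other server-side cause; those belong on dashboards rather than in paging rules.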

24. This is fine.

26. Adding alerts trades Reliability (+) against On-Call Health (-).

27. Review On-Call Health Weekly!

28. Dashboards

29. Why Dashboards?
• Reduce Time to Repair
• Look at them when you get alerted. Don't monitor dashboards.
• Starting point for understanding Service Health
• Every Application MUST have an Application Dashboard.
• Managed Services come with Managed Dashboards.

30. Managed Kubernetes Dashboard

31. Managed REDIS Dashboard

32. Managed JVM Internals Dashboard

33. Zalando Application Dashboard Guidelines (courtesy of Evgeni Sokolov & Miha Lunar)
   1. Golden Signals
   2. Entry Points
   3. Dependencies
   4. Saturation
   5. Operational Insights
   6. Storage

34. Golden Signals Row - RED(S): Requests, Errors, Duration, Saturation (w/ Evgeni Sokolov & Miha Lunar)

35. Entry Points Row: Golden Signals, again! - RED: Requests, Errors, Duration. Example entry points: POST /carts, POST /card-details (w/ Evgeni Sokolov & Miha Lunar)

36. Saturation Row: … everything that can get saturated. (w/ Evgeni Sokolov & Miha Lunar)

37. Observability

38. Why Observability?
• Reduce Time to Repair
• Debug failures across team boundaries
• Understand User-Experience
• Basis for Alerting, Dashboards, Reporting, …

39. Traditional Monitoring
[Diagram: four Teams, each with Logs and Metrics]
Is my application healthy? Which errors does it throw?

40. Observability
[Diagram: Traces across four Teams]
Is the user happy? Which operation is failing?

41. Example Trace from Zalando Front Page (Team "CIA", Application "CuCo")

42. Zalando Developer Observability Guidelines
   1. Use OpenTelemetry to instrument Applications.
   2. Use Distributed Tracing to understand system behavior in the context of transactions (e.g. HTTP requests).
   3. Metrics for precise counts & global resource statistics
   4. Structured Logging for Lifecycle events
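Guideline 4 (Structured Logging for Lifecycle events) could look roughly like the following sketch. It uses only the Python standard library; the `JsonFormatter` class, the `fields` key, and the `cart-service` name are illustrative assumptions, not part of Zalando's tooling.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical format)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            # structured key/value pairs passed via extra={"fields": {...}}
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload, sort_keys=True)


def make_logger(name, stream=None):
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = make_logger("cart-service")
# A lifecycle event: emitted once at startup, not once per request
log.info("application_started", extra={"fields": {"version": "1.2.3", "port": 8080}})
```

Per-request behaviour stays in traces and metrics (guidelines 2 and 3); logs are reserved for rare events such as start, shutdown, or configuration reloads.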

43. Monitor Reliability of Operations with "RED" Metrics. Operation: Reset Password. Panels: Requests, Errors, Duration
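As a minimal sketch of what the three RED panels measure, the following computes Requests (rate), Errors (ratio), and Duration (a nearest-rank p95) for one operation. The `Call` record, the `reset_password` operation name, and the choice of p95 are assumptions made for illustration; in practice these numbers would come from the telemetry backend.

```python
import math
from dataclasses import dataclass


@dataclass
class Call:
    operation: str
    duration_ms: float
    error: bool


def red_metrics(calls, operation, window_s):
    """Return request rate, error ratio, and p95 duration for one operation."""
    ops = [c for c in calls if c.operation == operation]
    if not ops:
        return None
    durations = sorted(c.duration_ms for c in ops)
    # nearest-rank percentile: smallest value with at least 95% of samples <= it
    rank = math.ceil(0.95 * len(durations))
    return {
        "requests_per_s": len(ops) / window_s,
        "error_ratio": sum(c.error for c in ops) / len(ops),
        "p95_duration_ms": durations[rank - 1],
    }


calls = [
    Call("reset_password", 100.0, False),
    Call("reset_password", 200.0, False),
    Call("reset_password", 300.0, True),
    Call("reset_password", 400.0, False),
]
print(red_metrics(calls, "reset_password", window_s=2))
```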

44. Observability SDKs based on OpenTelemetry

    #!/usr/bin/env python3
    import observability_sdk as obs

    # Hook up Zalando backends
    obs.initialize()

    # Custom span
    @obs.trace(name=..., attributes={...})
    def add_to_cart(): ...

    # Custom metric
    req_counter = obs.create_counter(
        name="total_requests",
        description="Total number of requests served",
        attributes={...},
        unit="1",
        value_type=int,
    )

    def handle_request():
        # Count every request served
        req_counter.inc()

45. SLOs

46. Why SLOs?
• Provide Top-Down understanding of Reliability provided to the user
• Steer engineering investments into Reliability
• Quantify impact of incidents
• … also derive high-quality alerting rules

47. #4 Rule of Operations: SLIs quantify the reliability of a User Experience. SLOs are Reliability targets for managerial steering.
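The distinction in Rule #4 can be made concrete with a short sketch: the SLI is a measured ratio of good events, the SLO is a target for that ratio, and the gap between them is the error budget. The function names and the example numbers below are illustrative assumptions, not Zalando figures.

```python
def sli(good_events, total_events):
    """SLI: measured fraction of good user-facing events."""
    return good_events / total_events


def error_budget_remaining(good_events, total_events, slo_target):
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Hypothetical month: 1,000,000 checkouts, 500 failed, target 99.9%
measured = sli(999_500, 1_000_000)
remaining = error_budget_remaining(999_500, 1_000_000, slo_target=0.999)
print(f"SLI={measured:.4%}, budget remaining={remaining:.0%}")
```

A number like "half the budget is left" is something management can steer on, which is the role the rule assigns to SLOs.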

48. Zalando SLOs on Business Operations

49. SLO Table Reviewed by Management

50. SLOs are used to Prioritize Engineering Investments

51. SLOs are also used to tune Alerting Sensitivity
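One common way to derive alert sensitivity from an SLO is burn-rate alerting, sketched below. This is a generic technique and not necessarily the mechanism Zalando uses; the slide itself only shows a chart. The 14.4 threshold follows the widely used convention of paging when 2% of a 30-day budget would be spent within one hour, and the two-window check is an assumption borrowed from that same convention.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only if both a long and a short window burn budget too fast.

    The short window makes the alert reset quickly once the problem is fixed.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


# Hypothetical 99.9% SLO: 2% errors burns budget about 20x too fast
print(should_page(0.02, 0.02, slo_target=0.999))
# 0.5% errors is only about 5x: worth a ticket, not a page
print(should_page(0.005, 0.005, slo_target=0.999))
```

Tightening or loosening the SLO target changes the burn rate for the same error ratio, which is how the target doubles as a sensitivity knob.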

52. Decouple Alerting/Reporting SLOs to get more value!

53. Incident Process

54. #5 Rule of Operations: Past Failures lead the way towards future Reliability.

55. Incident Process as Feedback Loop
Incident 🔥📟 → Post Mortem 📄🧐 → Review → Action Items ✅❌ → Improvements ⚙🔨 → reduce risk of → Incident

56. Zalando Incident Process 1. Impact 2. Root Cause 3. Action Items

57. Zalando Severity Definitions
SEV1 - Ownership: Vice President. Example Incidents: Order Drop; AWS Zone Outage
SEV2 - Ownership: Director. Example Incidents: Payments processor degraded; Order confirmation emails delayed
SEV3 - Ownership: Head of Engineering. Example Incidents: Users don't receive voucher; Lounge users see non-personalised articles

58. Incident Insights every Quarter: GMV Loss Distribution by Root Cause in Q?/20??

59. Weekly Operational Review Meeting (WORM)

60. #6 Rule of Operations: You get what you inspect.

61. Reliability Reports Supporting WORM Meetings on all Levels (Auto Generated Google Doc)
WORM Agenda
• Incident Review -> Patterns?
• SLO Review
• Open Post Mortems
• On-Call Health

62. Zalando WORM Cascade
Loop: Incident 🔥📟 → Post Mortem 📄🧐 → Action Items ✅❌ → Improvements ⚙🔨 → reduce risk of → Incident
Cascade: WORM Meeting 🐛 reviews Post Mortems and Action Items, feeds into WORM of WORM 🐛🐛, which feeds into Global WORM 🐛🐛🐛

63. Rules of Operations
   1. Obsess about User Experience.
   2. Engineering for Reliability involves People & Technology.
   3. Alert on User Pain ("Symptoms") not Server Pain ("Causes").
   4. SLIs quantify the reliability of a User Experience.
   5. Past Failures lead the way towards future Reliability.
   6. You get what you inspect.
Thank you! > Heinrich@HeinrichHartmann.com #Let's talk Reliability! 💚