ALERTING MONITORING

1. ALERTING MONITORING AND ALL THAT JAZZ Luis Mineiro @voidmaze SRE @ Zalando Coding Serbia, 16.05.2019

2. ZALANDO AT A GLANCE (as of March 2019)
- ~ 5.4 billion EUR revenue 2018
- > 300 million visits per month
- > 15,500 employees in Europe
- > 80% of visits via mobile devices
- > 400,000 product choices
- > 27 million active customers
- ~ 2,000 brands
- 17 countries

4. WE ARE CONSTANTLY INNOVATING TECHNOLOGY
- HOME-BREWED, CUTTING-EDGE & SCALABLE technology solutions
- > 2,000 employees at 8 international tech locations
- HQs in Berlin
- help our brand to WIN ONLINE

5. Looks familiar?

6. TERMINOLOGY
MONITORING: Collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times, and server lifetimes.
ALERT: A notification intended to be read by a human and that is pushed to a system such as a bug or ticket queue, an email alias, or a pager.
Source: SRE Book, Chapter 6: Monitoring Distributed Systems

7. MONITORING
Your monitoring system should address two questions: what’s broken, and why? The "what’s broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause. "What" versus "why" is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
Source: SRE Book, Chapter 6: Monitoring Distributed Systems

8. ALERTING CLASSIFICATION
Urgency | Name | Delivery
Will be addressed... eventually | Report | Dashboards or nowhere (/dev/null)
Predicted to fail "soon" | Ticket | An issue tracker or *cough*, Email
Urgently and actively get the attention of a specific human | Page | A pager, cell phone or something going *beep* *beep*
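
The classification above can be read as a lookup from urgency to alert name and delivery channel. The following is a minimal sketch of that reading, not something shown in the talk; the `Urgency` enum, the `CLASSIFICATION` table, and the channel strings are hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Dict, Tuple


class Urgency(Enum):
    """Hypothetical urgency levels mirroring the three rows of the slide."""
    EVENTUALLY = "will be addressed... eventually"
    SOON = "predicted to fail soon"
    NOW = "needs the attention of a specific human"


# Assumed mapping: urgency -> (alert name, delivery channel).
CLASSIFICATION: Dict[Urgency, Tuple[str, str]] = {
    Urgency.EVENTUALLY: ("Report", "dashboard"),
    Urgency.SOON: ("Ticket", "issue tracker"),
    Urgency.NOW: ("Page", "pager"),
}


def classify(urgency: Urgency) -> Tuple[str, str]:
    """Return the alert name and delivery channel for an urgency level."""
    return CLASSIFICATION[urgency]
```

For example, `classify(Urgency.SOON)` yields the "Ticket" row, so nothing that can wait ends up on a pager.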

9. WHAT TO ALERT ON
Alerting should be both hard failure–centric and human-centric.
Source: Distributed Systems Observability e-Book, Chapter 2: Monitoring and Observability
Symptoms are a better way to capture more problems more comprehensively and robustly with less effort - "symptom-based monitoring," in contrast to "cause-based monitoring".
Source: Rob Ewaschuk, "My Philosophy on Alerting"
Keep alerting simple, alert on symptoms. Aim to have as few alerts as possible, by alerting on symptoms that are associated with end-user pain rather than trying to catch every possible way that pain could be caused.
Source: Prometheus Best Practices, https://prometheus.io/docs/practices/alerting/

10. SERVICE LEVEL OBJECTIVES
You should pick SLOs that represent the most critical aspects of the user experience.
Source: Google Cloud Platform Blog, Building good SLOs - CRE life lessons
Start by thinking about (or finding out!) what your users care about, not what you can measure. Choose just enough SLOs to provide good coverage of your system’s attributes. Defend the SLOs you pick: if you can’t ever win a conversation about priorities by quoting a particular SLO, it’s probably not worth having that SLO.
Source: SRE Book, Chapter 4 - Service Level Objectives

14. ALERTING STRATEGY What to alert on: "hard failure–centric and human-centric" "symptom-based monitoring" "alert on symptoms" "symptoms that are associated with end-user pain"

16. ALERTING STRATEGY Service Level Objectives: "most critical aspects of the user experience" "what your users care about"

17. ALERTING STRATEGY
What to alert on: "hard failure–centric and human-centric" "symptom-based monitoring" "alert on symptoms" "symptoms that are associated with end-user pain"
=
Service Level Objectives: "most critical aspects of the user experience" "what your users care about"

19. ALERTING STRATEGY What to alert on: "Keep alerting simple" "Aim to have as few alerts as possible"

20. ALERTING STRATEGY Service Level Objectives: "just enough [...] to provide good coverage"

21. ALERTING STRATEGY
What to alert on: "Keep alerting simple" "Aim to have as few alerts as possible"
=
Service Level Objectives: "just enough SLOs to provide good coverage"

22. ALERTING STRATEGY Service Level Objective = Symptom + Threshold
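
One way to make "Symptom + Threshold" concrete is to model an SLO as a user-facing symptom paired with a target ratio of good events. This is a minimal sketch under that assumption; the `SLO` class, its field names, and the example symptom text are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical model: a symptom users feel, plus a threshold for it."""
    symptom: str      # e.g. "checkout requests succeed"
    threshold: float  # target fraction of good events, e.g. 0.999

    def is_met(self, good_events: int, total_events: int) -> bool:
        """Check whether the observed ratio of good events meets the target."""
        if total_events == 0:
            return True  # assumption: no traffic means nothing was violated
        return good_events / total_events >= self.threshold


checkout = SLO(symptom="checkout requests succeed", threshold=0.999)
```

With this model, `checkout.is_met(9995, 10000)` holds while `checkout.is_met(9980, 10000)` does not, without reference to any particular cause such as a failing node.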

23. ALERTING STRATEGY Page only when your SLO is missed or in danger of being missed
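
The slide does not say how to decide that an SLO is "in danger of being missed". A common approach, assumed here rather than taken from the talk, is to compare the error rate in a short window against the error budget (1 minus the target) multiplied by a burn-rate limit. The function names, the `(good, total)` window tuples, and the default limit of 10 are hypothetical.

```python
from typing import Tuple

Window = Tuple[int, int]  # (good events, total events) observed in a window


def error_rate(good: int, total: int) -> float:
    """Fraction of bad events; zero when there was no traffic."""
    return 0.0 if total == 0 else 1.0 - good / total


def should_page(target: float, long_window: Window, short_window: Window,
                burn_rate_limit: float = 10.0) -> bool:
    """Page when the SLO is missed, or is being burned fast enough to miss it.

    `burn_rate_limit` is an assumed tuning knob, not a value from the talk.
    """
    budget = 1.0 - target  # allowed fraction of bad events
    missed = error_rate(*long_window) > budget
    in_danger = error_rate(*short_window) > burn_rate_limit * budget
    return missed or in_danger
```

For a 99.9% target, a short window with 5% errors pages immediately even if the long window still looks healthy, while a short window with 0.1% errors does not.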

28. ALERTING CHECKLIST
  1. Does this rule detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?
  2. Will I ever be able to ignore this alert, knowing it’s benign?
  3. Does this alert definitely indicate that users are being negatively affected?
  4. Can I take action in response to this alert?
  5. Are other people getting paged for this issue?
Source: SRE Book, Chapter 6: Monitoring Distributed Systems
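
The checklist can be applied mechanically when reviewing a paging rule. The sketch below assumes that questions 1, 3, and 4 should be answered "yes" and questions 2 and 5 "no" for a rule to be worth a page; that reading, the `RuleReview` class, and its field names are illustrative assumptions rather than part of the talk.

```python
from dataclasses import dataclass


@dataclass
class RuleReview:
    """Hypothetical record of a reviewer's answers to the five questions."""
    detects_urgent_user_visible_condition: bool  # question 1
    can_be_ignored_as_benign: bool               # question 2
    definitely_affects_users: bool               # question 3
    can_take_action: bool                        # question 4
    others_already_paged: bool                   # question 5


def worth_paging(review: RuleReview) -> bool:
    """Assumed reading: 'yes' to 1, 3, 4 and 'no' to 2, 5 justifies a page."""
    return (review.detects_urgent_user_visible_condition
            and not review.can_be_ignored_as_benign
            and review.definitely_affects_users
            and review.can_take_action
            and not review.others_already_paged)
```

A rule that a reviewer expects to ignore sometimes (question 2 answered "yes") fails the check even if every other answer is favourable.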

29. ALERTING EXAMPLES "Load average is high"

30. ALERTING EXAMPLES "Cassandra node is down"

31. ALERTING EXAMPLES "EC2 instance is unhealthy"

32. CREDIT
The majority of these slides were inspired by, or contain references to, the excellent work of many industry experts and publications:
People:
- Rob Ewaschuk
- Björn Rabenstein
- Cindy Sridharan
- Charity Majors
- And many more...
Publications:
- Site Reliability Engineering (Book)
- The Site Reliability Workbook (Book)
- Distributed Systems Observability (e-Book)

33. ХВАЛА (THANK YOU). QUESTIONS?
Don't miss my next talk tomorrow at 11:30: "Are we all on the same page? Let's fix that"
Luis Mineiro @voidmaze
We're Hiring! https://jobs.zalando.com