ALERTING MONITORING
1. ALERTING
MONITORING
AND ALL THAT JAZZ
Luis Mineiro @voidmaze
SRE @ Zalando
Coding Serbia, 16.05.2019
2. ZALANDO AT A GLANCE
~ 5.4 billion EUR revenue 2018
> 300 million visits per month
> 15,500 employees in Europe
> 80% of visits via mobile devices
> 27 million active customers
> 400,000 product choices
~ 2,000 brands
17 countries
as of March 2019
3. as of March 2019
4. WE ARE CONSTANTLY INNOVATING TECHNOLOGY
HOME-BREWED, CUTTING-EDGE & SCALABLE technology solutions
> 2,000 employees at 8 international tech locations
HQs in Berlin
help our brand to WIN ONLINE
5. Looks familiar?
6. TERMINOLOGY
MONITORING
Collecting, processing, aggregating, and displaying real-time quantitative data about a system,
such as query counts and types, error counts and types, processing times, and server lifetimes.
ALERT
A notification intended to be read by a human and that is pushed to a system such as a bug or
ticket queue, an email alias, or a pager.
SRE Book, Chapter 6: Monitoring Distributed Systems
7. MONITORING
Your monitoring system should address two questions: what’s broken, and why?
The "what’s broken" indicates the symptom; the "why" indicates a (possibly intermediate)
cause.
"What" versus "why" is one of the most important distinctions in writing good monitoring with
maximum signal and minimum noise.
SRE Book, Chapter 6: Monitoring Distributed Systems
8. ALERTING CLASSIFICATION
Urgency | Name | Delivery
Will be addressed... eventually | Report | Dashboards or nowhere (/dev/null)
Predicted to fail "soon" | Ticket | An issue tracker or *cough*, Email
Urgently and actively get the attention of a specific human | Page | A pager, cell phone or something going *beep* *beep*
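The three urgency levels on this slide can be sketched as a small routing function. This is only an illustration: the names `AlertKind`, `DELIVERY`, and `classify`, and the two boolean inputs, are assumptions made for the example and are not from the talk.

```python
from enum import Enum


class AlertKind(Enum):
    REPORT = "report"
    TICKET = "ticket"
    PAGE = "page"


# Delivery channel per alert kind, paraphrasing the slide's three rows.
DELIVERY = {
    AlertKind.REPORT: "dashboard (or nowhere)",
    AlertKind.TICKET: "issue tracker",
    AlertKind.PAGE: "pager",
}


def classify(needs_human_now: bool, predicted_to_fail_soon: bool) -> AlertKind:
    """Pick the least disruptive alert kind that still matches the urgency."""
    if needs_human_now:
        return AlertKind.PAGE
    if predicted_to_fail_soon:
        return AlertKind.TICKET
    return AlertKind.REPORT
```

Checking the most urgent case first means a condition that is both urgent and predictive still ends up as a page rather than a ticket.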
9. WHAT TO ALERT ON
Alerting should be both hard failure–centric and human-centric.
Distributed Systems Observability e-Book, Chapter 2: Monitoring and Observability
Symptoms are a better way to capture more problems more comprehensively and robustly with
less effort - "symptom-based monitoring," in contrast to "cause-based monitoring".
Rob Ewaschuk, "My Philosophy on Alerting"
Keep alerting simple, alert on symptoms. Aim to have as few alerts as possible, by alerting
on symptoms that are associated with end-user pain rather than trying to catch every possible
way that pain could be caused.
Prometheus Best Practices, https://prometheus.io/docs/practices/alerting/
10. SERVICE LEVEL OBJECTIVES
You should pick SLOs that represent the most critical aspects of the user experience.
Google Cloud Platform Blog, Building good SLOs - CRE life lessons
Start by thinking about (or finding out!) what your users care about, not what you can
measure.
Choose just enough SLOs to provide good coverage of your system’s attributes. Defend the
SLOs you pick: if you can’t ever win a conversation about priorities by quoting a particular SLO,
it’s probably not worth having that SLO.
SRE Book, Chapter 4 - Service Level Objectives
14. ALERTING STRATEGY
What to alert on:
"hard failure–centric and human-centric"
"symptom-based monitoring"
"alert on symptoms"
"symptoms that are associated with end-user pain"
16. ALERTING STRATEGY
Service Level Objectives:
"most critical aspects of the user experience"
"what your users care about"
17. ALERTING STRATEGY
"hard failure–centric and human-centric"
"symptom-based monitoring"
"alert on symptoms"
"symptoms that are associated with end-user pain"
=
"most critical aspects of the user experience"
"what your users care about"
19. ALERTING STRATEGY
What to alert on:
"Keep alerting simple"
"Aim to have as few alerts as possible"
20. ALERTING STRATEGY
Service Level Objectives:
"just enough [...] to provide good coverage"
21. ALERTING STRATEGY
"Keep alerting simple"
"Aim to have as few alerts as possible"
=
"just enough SLOs to provide good coverage"
22. ALERTING STRATEGY
Service Level Objective = Symptom + Threshold
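One way to read "Symptom + Threshold" is as a small data structure: a user-visible symptom paired with a target ratio of good events. The sketch below is an assumption about how that might look; the class `SLO`, its fields, the `checkout` example, and the 0.999 target are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    symptom: str      # what the user experiences (hypothetical wording)
    threshold: float  # target fraction of good events, e.g. 0.999

    def is_met(self, good_events: int, total_events: int) -> bool:
        """Return True when the observed success ratio reaches the threshold."""
        if total_events == 0:
            # Assumption: with no traffic, nothing was violated.
            return True
        return good_events / total_events >= self.threshold


# Hypothetical example: 99.9% of checkout requests should succeed.
checkout = SLO(symptom="checkout requests succeed", threshold=0.999)
```

For instance, `checkout.is_met(9995, 10000)` holds, while `checkout.is_met(9980, 10000)` does not.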
23. ALERTING STRATEGY
Page only when your SLO is missed
or in danger of being missed
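"In danger of being missed" can be made concrete with an error-budget burn rate, an approach described in The Site Reliability Workbook. The following is a minimal sketch under stated assumptions: the function names and the default `max_burn_rate` of 10.0 are illustrative choices, not values from the talk.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Speed of error-budget consumption; about 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target          # allowed fraction of bad events
    observed_error_ratio = bad_events / total_events
    return observed_error_ratio / error_budget


def should_page(bad_events: int, total_events: int, slo_target: float,
                max_burn_rate: float = 10.0) -> bool:
    """Page when the budget burns at least max_burn_rate times too fast.

    The default of 10.0 is an arbitrary illustrative value.
    """
    return burn_rate(bad_events, total_events, slo_target) >= max_burn_rate
```

With a 99.9% target, 20 failures in 1,000 requests burns the budget roughly twenty times too fast and would page; 1 failure in 1,000 is about on budget and would not.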
28. ALERTING CHECKLIST
1. Does this rule detect an otherwise undetected condition that is urgent, actionable, and
actively or imminently user-visible?
2. Will I ever be able to ignore this alert, knowing it’s benign?
3. Does this alert definitely indicate that users are being negatively affected?
4. Can I take action in response to this alert?
5. Are other people getting paged for this issue?
SRE Book, Chapter 6: Monitoring Distributed Systems
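The five questions can be recorded per alerting rule and combined into a single decision. This is a sketch only: treating every question as a hard gate is an assumption made here, and `RuleReview` and `deserves_page` are hypothetical names, not something from the talk or the SRE Book.

```python
from dataclasses import dataclass


@dataclass
class RuleReview:
    detects_urgent_actionable_user_visible: bool  # question 1
    sometimes_ignorable_as_benign: bool           # question 2
    definitely_hurts_users: bool                  # question 3
    can_take_action: bool                         # question 4
    others_already_paged: bool                    # question 5


def deserves_page(review: RuleReview) -> bool:
    """Assumption: a rule pages only if every question has the 'good' answer."""
    return (
        review.detects_urgent_actionable_user_visible
        and not review.sometimes_ignorable_as_benign
        and review.definitely_hurts_users
        and review.can_take_action
        and not review.others_already_paged
    )
```

A rule that fails any single question would then be a candidate for a ticket or a report rather than a page.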
29. ALERTING EXAMPLES
"Load average is high"
30. ALERTING EXAMPLES
"Cassandra node is down"
31. ALERTING EXAMPLES
"EC2 instance is unhealthy"
32. CREDIT
The majority of these slides were inspired by, or contain references to, the excellent work of
many industry experts and publications:
People:
- Rob Ewaschuk
- Björn Rabenstein
- Cindy Sridharan
- Charity Majors
- And many more...
Publications:
- Site Reliability Engineering (Book)
- The Site Reliability Workbook (Book)
- Distributed Systems Observability (e-Book)
33. ХВАЛА (THANK YOU)
QUESTIONS?
Don't miss my next talk tomorrow at 11:30
"Are we all on the same page? Let's fix that"
Luis Mineiro @voidmaze
We're Hiring!
https://jobs.zalando.com