ARE WE ALL ON THE SAME PAGE?
1. ARE WE ALL ON
THE SAME PAGE?
LET'S FIX THAT
Luis Mineiro @voidmaze
SRE @ Zalando
SREcon EMEA 2019
2. ZALANDO AT A GLANCE
● ~ 5.4 billion EUR revenue 2018
● > 300 million visits per month
● > 15,500 employees in Europe
● > 80% of visits via mobile devices
● > 400,000 product choices
● > 27 million active customers
● ~ 2,000 brands
● 17 countries
as of October 2019
3. as of October 2019
4. Photo by Dawn Armfield on Unsplash
5. THE AGE OF THE MONOLITH
Request
Single, large boxes
that did everything
Jimmy
The Monolith
Response
6. MONITORING THE MONOLITH
Ops Monitoring
● Is the box alive?
● Is the monolith process up?
Devs Monitoring
● Are requests returning errors?
● Are requests reasonably fast?
Photo by Deneen LT on Pexels
7. MODERN MICROSERVICES ARCHITECTURES
Amazon internal service dependency visualization
8. EXAMPLE - PLACING AN ORDER
[Diagram: service dependency graph for placing an order, with these nodes:]
● Web Frontend
● Checkout Service
● Payment Gateway
● Payment Service
● Customer
● Typical Payment Blackbox
● Risk Service
● Another Shady Service
● Logistics Service
● Stock Reservation Service
● A Queue of Sorts
● Coupon Service
● Machine Learning Shenanigans
● Order Service
● Random BI Service
● Accounting Service
9. MONITORING MICROSERVICES
"DevOps" Monitoring
● Is the box alive?
● Is the micro-service process up?
● Are requests returning errors?
● Are requests reasonably fast?
Photo by Antoine Plüss on Unsplash
10. FAILURE PLACING AN ORDER
[Diagram: the service graph from slide 8. A customer (👤) places an order and something fails (🤔) somewhere in the graph.]
11. ALERTS ON FAILURE PLACING AN ORDER
[Diagram: the service graph from slide 8. The failed order (⚠) triggers an alert on most of the services in the graph at once.]
12. ALERTS ON FAILURE PLACING AN ORDER
[Diagram: same as slide 11.]
Photo by Antoine Plüss on Unsplash
13. SYMPTOM BASED ALERTING RULE
[Diagram: the service graph from slide 8, with a callout pointing at one spot in the graph:]
Good signal-to-noise ratio. Create an alert rule "here".
14. ALERT ON THE SYMPTOM
[Diagram: the service graph from slide 8. A customer (👤) places an order and something fails (🤔) somewhere in the graph.]
15. ALERT ON THE SYMPTOM
[Diagram: the service graph from slide 8, with the failure marked (⚠) and one alert marker.]
Single alert triggered.
16. ALERT ON THE SYMPTOM - DIFFERENT ISSUE
[Diagram: the service graph from slide 8. A different failure (🤔) occurs elsewhere in the graph.]
17. ALERT ON THE SYMPTOM - DIFFERENT ISSUE
[Diagram: the service graph from slide 8, with the different failure marked (⚠) and one alert marker.]
Single alert triggered.
18. PLACING AN ORDER - ALERT BOMBING
[Diagram: the service graph from slide 8, with the failure marked (⚠) and one alert marker.]
Single alert triggered.
19. ALERTING FOR MICROSERVICES
20. ADAPTIVE PAGING
Adaptive Paging is an alert handler
that leverages the causality from tracing
and OpenTracing's semantic conventions
to page the team closest to the problem.
21. DISTRIBUTED TRACING AND OPENTRACING
● A trace tells the story of a transaction or workflow as it propagates through a
distributed system.
● It's basically a directed acyclic graph (DAG), with a clear start and a clear end - no
loops.
● A trace is made up of spans representing contiguous segments of work in that trace.
● OpenTracing is a set of vendor-neutral APIs and a code instrumentation standard for
distributed tracing.
22. DISTRIBUTED TRACING AND OPENTRACING → OPENTELEMETRY
● A trace tells the story of a transaction or workflow as it propagates through a
distributed system.
● It's basically a directed acyclic graph (DAG), with a clear start and a clear end - no
loops.
● A trace is made up of spans representing contiguous segments of work in that trace.
● OpenTelemetry is made up of an integrated set of APIs and libraries as well as a
collection mechanism via an agent and collector. It also does distributed tracing
OpenTracing + OpenCensus = OpenTelemetry
23. OPENTRACING CONCEPTS
Span: a named operation which records its duration, usually a remote procedure call, with
optional Tags and Logs.
24. OPENTRACING CONCEPTS
Tag: a "mostly" arbitrary Key:Value pair (the value can be a string, number, or bool).
25. OPENTRACING SEMANTIC CONVENTIONS
Span tag name   Type     Notes and examples
component       string   The software package, framework, library, or module that generated the associated Span. E.g., "checkout-service".
error           bool     true if and only if the application considers the operation represented by the Span to have failed.
peer.service    string   Remote service name (for some unspecified definition of "service"). E.g., "accounting-service".
span.kind       string   Either "client" or "server" for the appropriate roles in an RPC.
… and more
Source: OpenTracing semantic conventions
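To make the tag conventions concrete, here is a minimal sketch of a span carrying those tags. It uses a plain Python dataclass rather than any real tracing library; the `Span` class, its `failed` helper, and the `book_invoice` operation name are hypothetical and only serve as illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Tag values are limited to strings, numbers and bools.
TagValue = Union[str, int, float, bool]


@dataclass
class Span:
    """Hypothetical in-memory span, not the API of a real tracer."""
    operation: str
    tags: Dict[str, TagValue] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    def failed(self) -> bool:
        # By convention, error=true marks an operation the application considers failed.
        return self.tags.get("error") is True


# A client-side span recorded by the checkout service while calling accounting.
call_to_accounting = Span(
    operation="book_invoice",  # assumed operation name
    tags={
        "component": "checkout-service",
        "span.kind": "client",
        "peer.service": "accounting-service",
        "error": True,
    },
)

print(call_to_accounting.failed())
```

A span without an `error` tag is treated as successful in this sketch.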
26. OPENTRACING MONITORING SIGNALS
Latency
Failed operation (error=true)
The Four Golden Signals
SRE Book, Chapter 6: Monitoring Distributed Systems
27. ERROR RATE ALERTING RULE
Alert triggered.
component: checkout_service && operation: place_order
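A rule like this can be sketched as an error-rate check over spans that match the component and operation. The dictionary layout of a span, the `error_rate` and `should_alert` helpers, and the 5% threshold below are all assumptions for illustration; the slides do not state how the rule is evaluated.

```python
from typing import Dict, Iterable, List


def error_rate(spans: Iterable[Dict[str, object]], component: str, operation: str) -> float:
    """Share of matching spans tagged error=true (0.0 if nothing matches)."""
    matching: List[Dict[str, object]] = [
        s for s in spans
        if s.get("component") == component and s.get("operation") == operation
    ]
    if not matching:
        return 0.0
    failed = sum(1 for s in matching if s.get("error") is True)
    return failed / len(matching)


def should_alert(spans: Iterable[Dict[str, object]], component: str,
                 operation: str, threshold: float = 0.05) -> bool:
    # The threshold is an assumed value, not one taken from the talk.
    return error_rate(spans, component, operation) > threshold


# Ten place_order spans, two of them failed, plus an unrelated operation.
spans = (
    [{"component": "checkout_service", "operation": "place_order"}] * 8
    + [{"component": "checkout_service", "operation": "place_order", "error": True}] * 2
    + [{"component": "checkout_service", "operation": "view_cart", "error": True}]
)

rate = error_rate(spans, "checkout_service", "place_order")
alert = should_alert(spans, "checkout_service", "place_order")
print(rate, alert)
```

Only spans matching both the component and the operation count towards the rate, so failures in unrelated operations do not trigger this rule.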
28. ALERT PAYLOAD
29. WALKING THROUGH A TRACE
1. Starting at the span which was
defined as the signal -
place_order
30. WALKING THROUGH A TRACE
1. Starting at the span which was
defined as the signal -
place_order
2. Inspect every child span's tags
3. Follow path with error=true
31. WALKING THROUGH A TRACE
1. Starting at the span which was
defined as the signal -
place_order
2. Inspect every child span's tags
3. Follow path with error=true
4. Rinse and repeat until no more
children
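The walk described in these steps can be sketched as a simple loop over a span tree. This is an illustrative sketch, not the actual Adaptive Paging implementation: the `Span` class, the `find_probable_cause` function, and the operation and service names in the example trace are hypothetical, and the sketch follows only the first failing child at each level.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """Hypothetical in-memory span with tags and child spans."""
    operation: str
    tags: Dict[str, object] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


def find_probable_cause(signal: Span) -> Span:
    """Start at the signal span and follow error=true children to the end."""
    current = signal
    while True:
        # Step 2: inspect every child span's tags.
        failing = [c for c in current.children if c.tags.get("error") is True]
        if not failing:
            # Step 4: no more failing children, so this span is the deepest failure.
            return current
        # Step 3: follow the path with error=true (first one, in this sketch).
        current = failing[0]


# Example trace: place_order fails because a call deep in the graph fails.
trace = Span("place_order", {"component": "checkout-service", "error": True}, [
    Span("reserve_stock", {"component": "stock-reservation-service"}),
    Span("create_order", {"component": "order-service", "error": True}, [
        Span("book_invoice", {"component": "accounting-service", "error": True}),
    ]),
])

cause = find_probable_cause(trace)
print(cause.tags["component"])
```

The component of the returned span is what a handler could map to a team to page.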
32. ALERT ON THE SYMPTOM
[Diagram: the service graph from slide 8, with the alerting rule marked as the "Signal". A customer (👤) places an order and something fails (🤔) somewhere in the graph.]
33. ALERT ON THE SYMPTOM
[Diagram: the service graph from slide 8, with the failure marked (⚠) and one page marker.]
Single page dispatched to the team operating the Accounting Service.
34. ALERT ON THE SYMPTOM - DIFFERENT ISSUE
[Diagram: the service graph from slide 8, with the alerting rule marked as the "Signal". A different failure (🤔) occurs elsewhere in the graph.]
35. ALERT ON THE SYMPTOM - DIFFERENT ISSUE
[Diagram: the service graph from slide 8, with the different failure marked (⚠) and one page marker.]
Single page dispatched to the team operating the Payment Service.
36. ADAPTIVE PAGING
37. CHALLENGES
● Multiple child spans with error=true:
○ Follow each path, attribute the probable cause a score
○ Analyze more exemplars and adjust the scores
○ Worst case scenario, page both probable causes
● Missing instrumentation or circuit breaker open
○ Use the peer.service and span.kind=client tag to suggest which
dependency would be the target
● Mapping services to escalation
○ Owning team may not have their own on-call escalation. Fall back to closest
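The first two challenges can be sketched together: follow every failing path, fall back to `peer.service` when a failing client span has no children, and score candidates across several exemplar traces. Scoring by simple counting is an assumption made here for illustration, as are the `Span` class, the function names, and the example operations; the talk does not specify how scores are computed.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    """Hypothetical in-memory span with tags and child spans."""
    operation: str
    tags: Dict[str, object] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


def suspect(span: Span) -> str:
    """Name the service to blame for a span with no failing children."""
    # A client span with no children hints at missing instrumentation or an
    # open circuit breaker, so suggest the dependency named in peer.service.
    if span.tags.get("span.kind") == "client" and "peer.service" in span.tags:
        return str(span.tags["peer.service"])
    return str(span.tags.get("component", "unknown"))


def probable_causes(span: Span) -> List[str]:
    """Follow every error=true path and collect the suspect at each end."""
    failing = [c for c in span.children if c.tags.get("error") is True]
    if not failing:
        return [suspect(span)]
    causes: List[str] = []
    for child in failing:
        causes.extend(probable_causes(child))
    return causes


def score(exemplars: List[Span]) -> Counter:
    """Assumed scoring: count how often each suspect ends a failing path."""
    counts: Counter = Counter()
    for root in exemplars:
        counts.update(probable_causes(root))
    return counts


accounting_call = Span("book_invoice", {
    "component": "checkout-service", "span.kind": "client",
    "peer.service": "accounting-service", "error": True,
})
payment_call = Span("authorize", {
    "component": "payment-service", "span.kind": "server", "error": True,
})
exemplars = [
    Span("place_order", {"component": "checkout-service", "error": True},
         [accounting_call, payment_call]),
    Span("place_order", {"component": "checkout-service", "error": True},
         [accounting_call]),
]

scores = score(exemplars)
print(scores.most_common())
```

With more exemplars the counts separate further; if two suspects stay close, the worst case from the slide applies and both would be paged.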
38. CONCLUSION
Photo by Patrick Tomasso on Unsplash
39. THANK YOU
QUESTIONS?
Luis Mineiro @voidmaze
We're Hiring!
https://jobs.zalando.com