ARE WE ALL ON THE SAME PAGE?

1. ARE WE ALL ON THE SAME PAGE? LET'S FIX THAT Luis Mineiro @voidmaze SRE @ Zalando SREcon EMEA 2019
2. ZALANDO AT A GLANCE ● ~ 5.4 billion EUR revenue 2018 ● > 300 million visits per month ● > 15,500 employees in Europe ● > 80% of visits via mobile devices ● > 400,000 product choices ● > 27 million active customers ● ~ 2,000 brands ● 17 countries (as of October 2019)
3. as of October 2019
4. Photo by Dawn Armfield on Unsplash
5. THE AGE OF THE MONOLITH Request Single, large boxes that did everything Jimmy The Monolith Response
6. MONITORING THE MONOLITH Ops Monitoring ● Is the box alive? ● Is the monolith process up? Devs Monitoring ● Are requests returning errors? ● Are requests reasonably fast? Photo by Deneen LT on Pexels
7. MODERN MICROSERVICES ARCHITECTURES Amazon internal service dependency visualization
8. EXAMPLE - PLACING AN ORDER Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
9. MONITORING MICROSERVICES "DevOps" Monitoring ● Is the box alive? ● Is the micro-service process up? ● Are requests returning errors? ● Are requests reasonably fast? Photo by Antoine Plüss on Unsplash
10. FAILURE PLACING AN ORDER 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service 🤔
11. ALERTS ON FAILURE PLACING AN ORDER 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service A Queue of Sorts Order Service Another Shady Service Logistics Service Coupon Service Stock Reservation Service Machine Learning Shenanigans Random BI Service Accounting Service ⚠
12. ALERTS ON FAILURE PLACING AN ORDER 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service A Queue of Sorts Order Service Another Shady Service Logistics Service Coupon Service Stock Reservation Service Machine Learning Shenanigans Random BI Service Accounting Service Photo by Antoine Plüss on Unsplash ⚠
13. SYMPTOM BASED ALERTING RULE 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Good signal to noise ratio. Create an alert rule "here" Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
14. ALERT ON THE SYMPTOM 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service 🤔
15. ALERT ON THE SYMPTOM 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Single alert triggered Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service ⚠
16. ALERT ON THE SYMPTOM - DIFFERENT ISSUE 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service 🤔 Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
17. ALERT ON THE SYMPTOM - DIFFERENT ISSUE 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Single alert triggered Risk Service ⚠ Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
18. PLACING AN ORDER - ALERT BOMBING 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Single alert triggered Risk Service ⚠ Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
19. ALERTING FOR MICROSERVICES
20. ADAPTIVE PAGING Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem.
21. DISTRIBUTED TRACING AND OPENTRACING ● A trace tells the story of a transaction or workflow as it propagates through a distributed system. ● It's basically a directed acyclic graph (DAG), with a clear start and a clear end - no loops. ● A trace is made up of spans representing contiguous segments of work in that trace. ● OpenTracing is a set of vendor-neutral APIs and a code instrumentation standard for distributed tracing.
22. DISTRIBUTED TRACING AND OPENTRACING OPENTELEMETRY ● A trace tells the story of a transaction or workflow as it propagates through a distributed system. ● It's basically a directed acyclic graph (DAG), with a clear start and a clear end - no loops. ● A trace is made up of spans representing contiguous segments of work in that trace. ● OpenTelemetry is made up of an integrated set of APIs and libraries as well as a collection mechanism via an agent and collector. It also does distributed tracing.
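The idea that a trace is a loop-free graph of spans can be illustrated with a small sketch. It assumes each span is reported as a flat record carrying its own id and its parent's id; the `SpanRecord` type, the helper names, and the operation names are hypothetical and not taken from the talk or from any tracing library.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpanRecord:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    operation: str


def build_children_index(spans):
    """Group span ids by parent id, giving the edges of the trace graph."""
    children = defaultdict(list)
    for span in spans:
        if span.parent_id is not None:
            children[span.parent_id].append(span.span_id)
    return dict(children)


def is_acyclic(spans):
    """Return True if following parent links from any span never revisits a span."""
    parent_of = {s.span_id: s.parent_id for s in spans}
    for start in parent_of:
        seen = set()
        current = start
        while current is not None:
            if current in seen:
                return False
            seen.add(current)
            # A parent that was never reported ends the walk as well.
            current = parent_of.get(current)
    return True


# Hypothetical trace: one root span with two children and one grandchild.
trace = [
    SpanRecord("a", None, "place_order"),
    SpanRecord("b", "a", "authorize_payment"),
    SpanRecord("c", "a", "reserve_stock"),
    SpanRecord("d", "b", "check_risk"),
]
children = build_children_index(trace)
```

The `children` index is what a trace walk would traverse, starting from the root span.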
23. OPENTRACING CONCEPTS Span: a named operation, usually a remote procedure call, that records its duration and can carry optional Tags and Logs.
24. OPENTRACING CONCEPTS Tag: a "mostly" arbitrary Key:Value pair (the value can be a string, number or bool).
25. OPENTRACING SEMANTIC CONVENTIONS (span tag name, type, notes and examples) ● component (string): The software package, framework, library, or module that generated the associated Span. E.g., "checkout-service". ● error (bool): true if and only if the application considers the operation represented by the Span to have failed. ● peer.service (string): Remote service name (for some unspecified definition of "service"). E.g., "accounting-service". ● span.kind (string): Either "client" or "server" for the appropriate roles in an RPC. ● … and more. Source: OpenTracing semantic conventions
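To make the tags above concrete, here is a minimal sketch of a span that carries them. The `Span` class and its helper methods are illustrative assumptions, not the OpenTracing API; only the tag names (`component`, `error`, `peer.service`, `span.kind`) come from the semantic conventions listed on the slide, and the operation name is made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

# Tag values can be a string, a number or a bool.
TagValue = Union[str, int, float, bool]


@dataclass
class Span:
    operation: str
    duration_ms: float
    tags: Dict[str, TagValue] = field(default_factory=dict)

    def failed(self) -> bool:
        """True only when the application explicitly tagged the span with error=true."""
        return self.tags.get("error") is True

    def remote_dependency(self) -> Optional[TagValue]:
        """Return peer.service for client spans, otherwise None."""
        if self.tags.get("span.kind") == "client":
            return self.tags.get("peer.service")
        return None


# Hypothetical client span: checkout-service calling accounting-service and failing.
client_span = Span(
    operation="book_invoice",
    duration_ms=42.0,
    tags={
        "component": "checkout-service",
        "span.kind": "client",
        "peer.service": "accounting-service",
        "error": True,
    },
)
```

A span without an `error` tag is treated as successful here, which matches the "true if and only if" wording of the convention.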
26. OPENTRACING MONITORING SIGNALS ● Latency ● Failed operation (error=true) Reference: The Four Golden Signals, SRE Book, Chapter 6: Monitoring Distributed Systems
27. ERROR RATE ALERTING RULE Alert triggered. component: checkout_service && operation: place_order
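An error rate rule of this shape might be evaluated roughly as below. This is a sketch under assumptions: spans are plain dictionaries, the 5% threshold is an invented value, and the function names are hypothetical. Only the `component: checkout_service && operation: place_order` filter and the `error` tag come from the slides.

```python
def error_rate(spans, component, operation):
    """Share of matching spans that are tagged error=true (0.0 if none match)."""
    matching = [
        s for s in spans
        if s.get("component") == component and s.get("operation") == operation
    ]
    if not matching:
        return 0.0
    failed = sum(1 for s in matching if s.get("error") is True)
    return failed / len(matching)


def should_alert(spans, component="checkout_service", operation="place_order",
                 threshold=0.05):
    """Trigger when the error rate of the signal span exceeds the assumed threshold."""
    return error_rate(spans, component, operation) > threshold


# Hypothetical window of spans: 10 place_order calls, 2 of them failed.
window = (
    [{"component": "checkout_service", "operation": "place_order", "error": True}] * 2
    + [{"component": "checkout_service", "operation": "place_order"}] * 8
    # Errors on other operations do not count towards this rule.
    + [{"component": "checkout_service", "operation": "get_cart", "error": True}] * 5
)
rate = error_rate(window, "checkout_service", "place_order")
alert = should_alert(window)
```

Scoping the rule to one component and one operation is what keeps it a symptom-level signal: failures elsewhere only matter once they surface in `place_order`.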
28. ALERT PAYLOAD
29. WALKING THROUGH A TRACE 1. Starting at the span which was defined as the signal - place_order
30. WALKING THROUGH A TRACE 1. Starting at the span which was defined as the signal - place_order 2. Inspect every child span's tags 3. Follow path with error=true
31. WALKING THROUGH A TRACE 1. Starting at the span which was defined as the signal - place_order 2. Inspect every child span's tags 3. Follow path with error=true 4. Rinse and repeat until no more children
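The four steps above can be sketched as a short loop. This is not the actual Adaptive Paging implementation; the `Span` type, the `walk_error_path` name, and the example service relationships are assumptions, and the sketch simply follows the first failing child when several exist.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Span:
    component: str
    operation: str
    tags: Dict[str, object] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


def walk_error_path(signal_span):
    """Follow error=true children from the signal span until no failing child remains."""
    path = [signal_span]
    current = signal_span
    while True:
        # Step 2: inspect every child span's tags.
        failing = [c for c in current.children if c.tags.get("error") is True]
        if not failing:
            # Step 4: no more failing children, the walk ends here.
            return path
        # Step 3: follow the path with error=true (simplification: first match).
        current = failing[0]
        path.append(current)


# Hypothetical trace for a failed place_order.
invoice = Span("accounting-service", "book_invoice", {"error": True})
order = Span("order-service", "create_order", {"error": True}, [invoice])
payment = Span("payment-service", "authorize_payment")
signal = Span("checkout-service", "place_order", {"error": True}, [payment, order])

path = walk_error_path(signal)
suspect = path[-1].component
```

In this example the last span on the path belongs to `accounting-service`, so that service's team would be the candidate to page.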
32. ALERT ON THE SYMPTOM 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Another Shady Service Signal Risk Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service 🤔
33. ALERT ON THE SYMPTOM 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Single page dispatched to the team operating the Accounting Service Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service ⚠
34. ALERT ON THE SYMPTOM - DIFFERENT ISSUE 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox 🤔 Another Shady Service Signal Risk Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
35. ALERT ON THE SYMPTOM - DIFFERENT ISSUE 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Single page dispatched to the team operating the Payment Service Risk Service ⚠ Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
36. ADAPTIVE PAGING
37. CHALLENGES ● Multiple child spans with error=true: ○ Follow each path, attribute the probable cause a score ○ Analyze more exemplars and adjust the scores ○ Worst case scenario, page both probable causes ● Missing instrumentation or circuit breaker open: ○ Use the peer.service and span.kind=client tags to suggest which dependency would be the target ● Mapping services to escalation: ○ Owning team may not have their own on-call escalation. Fall back to the closest one
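One possible way to handle these three challenges together is sketched below. Everything here is an assumption for illustration: spans are nested dictionaries, a score is simply the number of exemplar traces in which a service ends an error path, and "closest" is read as the nearest service back towards the signal that has an escalation. The talk does not specify the scoring or the fallback order.

```python
from collections import Counter


def leaf_candidates(span):
    """Return the suspected service at the end of every error=true path below span."""
    failing = [c for c in span.get("children", [])
               if c.get("tags", {}).get("error") is True]
    if not failing:
        tags = span.get("tags", {})
        # Missing instrumentation or open circuit breaker: a client span with no
        # children still names the dependency it tried to reach in peer.service.
        if tags.get("span.kind") == "client" and "peer.service" in tags:
            return [tags["peer.service"]]
        return [tags.get("component", "unknown")]
    found = []
    for child in failing:
        found.extend(leaf_candidates(child))
    return found


def score_candidates(exemplar_traces):
    """Count, per service, how many exemplar traces end an error path there."""
    scores = Counter()
    for root in exemplar_traces:
        for service in set(leaf_candidates(root)):
            scores[service] += 1
    return scores


def resolve_escalation(path_services, escalations):
    """Walk from the suspect back towards the signal; return the first escalation found."""
    for service in reversed(path_services):
        if service in escalations:
            return escalations[service]
    return None


def err(component, children=(), **extra):
    """Build a hypothetical failing span as a nested dictionary."""
    return {"tags": {"component": component, "error": True, **extra},
            "children": list(children)}


accounting = err("accounting-service")
order = err("order-service", [accounting])
# Client span in checkout-service whose callee reported no span of its own.
payment_call = err("checkout-service",
                   **{"span.kind": "client", "peer.service": "payment-service"})
trace_a = err("checkout-service", [payment_call, order])
trace_b = err("checkout-service", [order])

scores = score_candidates([trace_a, trace_b])
team = resolve_escalation(
    ["checkout-service", "order-service", "accounting-service"],
    {"checkout-service": "team-checkout", "order-service": "team-orders"},
)
```

With these two exemplars `accounting-service` scores higher than `payment-service`, and because it has no escalation of its own in the example mapping, the page falls back to the team of the nearest upstream service.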
38. CONCLUSION Photo by Patrick Tomasso on Unsplash
39. THANK YOU QUESTIONS? Luis Mineiro @voidmaze We're Hiring! https://jobs.zalando.com
