ARE WE ALL ON THE SAME PAGE?

1. ARE WE ALL ON THE SAME PAGE? LET'S FIX THAT Luis Mineiro @voidmaze SRE @ Zalando Coding Serbia, 17.05.2019
2. ZALANDO AT A GLANCE ● ~ 5.4 billion EUR revenue 2018 ● > 300 million visits per month ● > 15,500 employees in Europe ● > 80% of visits via mobile devices ● > 400,000 product choices ● > 27 million active customers ● ~ 2,000 brands ● 17 countries (as of March 2019)
3. as of March 2019
4. WE ARE CONSTANTLY INNOVATING TECHNOLOGY ● HOME-BREWED, CUTTING-EDGE & SCALABLE technology solutions ● > 2,000 employees at 8 international tech locations ● HQs in Berlin ● help our brand to WIN ONLINE
5. Photo by Dawn Armfield on Unsplash
6. THE AGE OF THE MONOLITH Request Single, large boxes that did everything Jimmy The Monolith Response
7. MONITORING THE MONOLITH Ops Monitoring ● Is the box alive? ● Is the monolith process up? Devs Monitoring ● Are requests returning errors? ● Are requests reasonably fast? Photo by Deneen LT on Pexels
8. MODERN MICROSERVICES ARCHITECTURES Amazon internal service dependency visualization
9. EXAMPLE - PLACING AN ORDER Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
10. MONITORING MICROSERVICES "DevOps" Monitoring ● Is the box alive? ● Is the micro-service process up? ● Are requests returning errors? ● Are requests reasonably fast? Photo by Antoine Plüss on Unsplash
11. FAILURE PLACING AN ORDER 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service 🤔
12. ALERTS ON FAILURE PLACING AN ORDER 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service A Queue of Sorts Order Service Another Shady Service Logistics Service Coupon Service Stock Reservation Service Machine Learning Shenanigans Random BI Service Accounting Service ⚠
13. ALERTS ON FAILURE PLACING AN ORDER 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service A Queue of Sorts Order Service Another Shady Service Logistics Service Coupon Service Stock Reservation Service Machine Learning Shenanigans Random BI Service Accounting Service Photo by Antoine Plüss on Unsplash ⚠
14. SYMPTOM BASED ALERTING RULE 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Good signal to noise ratio. Create an alert rule "here" Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
15. ALERT ON THE SYMPTOM 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service 🤔
16. ALERT ON THE SYMPTOM 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Single alert triggered Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service ⚠
17. ALERT ON THE SYMPTOM - DIFFERENT ISSUE 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Risk Service 🤔 Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
18. ALERT ON THE SYMPTOM - DIFFERENT ISSUE 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Single alert triggered Risk Service ⚠ Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
19. PLACING AN ORDER - ALERT BOMBING 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Single alert triggered Risk Service ⚠ Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
20. ALERTING FOR MICROSERVICES
21. ADAPTIVE PAGING Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem.
22. DISTRIBUTED TRACING AND OPENTRACING ● A trace tells the story of a transaction or workflow as it propagates through a distributed system. ● It's basically a directed acyclic graph (DAG), with a clear start and a clear end - no loops. ● A trace is made up of spans representing contiguous segments of work in that trace. ● OpenTracing is a set of vendor-neutral APIs and a code instrumentation standard for distributed tracing.
23. OPENTRACING CONCEPTS Span: a named operation which records the duration, usually a remote procedure call, with optional Tags and Logs. Spans
24. OPENTRACING CONCEPTS Tag: A "mostly" arbitrary Key:Value pair (value can be a string, number or bool) Tags
25. OPENTRACING SEMANTIC CONVENTIONS
Span tag name | Type | Notes and examples
component | string | The software package, framework, library, or module that generated the associated Span. E.g., "checkout-service".
error | bool | true if and only if the application considers the operation represented by the Span to have failed.
peer.service | string | Remote service name (for some unspecified definition of "service"). E.g., "accounting-service".
span.kind | string | Either "client" or "server" for the appropriate roles in an RPC.
… and more (OpenTracing semantic conventions)
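To make the span and tag concepts concrete, here is a minimal sketch of a span carrying the four tags listed above. The `Span` class, its field names, and the `book_invoice` operation are illustrative assumptions, not the OpenTracing API or anything shown in the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Tag values may be strings, numbers, or booleans.
TagValue = Union[str, int, float, bool]


@dataclass
class Span:
    """Hypothetical in-memory span: a named, timed operation with tags."""
    operation: str
    duration_ms: float
    tags: Dict[str, TagValue] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)

    def failed(self) -> bool:
        # Per the convention above, error is a bool tag set to true on failure.
        return self.tags.get("error") is True


# A client-side span for an outgoing call (all values are made up).
client_call = Span(
    operation="book_invoice",
    duration_ms=12.5,
    tags={
        "component": "checkout-service",
        "span.kind": "client",
        "peer.service": "accounting-service",
        "error": True,
    },
)
```

A span without an `error` tag is treated as successful here, which is an assumption of this sketch.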
26. OPENTRACING MONITORING SIGNALS Latency Failed operation (error=true) The Four Golden Signals SRE Book, Chapter 6: Monitoring Distributed Systems
27. ERROR RATE ALERTING RULE Alert triggered. component: checkout_service && operation: place_order && error: true
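One way to read the rule above is as a filter on span attributes plus a ratio of failed to total matching spans. The sketch below assumes spans are flattened into plain dictionaries (with the operation name stored under an `operation` key) and uses a made-up 5% threshold; neither the data shape nor the threshold comes from the talk.

```python
from typing import Dict, Iterable, Union

Attrs = Dict[str, Union[str, int, float, bool]]

# Selector from the slide; the error flag is evaluated separately.
RULE = {"component": "checkout_service", "operation": "place_order"}
THRESHOLD = 0.05  # hypothetical: alert above 5% failed operations


def matches(span: Attrs, rule: Attrs) -> bool:
    """True if every key in the rule has the same value on the span."""
    return all(span.get(key) == value for key, value in rule.items())


def error_rate(spans: Iterable[Attrs], rule: Attrs) -> float:
    """Share of matching spans tagged error=true (0.0 if none match)."""
    matching = [s for s in spans if matches(s, rule)]
    if not matching:
        return 0.0
    failed = sum(1 for s in matching if s.get("error") is True)
    return failed / len(matching)


def should_alert(spans: Iterable[Attrs], rule: Attrs = RULE,
                 threshold: float = THRESHOLD) -> bool:
    return error_rate(spans, rule) > threshold


# Made-up sample window: four place_order spans, one of them failed.
window = [
    {"component": "checkout_service", "operation": "place_order", "error": True},
    {"component": "checkout_service", "operation": "place_order"},
    {"component": "checkout_service", "operation": "place_order"},
    {"component": "checkout_service", "operation": "place_order"},
    {"component": "order_service", "operation": "create_order", "error": True},
]
```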
28. ALERT PAYLOAD
29. WALKING THROUGH A TRACE 1. Starting at the span which was defined as the signal - place_order
30. WALKING THROUGH A TRACE 1. Starting at the span which was defined as the signal - place_order 2. Inspect every child span's tags 3. Follow path with error=true
31. WALKING THROUGH A TRACE 1. Starting at the span which was defined as the signal - place_order 2. Inspect every child span's tags 3. Follow path with error=true 4. Rinse and repeat until no more children
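The four steps above can be sketched as a loop over a span tree. The `Span` class, the operation names, and the shape of the example trace are assumptions made for illustration. This simplified version follows only the first failing child, so it does not handle the case of several failing children.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Span:
    """Hypothetical span node: operation name, tags, and child spans."""
    operation: str
    tags: Dict[str, Union[str, int, float, bool]] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


def is_error(span: Span) -> bool:
    return span.tags.get("error") is True


def walk_to_probable_cause(signal: Span) -> Span:
    """Follow error=true children from the signal span down to the deepest one."""
    current = signal                                              # step 1
    while True:
        failed = [c for c in current.children if is_error(c)]    # step 2
        if not failed:                                            # step 4: no more failing children
            return current
        current = failed[0]                                       # step 3 (first failing child only)


# Made-up trace: place_order fails because a call two levels down fails.
trace = Span("place_order", {"component": "checkout-service", "error": True}, [
    Span("reserve_stock", {"component": "stock-reservation-service"}),
    Span("create_order", {"component": "order-service", "error": True}, [
        Span("book_invoice", {"component": "accounting-service", "error": True}),
    ]),
])

cause = walk_to_probable_cause(trace)
```

In this example the walk ends at the `book_invoice` span, so the page would go to whoever operates `accounting-service`.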
32. ALERT ON THE SYMPTOM 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Another Shady Service Signal Risk Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service 🤔
33. ALERT ON THE SYMPTOM 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox Single page dispatched to the team operating the Accounting Service Risk Service Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service �� Accounting Service ⚠
34. ALERT ON THE SYMPTOM - DIFFERENT ISSUE 👤 Web Frontend Checkout Service Payment Gateway Payment Service Customer Typical Payment Blackbox 🤔 Another Shady Service Signal Risk Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
35. ALERT ON THE SYMPTOM - DIFFERENT ISSUE 👤 Web Frontend Checkout Service Payment Gateway �� Payment Service Customer Typical Payment Blackbox Single page dispatched to the team operating the Payment Service Risk Service ⚠ Another Shady Service Logistics Service Stock Reservation Service A Queue of Sorts Coupon Service Machine Learning Shenanigans Order Service Random BI Service Accounting Service
36. ADAPTIVE PAGING
37. CHALLENGES ● Multiple child spans with error=true: ○ Follow each path, assign each probable cause a score ○ Analyze more exemplars and adjust the scores ○ Worst case scenario, page both probable causes ● Missing instrumentation or circuit breaker open: ○ Use the peer.service and span.kind=client tags to suggest which dependency would be the target ● Mapping services to escalation: ○ Owning team may not have their own on-call escalation.
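A rough sketch of how these challenges might be handled follows. It is not the implementation described in the talk. It takes the worst-case route of paging every probable cause (no scoring), falls back to `peer.service` when a failing client span has no instrumented child, and uses a hypothetical escalation table with a default for services whose team has no on-call rotation. All names and values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Span:
    """Hypothetical span node: operation name, tags, and child spans."""
    operation: str
    tags: Dict[str, Union[str, int, float, bool]] = field(default_factory=dict)
    children: List["Span"] = field(default_factory=list)


# Hypothetical mapping from service to on-call escalation.
ESCALATIONS = {"accounting-service": "accounting-oncall"}
FALLBACK_ESCALATION = "sre-oncall"  # assumed default when a team has no rotation


def page_targets(span: Span) -> List[str]:
    """Collect the services at the end of every error=true path."""
    failed = [c for c in span.children if c.tags.get("error") is True]
    if not failed:
        # End of an error path. A failing client span with no child suggests
        # the callee is uninstrumented or a circuit breaker is open, so point
        # at the dependency named in peer.service instead of the caller.
        if span.tags.get("span.kind") == "client" and "peer.service" in span.tags:
            return [str(span.tags["peer.service"])]
        return [str(span.tags.get("component", "unknown"))]
    targets: List[str] = []
    for child in failed:
        for service in page_targets(child):
            if service not in targets:  # de-duplicate, keep discovery order
                targets.append(service)
    return targets


def escalation_for(service: str) -> str:
    return ESCALATIONS.get(service, FALLBACK_ESCALATION)


# Made-up trace with two failing paths under place_order.
trace = Span("place_order", {"component": "checkout-service", "error": True}, [
    Span("call_payment", {"component": "checkout-service", "span.kind": "client",
                          "peer.service": "payment-service", "error": True}),
    Span("create_order", {"component": "order-service", "error": True}, [
        Span("book_invoice", {"component": "accounting-service",
                              "span.kind": "server", "error": True}),
    ]),
])

targets = page_targets(trace)
pages = [escalation_for(service) for service in targets]
```

Scoring the paths across several exemplar traces, as the slide suggests, would replace the "page everyone" step in this sketch.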
38. CONCLUSION Photo by Patrick Tomasso on Unsplash
39. ХВАЛА (THANK YOU) QUESTIONS? Luis Mineiro @voidmaze We're Hiring! https://jobs.zalando.com
