Zalandos Quest for Operating 10K Micro Services

1. Zalandos Quest for Operating 10K Micro Services
   DevOpsCon Berlin - June 2022
   Heinrich Hartmann - Head of SRE - Zalando SE
2. about me
   Head of SRE, Data Scientist, Mathematician
   Recent Talks / Publications
   ● How to measure Latency (P99 Conf 21)
   ● State of the Histogram (SLOConf 2021)
   ● Statistics for Engineers (2014..2019)
   ● Latency SLOs Done Right (FOSDEM 2019)
   ● Circllhist - A Histogram Data Structure of IT Operations (arxiv)
3. Zalando Business
   ● Largest Fashion Retailer in EU
   ● 10B+ Annual Revenue
   ● 50M+ active Customers
   ● 23+ Countries
   ● 17K Employees
4. Zalando Tech
   ● 2,500+ SWE on Staff
   ● 200+ teams
   ● In AWS Frankfurt
   ● Up to 10K EC2 nodes
   ● 200+ k8s clusters
   ● 5K+ Micro Services
   ● Internal Platform providing
     ○ Managed k8s
     ○ Managed Postgres
     ○ Managed Kafka
     ○ Managed ML Infrastructure
     ○ ...
5. Zalando Death Star
   Zalando Micro-Service architecture diagram ~2019 (aka "Death Star")
6. Service Diagramming Exercise @ Zalando ~2017
7. Distributed Tracing
8. Trace
   ● Website - /add-to-cart
   ● Cart API - /add
   ● Stock API - /check
   ● Stock API - /check
   ● Stock API - /check
9. (no slide text)
10. R E D (Rate, Errors, Duration)
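
A minimal sketch of how the three RED signals (request rate, error ratio, duration) could be computed from a batch of finished spans. The `Span` fields, the `red_metrics` helper, the fixed time window and the nearest-rank percentile are assumptions made for illustration; the slides do not describe how Zalando computes these numbers.

```python
import math
from dataclasses import dataclass


@dataclass
class Span:
    operation: str      # e.g. "Cart API - /add"
    duration_ms: float  # wall-clock duration of the operation
    error: bool         # True if the span was tagged as failed


def red_metrics(spans, window_seconds, quantile=0.99):
    """Return (rate per second, error ratio, duration quantile in ms)."""
    if not spans:
        return 0.0, 0.0, 0.0
    rate = len(spans) / window_seconds
    error_ratio = sum(s.error for s in spans) / len(spans)
    durations = sorted(s.duration_ms for s in spans)
    # Nearest-rank percentile: smallest value with at least
    # `quantile` of all samples at or below it.
    rank = max(1, math.ceil(quantile * len(durations)))
    return rate, error_ratio, durations[rank - 1]
```

Grouping spans by `operation` before calling `red_metrics` would give one RED triple per endpoint, which is the usual shape of a RED dashboard.
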
11. (no slide text)
12. Tracing at Zalando
    ● First introduced in 2019
    ● >3K Applications Instrumented with Tracing (OpenTracing, OpenTelemetry)
    ● 10M traced operations/second peak
    ● 3d raw data retention
    ● 50% sampling applied before ingestion
13. Sampling Error Calculator available on HeinrichHartmann.com
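
To illustrate why sampling error matters when traces are sampled before ingestion, here is a small sketch that scales a sampled count back up and attaches a standard error. It assumes independent per-operation (Bernoulli) sampling, so the sampled count is binomially distributed. The function name `estimate_total` is hypothetical, and the calculator referenced on the slide may use a different method.

```python
import math


def estimate_total(sampled_count, sampling_rate):
    """Estimate the true count and its standard error from a sampled count."""
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    estimate = sampled_count / sampling_rate
    # k ~ Binomial(N, p)  =>  Var(k / p) = N * (1 - p) / p.
    # The unknown N is replaced by its estimate.
    std_error = math.sqrt(estimate * (1 - sampling_rate) / sampling_rate)
    return estimate, std_error
```

With 50% sampling, 5,000 sampled operations give an estimate of 10,000 with a standard error of 100, about 1%. The relative error grows as counts get smaller, which is why rare operations suffer most from sampling.
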
14. Productize Operational Know-How
    ● Observability SDKs
      Productize Observability for common Engineering patterns:
      ○ Language Runtimes
      ○ HTTP/REST APIs
      ○ DB clients
      ○ other libraries
    ● Standardized Dashboards
      For supported technologies like: k8s, Redis, Kafka, Postgres, ...
15. Operation based SLOs
    P. Alves - Operation Based SLOs (engineering.zalando.com)
16. Service Based SLOs
    Service List
    ● Proxy
    ● Web Rendering Engine
    ● Checkout Service
    ● Payment Gateway
    ● Payment Service
    ● Risk Service
    ● Accounting Service
    ● Stock Service
    ● Customer Service
    ● Order Service
    ● Random BI Service
    ● Coupon Service
    ● Typical Payment Blackbox
    ● Logistics Service
    ● Mail Notification Service
    ● Authentication Service
    ● Another Shady Service
    ● Machine Learning Shenanigans
    ● Article Service
    ● …
17. Critical Business Operations
    ● Add To Wishlist
    ● Browse Catalog
    ● View Product Details Page
    ● View Cart
18. (no slide text)
19. SLO Based Alerting
20.
    ● SLOs quantify customer experience.
    ● SLO-alerting avoids false positives and implements Symptom Based Alerting.
    ● Zalando Alerting Strategy
      Page only if an SLO is at risk of being breached.
    ● Allow for cause-based non-paging alerts, but only wake people up if there is an actual user-facing problem.
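
One common way to turn "page only if an SLO is at risk of being breached" into a rule is a multi-window burn-rate check. The sketch below is an assumption about how such a rule could look, not a description of Zalando's alerting. The function names and the window pairing are hypothetical, and the default threshold of 14.4 is an illustrative value taken from widely used burn-rate examples.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is consumed (1.0 = exactly on budget)."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio, short_window_error_ratio,
                slo_target, threshold=14.4):
    """Page only when both windows burn budget faster than `threshold`.

    The long window shows the problem is significant; the short window
    shows it is still ongoing, so recovered incidents stop paging.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

For a 99.9% SLO, a 2% error ratio in both windows corresponds to a burn rate of about 20 and would page. The same error ratio in the long window with a recovered short window would not.
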
21. Adaptive Paging
    L. Mineiro - Are we on the same page? SRECon 2019 / ;login:
22. 👤 Customer trying to Place an Order
    Services in the diagram: Web Frontend, Checkout Service, Payment Gateway, Payment Service, Another Shady Service, Typical Payment Blackbox, Risk Service, Logistics Service, Stock Reservation Service, A Queue of Sorts, Coupon Service, Machine Learning Shenanigans, Order Service, Random BI Service, Accounting Service
23. The Christmas Tree Problem
    👤 Customer trying to Place an Order
    Services in the diagram: Web Frontend, Checkout Service, Payment Gateway, Payment Service, Another Shady Service, Typical Payment Blackbox, Risk Service, Logistics Service, Stock Reservation Service, A Queue of Sorts, Coupon Service, Machine Learning Shenanigans, Order Service, Random BI Service, Accounting Service
24. Adaptive Paging
    👤 Customer trying to Place an Order
    Services in the diagram: Web Frontend, Checkout Service, Payment Gateway, Payment Service, Typical Payment Blackbox, Risk Service, A Queue of Sorts, Order Service, Another Shady Service, Logistics Service, Stock Reservation Service, Coupon Service, Machine Learning Shenanigans
    ● SLO Breach
    ● Inspect SLO "Stream"
    ● Page Team with "deep" error spans
    L. Mineiro - Are we on the same page? SRECon 2019 / ;login:
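
The idea of paging the team with "deep" error spans can be sketched as a walk down the failing path of each trace. Everything in this example is a hypothetical illustration: the `TraceSpan` type, the `team` field, the team names and the majority vote across traces are assumptions. The algorithm presented in the referenced talk may differ.

```python
from dataclasses import dataclass, field


@dataclass
class TraceSpan:
    service: str
    team: str                  # hypothetical owner of the service
    error: bool = False
    children: list = field(default_factory=list)


def deepest_error_span(span):
    """Follow failing spans downwards; return (depth, span) or None."""
    if not span.error:
        return None
    best = (0, span)
    for child in span.children:
        found = deepest_error_span(child)
        if found is not None and found[0] + 1 > best[0]:
            best = (found[0] + 1, found[1])
    return best


def team_to_page(traces):
    """Pick the team that most often owns the deepest failing span."""
    votes = {}
    for root in traces:
        found = deepest_error_span(root)
        if found is not None:
            team = found[1].team
            votes[team] = votes.get(team, 0) + 1
    return max(votes, key=votes.get) if votes else None
```

In this sketch, a failing chain Web Frontend, Checkout Service, Payment Gateway, Payment Service pages only the owner of Payment Service, instead of lighting up every team along the path.
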
25. Outsource Metrics Storage
    ● In-House Monitoring System ZMON
    ● Metrics Storage Operational pain-point
    ● Outsource Metrics Storage in 2021
26. Metrics POC
    Requirements
    Highly Scalable, Reliable, "Straight Forward" metrics solution.
    Load Profile
    ● 120M metrics ingestion peak
    ● 300 rps read load
    ● No fancy analytics / query patterns
    ● 30 day data retention
    ● Cost efficient. Competitive with self-hosted solution (inc. staff costs)
    Our pick Metrics Runner-up
27. Metrics Transition Architecture
28. Next Up
    ● More Distributed Tracing
    ● More Standardisation
    ● More SLOs
    ● More Load-Testing
    Zalando Micro-Service architecture diagram ~2019 (aka. "Death Star")
    @HeinrichHartmann
