Zalando's Quest for Operating 10K Micro Services
1. Zalando's Quest for Operating 10K Micro Services
DevOpsCon Berlin - June 2022
Heinrich Hartmann - Head of SRE - Zalando SE
2. about me
Head of SRE @
Data Scientist @
Mathematician @
Heinrich Hartmann - DevOpsCon Berlin / June 2022
Recent Talks / Publications
● How to measure Latency (P99 Conf 21)
● State of the Histogram (SLOConf 2021)
● Statistics for Engineers (2014..2019)
● Latency SLOs Done Right (FOSDEM 2019)
● Circllhist - A Histogram Data Structure of IT Operations (arxiv)
3. Zalando Business
● Largest Fashion Retailer in EU
● 10B+ Annual Revenue
● 50M+ active Customers
● 23+ Countries
● 17K Employees
4. Zalando Tech
● 2,500+ SWE on Staff
● 200+ teams
● In AWS Frankfurt
● Up to 10K EC2 nodes
● 200+ k8s clusters
● 5K+ Micro Services
● Internal Platform providing
○ Managed k8s
○ Managed Postgres
○ Managed Kafka
○ Managed ML Infrastructure
○ ...
5. Zalando Death Star
Zalando Micro-Service architecture diagram ~2019 (aka. "Death Star")
6. Service Diagramming Exercise @ Zalando ~2017
7. Distributed Tracing
8. Trace
Website - /add-to-cart
Cart API - /add
Stock API - /check
Stock API - /check
Stock API - /check
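The trace on this slide can be read as a tree of spans. Below is a minimal Python sketch, assuming the three Stock API checks are children of the Cart API call (the flattened slide text does not preserve the nesting). The `Span` class and its `render` method are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One operation in a trace (hypothetical minimal model)."""
    service: str
    operation: str
    children: List["Span"] = field(default_factory=list)

    def render(self, depth: int = 0) -> List[str]:
        """Return the span tree as indented text lines, one per span."""
        lines = ["  " * depth + f"{self.service} - {self.operation}"]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


# Assumed nesting: Website -> Cart API -> three Stock API checks.
trace = Span("Website", "/add-to-cart", [
    Span("Cart API", "/add", [
        Span("Stock API", "/check"),
        Span("Stock API", "/check"),
        Span("Stock API", "/check"),
    ]),
])

print("\n".join(trace.render()))
```

Each level of indentation in the printed output corresponds to one hop in the call chain.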
10. RED (Rate, Errors, Duration)
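RED is commonly read as Rate, Errors, Duration. The sketch below shows one way such numbers could be derived from finished spans. `FinishedSpan` and `red_metrics` are hypothetical names, the sample values are illustrative only, and the maximum duration stands in for what would normally be a latency distribution or percentile.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FinishedSpan:
    """Hypothetical record of one completed operation."""
    operation: str
    duration_ms: float
    is_error: bool


def red_metrics(spans: List[FinishedSpan], window_s: float) -> Dict[str, float]:
    """Compute Rate, Errors and Duration for spans observed in one window."""
    if not spans:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "max_duration_ms": 0.0}
    errors = sum(1 for s in spans if s.is_error)
    return {
        "rate_per_s": len(spans) / window_s,
        "error_ratio": errors / len(spans),
        "max_duration_ms": max(s.duration_ms for s in spans),
    }


# Illustrative values only, not measurements from the talk.
spans = [
    FinishedSpan("/add-to-cart", 12.0, False),
    FinishedSpan("/add-to-cart", 30.0, False),
    FinishedSpan("/add-to-cart", 250.0, True),
    FinishedSpan("/add-to-cart", 18.0, False),
]
metrics = red_metrics(spans, window_s=2.0)
print(metrics)
```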
12. Tracing at Zalando
● First introduced in 2019
● >3K Applications Instrumented with Tracing (OpenTracing, OpenTelemetry)
● 10M traced operations/second peak
● 3-day raw data retention
● 50% sampling applied before ingestion
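The slide only states that 50% sampling is applied before ingestion, not how the decision is made. One common approach, shown here as an assumption rather than as Zalando's implementation, is to hash the trace ID so that all spans of a trace share the same keep/drop decision. `keep_trace` is a hypothetical helper.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float = 0.5) -> bool:
    """Decide deterministically whether to keep a trace.

    Hashing the trace ID means every span of a trace gets the same
    decision, so sampled traces stay complete.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Made-up trace IDs to show that roughly half are kept.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(keep_trace(t) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, independent collectors can sample without coordinating.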
13. Sampling Error Calculator available on HeinrichHartmann.com
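To illustrate what a sampling error is, the sketch below uses a standard binomial approximation. It is not the calculator referenced on the slide. It assumes each event is kept independently with probability `sample_rate`; `estimate_total` is a hypothetical name and the input numbers are illustrative.

```python
import math
from typing import Tuple


def estimate_total(kept: int, sample_rate: float) -> Tuple[float, float]:
    """Estimate the original event count and its standard error.

    With kept ~ Binomial(N, p), the estimate is kept / p and its
    standard error is approximately sqrt(kept * (1 - p)) / p.
    """
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be in (0, 1]")
    estimate = kept / sample_rate
    std_error = math.sqrt(kept * (1.0 - sample_rate)) / sample_rate
    return estimate, std_error


estimate, std_error = estimate_total(kept=5_000, sample_rate=0.5)
print(f"estimated total: {estimate:.0f} +/- {std_error:.0f}")
```

The relative error shrinks as the number of kept events grows, which is why sampling hurts low-traffic operations more than high-traffic ones.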
14. Productize Operational Know-How
● Observability SDKs
Productize Observability for common Engineering patterns:
○ Language Runtimes
○ HTTP/REST APIs
○ DB clients
○ other libraries
● Standardized Dashboards
For supported technologies like: k8s, Redis, Kafka, Postgres, ...
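As a rough idea of what an observability SDK could offer, the sketch below wraps a function in a decorator that records calls, errors and time spent in a uniform way. The names `observed`, `STATS` and `check_stock` are hypothetical, and the in-memory registry stands in for whatever backend a real SDK would export to.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory registry; a real SDK would export to a backend.
STATS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_s": 0.0})


def observed(operation: str):
    """Decorator that records calls, errors and time spent per operation."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                STATS[operation]["errors"] += 1
                raise
            finally:
                # Runs on success and on failure.
                STATS[operation]["calls"] += 1
                STATS[operation]["total_s"] += time.perf_counter() - start
        return wrapper
    return decorator


@observed("stock.check")
def check_stock(sku: str) -> bool:
    """Hypothetical handler; fails on an empty SKU."""
    if not sku:
        raise ValueError("missing sku")
    return True


check_stock("sku-1")
try:
    check_stock("")
except ValueError:
    pass
print(dict(STATS["stock.check"]))
```

Shipping this kind of wrapper in a shared library means every team gets the same metric names without writing instrumentation by hand.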
15. Operation Based SLOs
P. Alves - Operation Based SLOs (engineering.zalando.com)
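An operation based SLO attaches the target to a business operation rather than to a single service. The sketch below shows the basic arithmetic under that reading. `OperationSLO` is a hypothetical class, the 99.9% target is an assumed value, and the request counts are illustrative rather than figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class OperationSLO:
    """Hypothetical SLO for one critical business operation."""
    operation: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    def availability(self, total: int, failed: int) -> float:
        """Fraction of requests that succeeded."""
        return 1.0 if total == 0 else (total - failed) / total

    def budget_consumed(self, total: int, failed: int) -> float:
        """Fraction of the error budget used (1.0 means fully spent)."""
        allowed = (1.0 - self.target) * total
        return 0.0 if allowed == 0 else failed / allowed


# Illustrative numbers, not figures from the talk.
slo = OperationSLO("View Cart", target=0.999)
print(slo.availability(total=100_000, failed=50))
print(slo.budget_consumed(total=100_000, failed=50))
```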
16. Service Based SLOs
Service List
● Proxy Web
● Rendering Engine
● Checkout Service
● Payment Gateway
● Payment Service
● Risk Service
● Accounting Service
● Stock Service
● Customer Service
● Order Service
● Random BI Service
● Coupon Service
● Typical Payment Blackbox
● Logistics Service
● Mail Notification Service
● Authentication Service
● Another Shady Service
● Machine Learning Shenanigans
● Article Service
● …
17. Critical Business Operations
● Add To Wishlist
● Browse Catalog
● View Product Details Page
● View Cart
19. SLO Based Alerting
20. ● SLOs quantify customer experience.
● SLO alerting avoids false positives and implements Symptom Based Alerting.
● Zalando Alerting Strategy: Page only if an SLO is at risk of being breached.
● Allow for cause-based non-paging alerts, but only wake people up if there is an actual user-facing problem.
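The slide does not say how "at risk of being breached" is detected. One widely used technique, assumed here purely for illustration, is to compare the error budget burn rate against a threshold. `burn_rate` and `should_page` are hypothetical helpers and the threshold of 10 is an arbitrary example value.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent (1.0 = exactly on budget)."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(error_ratio: float, slo_target: float,
                threshold: float = 10.0) -> bool:
    """Page only when the budget burns much faster than sustainable.

    The threshold of 10 is an assumed value for illustration.
    """
    return burn_rate(error_ratio, slo_target) >= threshold


quiet = should_page(error_ratio=0.0005, slo_target=0.999)   # burn rate near 0.5
urgent = should_page(error_ratio=0.02, slo_target=0.999)    # burn rate near 20
print(quiet, urgent)
```

A cause-based alert (for example, high CPU on one node) would go to a non-paging channel instead of through `should_page`.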
21. Adaptive Paging
L. Mineiro - Are we on the same page? SRECon 2019 / Login;
22. [Diagram: service graph for a customer trying to place an order. Services shown: Web Frontend, Checkout Service, Payment Gateway, Payment Service, Another Shady Service, Typical Payment Blackbox, Risk Service, Logistics Service, Stock Reservation Service, A Queue of Sorts, Coupon Service, Machine Learning Shenanigans, Order Service, Random BI Service, Accounting Service.]
23. The Christmas Tree Problem
[Diagram: the same service graph as on the previous slide, for a customer trying to place an order, with markers on several of the services at once.]
24. Adaptive Paging
[Diagram: the service graph for a customer trying to place an order, annotated with: SLO Breach; Inspect SLO "Stream"; Page Team with "deep" error spans.]
L. Mineiro - Are we on the same page? SRECon 2019 / Login;
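One way to read 'Page Team with "deep" error spans' is: when an SLO is breached, walk a failing trace and page the team that owns the deepest span carrying an error. The sketch below follows that reading only. It is not the algorithm from the cited talk, and the `Span` model, the service nesting and the team names are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Span:
    """Hypothetical span with an owning team and an error flag."""
    service: str
    team: str
    error: bool = False
    children: List["Span"] = field(default_factory=list)


def deepest_error(span: Span, depth: int = 0) -> Optional[Tuple[int, Span]]:
    """Return (depth, span) of the deepest failing span, or None."""
    best = (depth, span) if span.error else None
    for child in span.children:
        found = deepest_error(child, depth + 1)
        if found and (best is None or found[0] > best[0]):
            best = found
    return best


# Illustrative trace: the error originates in Stock Reservation Service
# and propagates upward; nesting and team names are made up.
trace = Span("Web Frontend", "team-web", error=True, children=[
    Span("Checkout Service", "team-checkout", error=True, children=[
        Span("Payment Gateway", "team-payment"),
        Span("Stock Reservation Service", "team-stock", error=True),
    ]),
])

hit = deepest_error(trace)
team_to_page = hit[1].team if hit else None
print(team_to_page)
```

Only one team is paged, even though three services report errors, which is the point of avoiding the Christmas Tree Problem.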
25. Outsource Metrics Storage
● In-House Monitoring System ZMON
● Metrics Storage Operational pain-point
● Outsource Metrics Storage in 2021
26. Metrics POC Requirements
Highly Scalable, Reliable, "Straight Forward" metrics solution.
Load Profile
● 120M metrics ingestion peak
● 300 rps read load
● No fancy analytics / query patterns
● 30 day data retention
● Cost efficient. Competitive with self-hosted solution (inc. staff costs)
Our pick
Metrics
Runner-up
27. Metrics Transition Architecture
28. Next Up
● More Distributed Tracing
● More Standardisation
● More SLOs
● More Load-Testing
Zalando Micro-Service architecture diagram ~2019 (aka. "Death Star")
@HeinrichHartmann