Production Readiness Review in Zalando
1. Production Readiness
Review in Zalando
DevOps Finland meetup 20.08.2024
Uri Savelchev
2. DevOps Finland meetup
Agenda
1. What is Production Readiness Review?
2. What’s inside?
3. Why do we need it?
4. Useful links
5. Q & A
3. DevOps Finland meetup
What is Production Readiness Review?
The Google SRE book defines it as
... a process that identifies the reliability needs of a service based on its specific details.
Through a PRR, SREs seek to apply what they've learned and experienced to ensure the
reliability of a service operating in production. A PRR is considered a prerequisite for an
SRE team to accept responsibility for managing the production aspects of a service.
4. DevOps Finland meetup
Goal
Verify that a [new] service is production-ready in terms of
● Reliability
● Operations
● Data handling
5. DevOps Finland meetup
Steps
1. The PRR document is created from a template and
filled in by one of the team members and reviewed
within the team.
2. A Principal Engineer from another organization reviews
the document and writes down comments and questions.
3. A joint session with the PE and the team goes
through the questions and finalizes an action list.
4. The completed PRR is kept and stays valid for two
years or until the application's architecture changes.
6. DevOps Finland meetup
What’s inside?
8. DevOps Finland meetup
Sections
● Context, Background and Production Operations
● Traffic Handling and Observability
● Engineering
● Data Management and ML Models
● Release process
9. DevOps Finland meetup
Context & Background
● What are the application’s business functions?
● Who are the customers? What is the expected SLA?
● Architecture diagrams, technical design document
and other technical documents
10. DevOps Finland meetup
Operations
● Downtime and failure impacts
● Is on-call required?
● Are necessary alerts and pages tested?
● Do all the on-call engineers have required access?
11. DevOps Finland meetup
Traffic Handling
● Upstream traffic identification
● Rate limits (per upstream and global)
● Blocking bad traffic
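
The three points above can be illustrated with a small sketch. This is not Zalando's implementation; it assumes a token-bucket algorithm, and the names `TokenBucket`, `RateLimiter` and the upstream identifiers are hypothetical. A request is admitted only when the caller is a known upstream and both its own bucket and the global bucket still have a token.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the sketch can be tested
        self.tokens = capacity
        self.updated = clock()

    def has_token(self) -> bool:
        # Refill according to elapsed time, then check for a whole token.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        return self.tokens >= 1.0

    def take(self) -> None:
        self.tokens -= 1.0


class RateLimiter:
    """Applies a per-upstream limit and a global limit to every request."""

    def __init__(self, global_bucket: TokenBucket,
                 per_upstream: Dict[str, TokenBucket]):
        self.global_bucket = global_bucket
        self.per_upstream = per_upstream

    def allow(self, upstream: str) -> bool:
        bucket = self.per_upstream.get(upstream)
        if bucket is None:
            return False            # unidentified traffic is blocked
        if not (bucket.has_token() and self.global_bucket.has_token()):
            return False            # either limit is exhausted
        bucket.take()
        self.global_bucket.take()
        return True
```

Tokens are consumed only when both checks pass, so a request rejected by the global limit does not use up the caller's own budget.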
12. DevOps Finland meetup
Observability
● Dashboards (golden signals, inbound/outbound streams)
● Data stores
● Dependencies / downstream monitoring
● Are the defined alerts and logged errors actionable?
13. DevOps Finland meetup
Engineering
● Load testing and resource planning. Scaling.
● Deployment processes and timing
● Are all engineers on the team trained in the
technologies used?
14. DevOps Finland meetup
Failure Modes
● What are the anticipated ways in which this application
might fail?
● Are there single points of failure?
● Can the application be deployed successfully and then
fail to start up?
● Does the application have timeouts for calls to its
dependencies?
● Connection pools (to DBs and dependencies)
● Resilience patterns (retries, fallbacks, circuit breakers)
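
As an illustration of the resilience patterns listed above, here is a minimal sketch that combines bounded retries, a fallback and a circuit breaker. It is an assumption-laden example rather than any specific library's API: `CircuitBreaker` and `call_with_resilience` are hypothetical names, the `operation` callable is assumed to enforce its own timeout (for example through the HTTP client's timeout setting), and retry backoff is left out for brevity.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures and
    lets a trial call through once `reset_after` seconds have passed."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after:
            # Half-open: allow a trial call; one more failure reopens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return False
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_resilience(operation: Callable[[], T], fallback: Callable[[], T],
                         breaker: CircuitBreaker, max_attempts: int = 2) -> T:
    """Try `operation` up to `max_attempts` times, then use `fallback`."""
    for _ in range(max_attempts):
        if breaker.is_open():
            break                   # fail fast, do not hit the dependency
        try:
            result = operation()
        except Exception:
            breaker.record_failure()
            continue                # retry, unless the breaker just opened
        breaker.record_success()
        return result
    return fallback()
```

While the breaker is open the dependency is not called at all, which also protects a struggling downstream from retry storms.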
15. DevOps Finland meetup
Dependencies
● List of dependencies and their SLOs
● Is this service’s SLO stricter than the product of
the SLOs of the services it depends on?
● Could a failure in a downstream cause the application to
fail or respond with failures?
● Would scaling this application knock out a service it
calls?
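
The SLO question above can be checked with simple arithmetic. The sketch below assumes that every dependency is a hard, serial dependency and that failures are independent, in which case the product of the dependency SLOs is an upper bound on the availability the service can offer without mitigations such as fallbacks or caching. The function names and the example figures are hypothetical.

```python
from math import prod
from typing import Sequence


def composite_availability(dependency_slos: Sequence[float]) -> float:
    """Upper bound on availability when every dependency must succeed."""
    return prod(dependency_slos)


def slo_stricter_than_dependencies(service_slo: float,
                                   dependency_slos: Sequence[float]) -> bool:
    """True when the service promises more than its dependencies allow."""
    return service_slo > composite_availability(dependency_slos)


# Hypothetical example: three dependencies with 99.9%, 99.95% and 99.99% SLOs.
deps = [0.999, 0.9995, 0.9999]
print(f"{composite_availability(deps):.4%}")
print(slo_stricter_than_dependencies(0.999, deps))
```

With these figures the product is slightly below 99.85%, so a 99.9% SLO for the service would be stricter than its dependencies can support on their own.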
16. DevOps Finland meetup
Data & ML
● Have data recovery scenarios been tested? How long
do they take to execute?
● Can all data stores be upgraded without downtime?
● Can a single service node or process crash result in
lost data?
● How often or when are the ML models updated?
● What approach is used to verify that a newly trained
ML model is operating correctly?
17. DevOps Finland meetup
Release
● Stakeholder management
● Upstream compatibility
● Rollout / rollback plan and criteria
● Data migration risks
18. DevOps Finland meetup
Why do we need it?
19. DevOps Finland meetup
To ensure operational excellence, Zalando uses the APEC checklist.
It identifies the most common problems, but important
applications need a deeper review.
21. DevOps Finland meetup
Learn more
● James Cusick paper on architecture and production review
● AWS presentation on Production Readiness Review
● Zalando engineering blog