Production Readiness Review in Zalando

1. Production Readiness Review in Zalando
DevOps Finland meetup, 20.08.2024
Uri Savelchev
2. Agenda
   1. What is Production Readiness Review?
   2. What’s inside?
   3. Why do we need it?
   4. Useful links
   5. Q & A
3. What is Production Readiness Review?
The Google SRE book defines it as:
"... a process that identifies the reliability needs of a service based on its specific details. Through a PRR, SREs seek to apply what they've learned and experienced to ensure the reliability of a service operating in production. A PRR is considered a prerequisite for an SRE team to accept responsibility for managing the production aspects of a service."
4. Goal
Verify that a [new] service is production-ready in terms of:
● Reliability
● Operations
● Data handling
5. Steps
   1. A team member creates the PRR document from a template and fills it in; the team then reviews it.
   2. A Principal Engineer (PE) from another organization reviews the document and writes up comments and questions.
   3. The PE and the team go through the questions in a joint session and finalize an action list.
   4. The completed PRR is kept and stays valid for two years or until the application undergoes architecture-level changes.
6. What’s inside?
8. Sections
● Context, Background and Production Operations
● Traffic Handling and Observability
● Engineering
● Data Management and ML Models
● Release process
9. Context & Background
● What are the application’s business functions?
● Who are the customers? What is the expected SLA?
● Architecture diagrams, technical design document and other technical documents
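
To make the SLA question concrete, an availability target can be translated into a downtime allowance. The snippet below is only a sketch of that arithmetic; the 99.9% target and the 30-day window are illustrative assumptions, not figures from the talk.

```python
def allowed_downtime_minutes(availability_target, window_days=30):
    """Minutes of full downtime permitted by an availability target over a window."""
    window_minutes = window_days * 24 * 60
    return (1 - availability_target) * window_minutes


# Assumed example: a 99.9% target over a 30-day window.
budget = allowed_downtime_minutes(0.999)
print(f"{budget:.1f} minutes")  # 43.2 minutes
```

A number like this gives the review a shared reference point when discussing downtime impact and whether on-call is required.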
10. Operations
● Downtime and failure impacts
● Is on-call required?
● Are necessary alerts and pages tested?
● Do all the on-call engineers have required access?
11. Traffic Handling
● Upstream traffic identification
● Rate limits (per upstream and global)
● Blocking bad traffic
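
One way to picture "per upstream and global" rate limits is a token bucket per identified upstream plus one shared bucket. The following is a minimal sketch under that assumption; the talk does not say how Zalando implements limits, and the class names, upstream names, and numbers are hypothetical.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def available(self):
        # Refill according to elapsed time, then report whether a token exists.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        return self.tokens >= 1

    def take(self):
        self.tokens -= 1


class RateLimiter:
    def __init__(self, global_bucket, per_upstream):
        self.global_bucket = global_bucket
        self.per_upstream = per_upstream  # upstream id -> TokenBucket

    def allow(self, upstream):
        bucket = self.per_upstream.get(upstream)
        if bucket is None:
            return False  # unidentified upstream: treat as bad traffic
        # Check both buckets before taking, so a rejected request consumes nothing.
        if bucket.available() and self.global_bucket.available():
            bucket.take()
            self.global_bucket.take()
            return True
        return False


# Frozen clock so the demo is deterministic (no refill happens).
frozen = lambda: 0.0
limiter = RateLimiter(
    global_bucket=TokenBucket(rate=1, capacity=3, clock=frozen),
    per_upstream={
        "checkout": TokenBucket(rate=1, capacity=2, clock=frozen),
        "search": TokenBucket(rate=1, capacity=2, clock=frozen),
    },
)
decisions = [limiter.allow(u) for u in
             ["checkout", "checkout", "checkout", "search", "search", "unknown"]]
print(decisions)  # [True, True, False, True, False, False]
```

In the demo the third "checkout" call is rejected by its own limit, the second "search" call is rejected by the global limit, and the unidentified upstream is rejected outright.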
12. Observability
● Dashboards (golden signals, inbound/outbound streams)
● Data stores
● Dependencies / downstream monitoring
● Are the defined alerts and logged errors actionable?
13. Engineering
● Load testing and resource planning. Scaling.
● Deployment processes and timing
● Are all engineers in the team trained in the technologies used?
14. Failure Modes
● What are the anticipated ways in which this application might fail?
● Are there single points of failure?
● Can the application be deployed successfully and then fail to start up?
● Does the application have timeouts for calls to its dependencies?
● Connection pools (to DBs and dependencies)
● Resilience patterns (retries, fallbacks, circuit breakers)
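
The resilience patterns named above can be sketched in a few lines. This is an illustrative sketch, not the tooling used at Zalando: `CircuitBreaker`, `call_with_retries`, and `fetch_with_fallback` are hypothetical names, and the dependency call is assumed to enforce its own timeout and raise `TimeoutError` when it expires.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency calls are suspended")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retries(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry fn on TimeoutError with exponential backoff; re-raise the last error."""
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def fetch_with_fallback(breaker, fn, fallback, sleep=time.sleep):
    """Return fn's result, or the fallback if retries fail or the circuit is open."""
    try:
        return breaker.call(lambda: call_with_retries(fn, sleep=sleep))
    except (TimeoutError, CircuitOpenError):
        return fallback


def always_times_out():
    raise TimeoutError("dependency did not answer in time")


demo_breaker = CircuitBreaker(failure_threshold=2, clock=lambda: 0.0)
no_sleep = lambda seconds: None  # skip real waiting in the demo
results = [fetch_with_fallback(demo_breaker, always_times_out, "cached", sleep=no_sleep)
           for _ in range(3)]
print(results, demo_breaker.failures)  # ['cached', 'cached', 'cached'] 2
```

The third call never reaches the dependency: after two failed calls the breaker is open, so the fallback is returned immediately. Counting one failure per exhausted retry sequence, rather than per attempt, is a design choice of this sketch.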
15. Dependencies
● List of dependencies and their SLOs
● Is this service’s SLO stricter than the product of the SLOs of the services it depends on?
● Could a failure in a downstream cause the application to fail or respond with failures?
● Would scaling this application knock out a service it calls?
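
The SLO question above is a short calculation. The sketch below assumes that every dependency is needed for each request and that dependencies fail independently, in which case the product of their SLOs is an upper bound on what the service can promise. The dependency names and figures are made up for illustration.

```python
from math import prod


def composite_availability(dependency_slos):
    """Upper bound on availability when every dependency is required per request."""
    return prod(dependency_slos)


# Hypothetical dependencies and their availability SLOs.
dependency_slos = {"auth": 0.999, "catalog": 0.9995, "payments": 0.999}
service_slo = 0.999

bound = composite_availability(dependency_slos.values())
slo_is_achievable = service_slo <= bound

print(f"bound={bound:.4f} achievable={slo_is_achievable}")
# bound=0.9975 achievable=False
```

Here the service promises 99.9% while its dependencies together support only about 99.75%, so the review would ask how the gap is closed, for example through fallbacks or caching.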
16. Data & ML
● Have data recovery scenarios been tested? How long do they take to execute?
● Can all data stores be upgraded without downtime?
● Can a single service node or process crash result in lost data?
● How often or when are the ML models updated?
● What approach is used to verify that a newly trained ML model is operating correctly?
17. Release
● Stakeholder management
● Upstream compatibility
● Rollout / rollback plan and criteria
● Data migration risks
18. Why do we need it?
19. To ensure operational excellence, Zalando uses the APEC checklist. It identifies the most common problems, but important applications need a deeper review.
21. Learn more
● James Cusick paper on architecture and production review
● AWS presentation on Production Readiness Review
● Zalando engineering blog
