ZMON - Monitoring Our Platform

1. ZMON - Monitoring Our Platform DevOps Meetup Dublin | September 3, 2015 | jan.mussler@zalando.de | @JanMussler
2. One of Europe's largest online fashion retailers
● 15 countries
● 3 fulfillment centers
● 16+ million active customers
● 2.2+ billion € revenue (2014)
● 130+ million visits per month
● 8,000+ employees
Visit us: tech.zalando.com
3. Zalando’s Technology History
4. (Some!) Technologies We Use
5. Monitoring Situation Until Late 2013
ICINGA plus custom frontend (ZMON 1). Did not scale with growth:
● Our UI became too slow
● Too many systems to check
● The number of teams that wanted checks grew
● Every request had to go through a single team
6. Goals of new ZMON development
● Improve performance and throughput
● Autonomy for individual teams
● Flexibility and extendability
● Integration into tooling (CMDB, DeployCtl …)
7. The basic terminology ...
● Entity: anything you may want to monitor; can be used as a "dimension"
● Check: runnable Python snippet fetching data
● Alert on check: Python expression yielding true or false
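The three terms can be sketched in a few lines of Python. This is a simplified illustration, not ZMON's actual code: `run_check` and `alert_is_active` are hypothetical names, and the check returns canned data instead of fetching anything.

```python
# Hypothetical, simplified model of entity / check / alert (not ZMON's code).
entity = {"id": "node01:8080", "type": "instance", "host": "node01"}


def run_check(entity):
    """Stand-in for a check command: returns a value for one entity."""
    # A real check would fetch data (HTTP, SQL, SNMP, ...); this one is canned.
    return {"load1": 5, "load5": 3, "load15": 2}


def alert_is_active(condition, value):
    """Evaluate an alert condition (a Python expression) against a check value."""
    # Builtins are emptied and only `value` is exposed to the expression.
    return bool(eval(condition, {"__builtins__": {}}, {"value": value}))


value = run_check(entity)
active = alert_is_active("value['load1'] > 10", value)
```

Here `active` is false because the canned `load1` is below the threshold; lowering the threshold in the condition string would flip it.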
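8. Zalando Tech - 24x7 team setup (diagram)
● Incident Team observes the Incident Team Alerts
● Incident Team Alerts: inheritance from the Database Team Alerts, with custom thresholds
● Database Team Alerts: SMS / E-Mail to the Database 2nd Level
● Incident Team calls the 2nd Level if help is needed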
9. Customizable ZMON dashboards
10. Customizable ZMON dashboards
11. Customizable ZMON dashboards
12. Display historic data using Grafana
13. ZMON’s core components (architecture diagram)
Diagram labels: Check/Alert definition, Entity data, PostgreSQL, CLI (Python), Slave, Controller (Java), Redis, Frontend (Angular), Queue/State, Scheduler (JVM), Redis, Cassandra, Workers (Python) x3, KairosDB (Java)
14. Entities
● hosts, databases, applications, instances ...
● generic key-value object
● 4000+ entities in our deployment
Entity "node01:8080":
    {
      "id": "node01:8080",
      "type": "instance",
      "host": "node01",
      "ports": {"8080": 8080, "8181": 8181},
      "application_id": "zmon",
      "application_version": "0.1.0",
      "dc": "dc1"
    }
15. Database Entity
Entity "customer-live-slave":
    {
      "id": "customer-live-slave",
      "type": "database",
      "role": "slave",
      "environment": "live",
      "shards": {
        "customer1": "customer1.db:5432/customer1",
        "customer2": "customer2.db:5432/customer2",
        "customer3": "customer3.db:5432/customer3",
        "customer4": "customer4.db:5432/customer4"
      }
    }
16. Entity Service
Integrated easy-to-use entity store with REST API
    > zmon entities push local-postgres.yaml
local-postgres.yaml:
    id: localhost:5432
    type: postgres
    host: localhost
    port: 5432
    shards:
      local_zmon_db: "localhost:5432/local_zmon_db"
17. Checks
● select a subset of entities
● execute a Python expression
  ○ powerful: uses eval with a custom context
  ○ builtins: HTTP, PostgreSQL, MySQL, Cloudwatch, Redis, SNMP, tcp, SOAP, Scalyr...
● return a "value" object
  ○ in practice, every check quickly ended up returning dicts
18. Managing checks
REST API to update / auto-import from SCM
    zmon check-definitions update select-1-check.yaml
select-1-check.yaml:
    name: "Select 1"
    owning_team: "Team 1"
    command: |
      sql().execute("select 1 as a").results()
    entities:
      - type: postgres
    interval: 15
    description: "test connection"
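How the `entities` list of a check definition might select its subset of entities can be sketched as follows. The matching rule (an entity is selected if every key/value pair of at least one filter matches) is an assumption inferred from the slides, and `matches` and `select_entities` are hypothetical names.

```python
def matches(entity, flt):
    """True if every key/value pair in the filter equals the entity's field."""
    return all(entity.get(key) == expected for key, expected in flt.items())


def select_entities(entities, filters):
    """Keep entities that match at least one filter (assumed OR semantics)."""
    return [e for e in entities if any(matches(e, f) for f in filters)]


entities = [
    {"id": "localhost:5432", "type": "postgres", "host": "localhost"},
    {"id": "node01:8080", "type": "instance", "host": "node01"},
]

# Mirrors "entities: - type: postgres" from the check definition.
selected = select_entities(entities, [{"type": "postgres"}])
selected_ids = [e["id"] for e in selected]
```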
19.
20. Alerts
● Evaluated against a check’s value; bound to a single check
● Defines team and responsible team
● Allows inheritance from another alert
● Evaluates a Python expression yielding True/False
● No "WARNING" state, no "UNKNOWN" state
● Priorities and tags
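Alert inheritance with a custom threshold can be illustrated with a minimal sketch. The field names (`parent_id`, `parameters`, `condition`) and the merge rule are assumptions made for illustration and do not claim to match ZMON's schema.

```python
# Hypothetical alert records; field names are illustrative only.
base_alert = {
    "id": 100,
    "team": "Database",
    "condition": "value['load1'] > params['threshold']",
    "parameters": {"threshold": 10},
    "priority": 2,
}
child_alert = {
    "id": 101,
    "parent_id": 100,
    "team": "Incident",
    "parameters": {"threshold": 20},  # custom threshold for the inheriting team
}
alerts_by_id = {a["id"]: a for a in (base_alert, child_alert)}


def resolve(alert, alerts_by_id):
    """Merge an alert with its ancestors; child fields override parent fields."""
    parent_id = alert.get("parent_id")
    if parent_id is None:
        return dict(alert)
    merged = resolve(alerts_by_id[parent_id], alerts_by_id)
    for key, val in alert.items():
        if key == "parameters":
            merged["parameters"] = {**merged.get("parameters", {}), **val}
        else:
            merged[key] = val
    return merged


def is_active(alert, value):
    """Evaluate the condition to True/False; there is no WARNING or UNKNOWN."""
    context = {"value": value, "params": alert.get("parameters", {})}
    return bool(eval(alert["condition"], {"__builtins__": {}}, context))


resolved = resolve(child_alert, alerts_by_id)
```

With a check value of `{"load1": 15}`, the base alert is active while the resolved child alert is not, because the child raised the threshold to 20.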
21.
22.
23. Trial Run - Quick feedback and download YAML
24. Sharing and reuse of alerts and checks
● Anyone can add alerts to checks
● Alerts are owned by a team
● Monitor application boundaries/dependencies
● Make use of inheritance to customize
25. ZMON Core + UI + KairosDB (architecture diagram)
Diagram labels: Check/Alert definition, Entity data, PostgreSQL, CLI (Python), Slave, Controller (Java), Redis, Frontend (Angular), Queue/State, Scheduler (JVM), Redis, Cassandra, Workers (Python) x3, KairosDB (Java)
26. Vagrant Box deploys Docker images
27. Downtimes
● Set or schedule downtimes using the UI
● Use the API to automate downtimes, e.g. in a deployment tool
28. Extendability - Check and Alert functions ● Improve user experience through provided functions
29. Extendability - Check and Alert functions ● Improve user experience through function wrappers
30. The Microservices World
31. Key metrics for your service?
● Request rates
● Response rates by HTTP status code
● Latency
32. Expose your data
    {
      "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
      "zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18076110580284566,
      "zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.1518180485219247,
      "zmon.response.200.GET.checks.all-active-check-definitions.meanRate": 0.06792011610723951,
      "zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512398137982051,
      "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.75thPercentile": 1173,
      "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.95thPercentile": 1233,
      "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.98thPercentile": 1282,
      "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.999thPercentile": 1282,
      "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.99thPercentile": 1282,
      "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.max": 1282,
      "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.mean": 1170,
      "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.median": 1161,
      "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.min": 1114,
      "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.stdDev": 42
    }
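A check could reduce such a flat metrics document to response counts per HTTP status code. The sketch below assumes the key layout `zmon.response.<status>.<method>.<path>.<metric>` inferred from the slide; `counts_by_status` is a hypothetical helper, and the 500 entry is invented purely to make the grouping visible.

```python
from collections import defaultdict

metrics = {
    "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
    "zmon.response.200.GET.checks.all-active-check-definitions.snapshot.99thPercentile": 1282,
    # Hypothetical extra entry, not from the slide.
    "zmon.response.500.GET.checks.all-active-check-definitions.count": 1,
}


def counts_by_status(metrics, prefix="zmon.response."):
    """Sum the '.count' metrics per HTTP status code."""
    totals = defaultdict(int)
    for key, val in metrics.items():
        if not key.startswith(prefix) or not key.endswith(".count"):
            continue
        # The first dotted segment after the prefix is the status code.
        status = key[len(prefix):].split(".", 1)[0]
        totals[status] += val
    return dict(totals)


totals = counts_by_status(metrics)
error_ratio = totals.get("500", 0) / sum(totals.values())
```

An alert condition could then compare `error_ratio` against a team-specific threshold.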
33. Start tracking your metrics
34. Display application statistics
35. Application metrics
36. Continued ...
37. Reuse of check
38. Libraries available for
● Spring Boot: https://github.com/zalando/zmon-actuator
● Clojure: https://github.com/zalando-stups/friboo/
● Play (done, to be released)
39. Multi DC / AWS
40. ZMON in AWS setup (diagram)
● Team "Foo" (*.foo.example.org): EC2 instances, ELB, ZMON Appliance
● Team "Bar" (*.bar.example.org): EC2 instances, ELB, ZMON Appliance
● ZMON Data Service, KairosDB
41. Multi DC / Zone deployment possible
● Scheduler supports queue filters by entity
  ○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
● Scheduler can apply a base filter
  ○ only handles entities with {"dc":"dc1"}
● Worker can report home using:
  ○ Redis (we use this across DCs)
  ○ HTTPS (AWS->DC)
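The queue-filter and base-filter ideas can be sketched like this. The queue names, the `pick_queue` function, and the first-match rule are all assumptions for illustration; the slides do not specify how the scheduler implements them.

```python
def matches(entity, flt):
    """True if every key/value pair in the filter equals the entity's field."""
    return all(entity.get(key) == expected for key, expected in flt.items())


# Hypothetical queue names, one per data center.
QUEUE_FILTERS = [
    ({"dc": "dc1"}, "zmon:queue:dc1"),
    ({"dc": "dc2"}, "zmon:queue:dc2"),
]
DEFAULT_QUEUE = "zmon:queue:default"


def pick_queue(entity, base_filter=None):
    """Return the queue for an entity, or None if the base filter excludes it."""
    if base_filter is not None and not matches(entity, base_filter):
        return None  # this scheduler instance does not handle the entity
    for flt, queue in QUEUE_FILTERS:
        if matches(entity, flt):
            return queue
    return DEFAULT_QUEUE


node_dc1 = {"id": "node01:8080", "dc": "dc1"}
node_dc2 = {"id": "node02:8080", "dc": "dc2"}
```

For example, `pick_queue(node_dc2)` routes to the dc2 queue, while a scheduler started with `base_filter={"dc": "dc1"}` skips that entity entirely.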
42. ZMON AWS Agent
Uses the Amazon API to fetch:
● ELBs
● EC2 instances
● RDS instances
Pushes enriched entities to the entity service
43. Prometheus? read "text" result
44. Kubernetes example: exports in Prometheus text format
    kubelet_docker_operations_latency_microseconds{operation_type="inspect_container",quantile="0.9"} 9602
    kubelet_docker_operations_latency_microseconds{operation_type="list_containers",quantile="0.9"} 9740
45. Yields a usable nested dictionary
    {
      "list_images": {"0.9": "120252", "0.99": "120252", "0.5": "120252"},
      "version": {"0.9": "1281", "0.99": "2183", "0.5": "873"},
      "list_containers": {"0.9": "9740", "0.99": "23378", "0.5": "3717"},
      "inspect_container": {"0.9": "9602", "0.99": "18367", "0.5": "4419"}
    }
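One way to get from the Prometheus text lines to such a nested dictionary is sketched below. This is not ZMON's parser: `parse_latencies` is a hypothetical helper, the regular expressions cover only simple `name{label="value",...} number` lines, and values are kept as strings as in the slide.

```python
import re

SAMPLE = '''\
kubelet_docker_operations_latency_microseconds{operation_type="inspect_container",quantile="0.9"} 9602
kubelet_docker_operations_latency_microseconds{operation_type="list_containers",quantile="0.9"} 9740
'''

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_latencies(text, metric):
    """Group samples of one metric as {operation_type: {quantile: value}}."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        operation = labels.get("operation_type")
        quantile = labels.get("quantile")
        if operation is None or quantile is None:
            continue
        result.setdefault(operation, {})[quantile] = match.group("value")
    return result


nested = parse_latencies(SAMPLE, "kubelet_docker_operations_latency_microseconds")
```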
46. Internals
47. ZMON’s basic data flow: Scheduler (JVM) -> Redis
    {
      "check": {
        "id": 1,
        "entity": {"host": "monitor01"},
        "command": "snmp().load()",
        "alerts": [
          {"id": 100, "condition": "value['load1']>10"}
        ]
      }
    }
48. ZMON’s basic data flow: Worker (Python) -> Redis
    -- store check result of "snmp().load()"
    lpush zmon:check:1:monitor01 {"load1":5,"load5":3,"load15":2}
    -- keep last 20 results (for dashboard charts)
    ltrim zmon:check:1:monitor01 0 19
    -- alert active?
    sadd zmon:alert:100 monitor01
    -- alert inactive?
    srem zmon:alert:100 monitor01
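The worker side of this flow can be sketched without a running Redis server. `FakeRedis` below is a tiny in-memory stand-in written for this example (it is not the redis-py client), and `process` is a hypothetical function; only the key names and the four commands come from the slides.

```python
import json
from collections import defaultdict


class FakeRedis:
    """In-memory stand-in implementing only the four commands used here."""

    def __init__(self):
        self.lists = defaultdict(list)
        self.sets = defaultdict(set)

    def lpush(self, key, value):
        self.lists[key].insert(0, value)

    def ltrim(self, key, start, stop):
        # Redis LTRIM keeps start..stop inclusive (non-negative indices only here).
        self.lists[key] = self.lists[key][start:stop + 1]

    def sadd(self, key, member):
        self.sets[key].add(member)

    def srem(self, key, member):
        self.sets[key].discard(member)


def process(redis, task, value, keep=20):
    """Store one check result and update the alert sets for its entity."""
    check = task["check"]
    entity_id = check["entity"]["host"]
    key = "zmon:check:%s:%s" % (check["id"], entity_id)
    redis.lpush(key, json.dumps(value))
    redis.ltrim(key, 0, keep - 1)  # keep the newest `keep` results
    for alert in check["alerts"]:
        active = bool(eval(alert["condition"], {"__builtins__": {}}, {"value": value}))
        alert_key = "zmon:alert:%s" % alert["id"]
        if active:
            redis.sadd(alert_key, entity_id)
        else:
            redis.srem(alert_key, entity_id)


task = {"check": {"id": 1, "entity": {"host": "monitor01"},
                  "command": "snmp().load()",
                  "alerts": [{"id": 100, "condition": "value['load1']>10"}]}}
redis = FakeRedis()
for load1 in range(25):
    process(redis, task, {"load1": load1, "load5": 3, "load15": 2})
```

After the loop only the 20 newest results remain in the list, and `monitor01` is a member of `zmon:alert:100` because the last `load1` value exceeds 10.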
49. Links
● ZMON Vagrant Box: https://github.com/zalando/zmon
● ZMON Homepage: https://zalando.github.io/zmon
● Zalando Tech: https://tech.zalando.com
