ZMON - Monitoring Our Platform
1. ZMON - Monitoring Our Platform
DevOps Meetup Dublin | September 3, 2015 | jan.mussler@zalando.de | @JanMussler
2. ONE of EUROPE’S LARGEST ONLINE FASHION RETAILERS
15 countries
3 fulfillment centers
16+ million active customers
€2.2+ billion revenue in 2014
130+ million visits per month
8,000+ employees
Visit us: tech.zalando.com
3. Zalando’s Technology History
4. (Some!) Technologies We Use
5. Monitoring Situation Until Late 2013
ICINGA plus custom frontend (ZMON 1)
Did not scale with growth:
● Our UI became too slow
● Too many systems to check
● Number of teams that wanted checks grew
● Every request had to go through a single team
6. Goals of new ZMON development
Improve performance and throughput
Autonomy for individual teams
Flexibility and extendability
Integration into tooling (CMDB, DeployCtl …)
7. The basic terminology ...
Entity:
Anything you may want to monitor
Can be used as a "dimension"
Checks:
Runnable Python snippet fetching data
Alert on Check:
Python expression yielding true or false
8. Zalando Tech - 24x7 team setup
[Diagram. Labels: Incident Team (observes alerts; calls 2nd level if help is needed); Database Team (2nd level; alerts via inheritance with custom thresholds; notified by SMS / E-Mail).]
9. Customizable ZMON dashboards
10. Customizable ZMON dashboards
11. Customizable ZMON dashboards
12. Display historic data using Grafana
13. ZMON’s core components
[Architecture diagram. Components shown:]
● PostgreSQL: check/alert definitions, entity data
● CLI (Python)
● Controller (Java)
● Frontend (Angular)
● Redis (two instances shown; labels: Slave, Queue/State)
● Scheduler (JVM)
● Workers (Python)
● KairosDB (Java)
● Cassandra
14. Entities
● hosts, databases, applications, instances ...
● generic key value object
● 4000+ entities in our deployment
Entity "node01:8080"
{
"id": "node01:8080",
"type": "instance",
"host": "node01",
"ports": {"8080":8080,"8181":8181},
"application_id": "zmon",
"application_version": "0.1.0",
"dc":"dc1"
}
15. Database Entity
Entity: customer-live-slave
{
"id": "customer-live-slave",
"type": "database",
"role": "slave",
"environment": "live",
"shards": {
"customer1": "customer1.db:5432/customer1",
"customer2": "customer2.db:5432/customer2",
"customer3": "customer3.db:5432/customer3",
"customer4": "customer4.db:5432/customer4"
}
}
16. Entity Service
Integrated easy-to-use entity store with REST API
> zmon entities push local-postgres.yaml
local-postgres.yaml:
id: localhost:5432
type: postgres
host: localhost
port: 5432
shards:
  local_zmon_db: "localhost:5432/local_zmon_db"
17. Checks
● select subset of entities
● executes Python expression
○ powerful: uses eval with a custom context
○ Builtins: HTTP, PostgreSQL, MySQL, CloudWatch, Redis, SNMP, TCP, SOAP, Scalyr ...
● returns a "value" object
○ In practice, almost every check soon returned dicts
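The check model above (select entities, then evaluate a Python expression in a custom context) can be sketched roughly as follows. This is a minimal illustration, not ZMON's implementation: `matches`, `select_entities`, `run_check`, and the `fake_load` builtin are hypothetical names, and emptying `__builtins__` is not a real security sandbox.

```python
from typing import Any, Dict, List

Entity = Dict[str, Any]


def matches(entity: Entity, entity_filter: Dict[str, Any]) -> bool:
    """True if every key/value pair of the filter is present in the entity."""
    return all(entity.get(k) == v for k, v in entity_filter.items())


def select_entities(entities: List[Entity], filters: List[Dict[str, Any]]) -> List[Entity]:
    """Keep entities that match at least one filter (filters are OR-ed)."""
    return [e for e in entities if any(matches(e, f) for f in filters)]


def run_check(command: str, entity: Entity) -> Any:
    """Evaluate a check command with a small, per-entity context."""
    context = {
        "__builtins__": {},  # hide Python's default builtins from the snippet
        "entity": entity,
        # Hypothetical builtin standing in for real ones such as http() or sql().
        "fake_load": lambda: {"load1": 5, "load5": 3, "load15": 2},
    }
    return eval(command, context)


entities = [
    {"id": "node01:8080", "type": "instance", "dc": "dc1"},
    {"id": "customer-live-slave", "type": "database"},
]
selected = select_entities(entities, [{"type": "instance"}])
values = {e["id"]: run_check("fake_load()", e) for e in selected}
print(values)
```

Only the entity of type "instance" is selected, and the check value for it is the dict returned by the stand-in builtin.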
18. Managing checks
REST API to update / auto-import from SCM
> zmon check-definitions update select-1-check.yaml
select-1-check.yaml:
name: "Select 1"
owning_team: "Team 1"
command: |
  sql().execute("select 1 as a").results()
entities:
  - type: postgres
interval: 15
description: "test connection"
19.
20. Alerts
● Executes using a check’s value, bound to single check
● Defines team and responsible team
● Allows inheritance from other alert
● Evaluates Python expression yielding True/False
● No "WARNING" state, no "UNKNOWN" state
● Priorities and tags
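As a rough sketch of the alert model above: an alert condition is a Python expression evaluated against the check's value, and the result is coerced to one of exactly two states. The helper names `evaluate_alert` and `update_active` are assumptions for illustration; the condition string is the one used in the data-flow example later in this deck.

```python
from typing import Any, Set


def evaluate_alert(condition: str, value: Any) -> bool:
    """Evaluate an alert condition against a check value; only True or False."""
    return bool(eval(condition, {"__builtins__": {}}, {"value": value}))


def update_active(active: Set[str], entity_id: str, firing: bool) -> None:
    """Track which entities currently trigger the alert."""
    if firing:
        active.add(entity_id)
    else:
        active.discard(entity_id)


condition = "value['load1']>10"
active: Set[str] = set()

update_active(active, "monitor01", evaluate_alert(condition, {"load1": 12}))
print(active)  # monitor01 is in the active set

update_active(active, "monitor01", evaluate_alert(condition, {"load1": 5}))
print(active)  # the set is empty again
```

Because the result is forced through `bool`, there is no room for a "WARNING" or "UNKNOWN" state in this sketch; how ZMON treats a condition that raises an exception is not covered here.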
21.
22.
23. Trial Run - Quick feedback and download YAML
24. Sharing and reuse of alerts and checks
Anyone can add alerts to checks
Alerts are owned by team
Monitor application boundaries/dependencies
Make use of inheritance to customize
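One way to picture alert inheritance is as a child definition that overrides selected fields of its parent. The sketch below assumes hypothetical field names (`parent_id`, `check_id`, `team`, `condition`); it is not ZMON's actual alert schema.

```python
from typing import Any, Dict, Optional

AlertDef = Dict[str, Any]


def resolve_alert(alert_id: int, alerts: Dict[int, AlertDef]) -> AlertDef:
    """Merge an alert with its ancestors; child fields win over parent fields."""
    alert = alerts[alert_id]
    parent_id: Optional[int] = alert.get("parent_id")
    if parent_id is None:
        return dict(alert)
    resolved = resolve_alert(parent_id, alerts)
    # Only fields the child actually sets (non-None) override the parent.
    resolved.update({k: v for k, v in alert.items() if v is not None})
    return resolved


alerts = {
    100: {"id": 100, "check_id": 1, "team": "Incident",
          "condition": "value['load1']>10", "parent_id": None},
    # The child reuses the parent's check but sets its own threshold and team.
    101: {"id": 101, "team": "Database",
          "condition": "value['load1']>20", "parent_id": 100},
}
resolved = resolve_alert(101, alerts)
print(resolved["check_id"], resolved["team"], resolved["condition"])
```

The child alert keeps the parent's check but evaluates a stricter threshold for its own team, which is the kind of customization the slide refers to.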
25. ZMON Core + UI + KairosDB
[Architecture diagram. Components shown: PostgreSQL (check/alert definitions, entity data), CLI (Python), Controller (Java), Frontend (Angular), Redis (two instances shown; labels: Slave, Queue/State), Scheduler (JVM), Workers (Python), KairosDB (Java), Cassandra.]
26. Vagrant Box deploys Docker images
27. Downtimes
● Set or schedule downtimes using the UI
● Use API to automate downtimes, e.g. in deployment tool
28. Extendability - Check and Alert functions
● Improve user experience through provided functions
29. Extendability - Check and Alert functions
● Improve user experience through function wrappers
30. The Microservices World
31. Key Metrics for your service?
● Request rates
● Response rates by HTTP status code
● Latency
32. Expose your data
{
"zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
"zmon.response.200.GET.checks.all-active-check-definitions.fifteenMinuteRate": 0.18076110580284566,
"zmon.response.200.GET.checks.all-active-check-definitions.fiveMinuteRate": 0.1518180485219247,
"zmon.response.200.GET.checks.all-active-check-definitions.meanRate": 0.06792011610723951,
"zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.10512398137982051,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.75thPercentile": 1173,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.95thPercentile": 1233,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.98thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.999thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.99thPercentile": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.max": 1282,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.mean": 1170,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.median": 1161,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.min": 1114,
"zmon.response.200.GET.checks.all-active-check-definitions.snapshot.stdDev": 42
}
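A check could reduce flat metric keys like these to per-status totals. The following is a sketch under assumptions: the key layout `zmon.response.<status>.<method>.<path>.<metric>` is read off the sample above, `counts_by_status` is a hypothetical helper, and the 500 entry in the sample data is invented for illustration.

```python
from collections import defaultdict
from typing import Dict


def counts_by_status(metrics: Dict[str, float],
                     prefix: str = "zmon.response.") -> Dict[str, float]:
    """Sum the '.count' metrics per HTTP status code."""
    totals: Dict[str, float] = defaultdict(float)
    for key, val in metrics.items():
        if key.startswith(prefix) and key.endswith(".count"):
            # The first segment after the prefix is the status code.
            status = key[len(prefix):].split(".", 1)[0]
            totals[status] += val
    return dict(totals)


sample = {
    "zmon.response.200.GET.checks.all-active-check-definitions.count": 10,
    "zmon.response.200.GET.checks.all-active-check-definitions.oneMinuteRate": 0.105,
    # Hypothetical entry, not part of the original sample:
    "zmon.response.500.GET.checks.all-active-check-definitions.count": 2,
}
print(counts_by_status(sample))
```

An alert could then compare, for example, the 5xx total against the overall total.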
33. Start tracking your metrics
34. Display application statistics
35. Application metrics
36. Continued ...
37. Reuse of check
38. Libraries available for
Spring boot
https://github.com/zalando/zmon-actuator
Clojure
https://github.com/zalando-stups/friboo/
Play (done, to be released)
39. Multi DC / AWS
40. ZMON in AWS Setup
[Diagram. Central: ZMON Data Service, KairosDB. Per team account: Team "Foo" (*.foo.example.org) and Team "Bar" (*.bar.example.org), each with an ELB, EC2 instances, and a ZMON Appliance.]
41. Multi DC / Zone deployment possible
● Scheduler supports queue filters by entity
○ e.g. {"dc":"dc1"} vs {"dc":"dc2"} queue filters
● Scheduler can apply base filter
○ only handles entities with {"dc":"dc1"}
● Worker can report home using:
○ Redis (we use this across DCs)
○ HTTPS (AWS->DC)
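The queue-filter and base-filter ideas above might look like this in miniature. The `route` function and the queue names are hypothetical; the actual scheduler is a JVM component and its configuration format is not shown in these slides.

```python
from typing import Any, Dict, List, Optional, Tuple

Entity = Dict[str, Any]
Filter = Dict[str, Any]


def matches(entity: Entity, flt: Filter) -> bool:
    """True if every key/value pair of the filter is present in the entity."""
    return all(entity.get(k) == v for k, v in flt.items())


def route(entity: Entity,
          queue_filters: List[Tuple[str, Filter]],
          base_filter: Optional[Filter] = None,
          default_queue: str = "zmon:queue:default") -> Optional[str]:
    """Return the queue for an entity, or None if the base filter excludes it."""
    if base_filter is not None and not matches(entity, base_filter):
        return None  # this scheduler does not handle the entity at all
    for queue, flt in queue_filters:
        if matches(entity, flt):
            return queue
    return default_queue


# Hypothetical queue names, one per data center.
queue_filters = [
    ("zmon:queue:dc1", {"dc": "dc1"}),
    ("zmon:queue:dc2", {"dc": "dc2"}),
]
node = {"id": "node01:8080", "type": "instance", "dc": "dc1"}

print(route(node, queue_filters))                            # zmon:queue:dc1
print(route(node, queue_filters, base_filter={"dc": "dc2"}))  # None
```

With queue filters, one scheduler feeds workers in several data centers; with a base filter, each data center can run its own scheduler that ignores foreign entities.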
42. ZMON AWS Agent
Uses Amazon API to fetch:
● ELBs
● EC2 instances
● RDS instances
Pushes enriched entities to entity service
43. Prometheus?
read "text" result
44. Kubernetes Example: Exports in Prometheus text format
kubelet_docker_operations_latency_microseconds{operation_type="inspect_container",quantile="0.9"} 9602
kubelet_docker_operations_latency_microseconds{operation_type="list_containers",quantile="0.9"} 9740
45. Yields a usable nested dictionary
{"list_images":
{"0.9":"120252",
"0.99":"120252",
"0.5":"120252"},
"version":
{"0.9":"1281",
"0.99":"2183",
"0.5":"873"},
"list_containers":
{"0.9":"9740",
"0.99":"23378",
"0.5":"3717"},
"inspect_container":
{"0.9":"9602",
"0.99":"18367",
"0.5":"4419"}
}
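A transformation from the Prometheus text lines to a nested dictionary of this shape can be sketched as below. This is a simplified parser written for illustration, not ZMON's own: `parse_quantiles` is a hypothetical name, and the regular expressions ignore escaped quotes in label values, trailing timestamps, and metrics without labels.

```python
import re
from typing import Dict

# name{label="v",...} value
LINE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_quantiles(text: str, metric: str,
                    group_label: str = "operation_type") -> Dict[str, Dict[str, str]]:
    """Group one summary metric by a label, then by quantile (values stay strings)."""
    result: Dict[str, Dict[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        m = LINE.match(line)
        if not m or m.group("name") != metric:
            continue
        labels = dict(LABEL.findall(m.group("labels")))
        if group_label in labels and "quantile" in labels:
            result.setdefault(labels[group_label], {})[labels["quantile"]] = m.group("value")
    return result


text = '''
kubelet_docker_operations_latency_microseconds{operation_type="inspect_container",quantile="0.9"} 9602
kubelet_docker_operations_latency_microseconds{operation_type="list_containers",quantile="0.9"} 9740
'''
parsed = parse_quantiles(text, "kubelet_docker_operations_latency_microseconds")
print(parsed)
```

An alert condition can then address a single number, for example `float(value["list_containers"]["0.9"]) > 10000`.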
46. Internals
47. ZMON’s basic data flow
Scheduler
(jvm)
Redis
{"check":
{"id": 1,
"entity": {"host":"monitor01"},
"command": "snmp().load()",
"alerts":[
{"id":100,
"condition": "value['load1']>10"}
]
}
}
48. ZMON’s basic data flow
Worker
(python)
Redis
-- store check result "snmp().load()"
lpush zmon:check:1:monitor01 '{"load1":5,"load5":3,"load15":2}'
-- keep last 20 results (for dashboard charts)
ltrim zmon:check:1:monitor01 0 19
-- alert active?
sadd zmon:alert:100 monitor01
-- alert inactive?
srem zmon:alert:100 monitor01
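The worker-side steps above can be simulated without a Redis server. In this sketch `InMemoryStore` is a stand-in that mimics only the four commands used on the slide (for non-negative indexes), and `store_result` is a hypothetical helper; the key names follow the slide.

```python
import json
from collections import defaultdict
from typing import Any, Dict, List


class InMemoryStore:
    """Minimal stand-in for the Redis commands LPUSH, LTRIM, SADD, SREM."""

    def __init__(self) -> None:
        self.lists: Dict[str, List[str]] = defaultdict(list)
        self.sets: Dict[str, set] = defaultdict(set)

    def lpush(self, key: str, value: str) -> None:
        self.lists[key].insert(0, value)  # newest element first

    def ltrim(self, key: str, start: int, stop: int) -> None:
        self.lists[key] = self.lists[key][start:stop + 1]  # stop is inclusive

    def sadd(self, key: str, member: str) -> None:
        self.sets[key].add(member)

    def srem(self, key: str, member: str) -> None:
        self.sets[key].discard(member)


def store_result(store: InMemoryStore, check_id: int, entity_id: str,
                 value: Any, alerts: List[Dict[str, Any]], keep: int = 20) -> None:
    """Store a check result, trim the history, and update alert membership."""
    key = f"zmon:check:{check_id}:{entity_id}"
    store.lpush(key, json.dumps(value))
    store.ltrim(key, 0, keep - 1)
    for alert in alerts:
        firing = bool(eval(alert["condition"], {"__builtins__": {}}, {"value": value}))
        alert_key = f"zmon:alert:{alert['id']}"
        if firing:
            store.sadd(alert_key, entity_id)
        else:
            store.srem(alert_key, entity_id)


store = InMemoryStore()
alerts = [{"id": 100, "condition": "value['load1']>10"}]
for load in range(25):
    store_result(store, 1, "monitor01", {"load1": load}, alerts)

print(len(store.lists["zmon:check:1:monitor01"]))  # 20
print(store.sets["zmon:alert:100"])                # {'monitor01'}
```

After 25 results only the newest 20 remain, and because the last value (24) exceeds the threshold, `monitor01` is a member of the alert set.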
49. ZMON Vagrant Box:
https://github.com/zalando/zmon
ZMON Homepage:
https://zalando.github.io/zmon
Zalando Tech:
https://tech.zalando.com