POSTGRESQL MONITORING IN ZALANDO
1. POSTGRESQL MONITORING IN ZALANDO
Helsinki PostgreSQL Meetup, October 2019
2. POSTGRESQL IN ZALANDO
3. POSTGRESQL IN ZALANDO
Zalando has been working with PostgreSQL since approx. 2010 (running in DC).
2015 - migration to AWS (using RDS); the Patroni and Spilo projects were started.
2016-2017 - using provided PostgreSQL clusters (based on Spilo/STUPS);
2018-... - using PostgreSQL operator.
4. POSTGRESQL OPERATOR
When a new postgresql custom resource appears, the operator creates:
1. StatefulSet for PostgreSQL/Patroni cluster
2. Service for master node (ClusterIP or LB)
3. Service for replica nodes (ClusterIP or LB)
4. Extra DNS names for the services if needed
5. ZMON - ZALANDO MONITORING SYSTEM
6. A few facts about ZMON
● Created in 2013 during a HackWeek, in production since 2014;
● Optimized for our use case: autonomous teams, multiple Kubernetes clusters, common central storage;
● Open-sourced (Apache License 2.0) and available from GitHub;
● Stores time-series data in KairosDB (built on top of Cassandra);
● Stores infrastructure data in PostgreSQL.
7. What’s important for us?
● ZMON uses the “pull” model: its workers fetch the metrics;
● ZMON is a distributed monitoring system: the workers run in every K8s cluster;
● Autodiscovery gives the most up-to-date view of our apps;
● Checks are centralized and may be shared between the teams;
● PostgreSQL Patroni/Spilo clusters are fully supported.
8. INFRASTRUCTURE MONITORING
9. What are the metrics we need?
● Kubernetes
○ Pods CPU
○ Network I/O
○ Free space in Persistent Volumes
○ Open TCP connections for PostgreSQL processes
● AWS
○ EBS I/O
○ ELB throughput (rarely)
○ Backup S3 bucket
10. Prometheus
11. POSTGRESQL INTERNAL METRICS
12. How to collect internal metrics
● The ZMON worker runs in the same K8s cluster as the PostgreSQL nodes;
● It supports running SQL queries natively;
● We only need to figure out the proper tables/views to select from.
13. What are the metrics we need?
● Server (pg_stat_activity)
○ idle transactions
○ failed login attempts
● Tables (pg_stat_user_tables)
○ size / seq_scans / inserts / updates / deletes
● Indexes (pg_stat_user_indexes)
○ size / scans
● Backups
○ WAL archiver status (pg_stat_archiver)
○ age of last backup (S3 bucket check)
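The table and archiver items above map fairly directly onto the standard statistics views. The following is a minimal sketch of the kind of queries a check could run. The selected columns, the ordering and the LIMIT are assumptions for illustration, not the queries actually used in ZMON.

```sql
-- Per-table size and activity counters (counters are cumulative
-- since the last statistics reset).
SELECT schemaname,
       relname,
       pg_total_relation_size(relid) AS total_bytes,  -- table + indexes + TOAST
       seq_scan,
       n_tup_ins,
       n_tup_upd,
       n_tup_del
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;  -- assumption: only the largest tables are of interest

-- WAL archiver status: a growing failed_count or a large
-- seconds_since_last_archive suggests that archiving is stuck.
SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_archived_time,
       last_failed_wal,
       last_failed_time,
       extract(epoch FROM now() - last_archived_time) AS seconds_since_last_archive
FROM pg_stat_archiver;
```

Because the counters are cumulative, a monitoring system would typically chart their rate of change rather than the raw values.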
14. Idle transactions
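Idle transactions can be read from pg_stat_activity. The query below is a sketch only; the 5-minute threshold is an assumed value and should be tuned to the workload.

```sql
-- Sessions that have an open transaction but are currently doing nothing.
SELECT pid,
       usename,
       datname,
       application_name,
       now() - xact_start   AS transaction_age,
       now() - state_change AS idle_for
FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'idle in transaction (aborted)')
  AND now() - state_change > interval '5 minutes'  -- assumed threshold
ORDER BY xact_start;
```

A check would usually alert on the number of returned rows or on the largest idle_for value.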
15. Index scans
It looks like a problem!
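Index scan counts come from pg_stat_user_indexes. One possible way to look for suspicious indexes, for example large ones that are rarely or never scanned, is sketched below; the ordering is an assumption about what is worth looking at first.

```sql
-- Index size and cumulative scan count, least-used indexes first.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_relation_size(indexrelid) AS index_bytes,
       idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC;
```

idx_scan is cumulative since the last statistics reset, so a sudden change in its rate over time is often more telling than the absolute number.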
16. But what about authentication?
● The ZMON worker uses its own credentials to connect to all the databases in the cluster. These credentials are separate from the application credentials;
● The ZMON worker user is unique in every K8s cluster: robot_cluster_name;
● The PostgreSQL role robot_zmon is created during the database setup to authorize the ZMON user to access the database objects;
● ZMON may also be authorized to run queries against the application tables or views, but the permissions for the ZMON role must be granted explicitly in the DB:
GRANT SELECT ON ALL TABLES
IN SCHEMA public TO robot_zmon;
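The role setup described above could look roughly like the sketch below. It assumes that robot_zmon is a group role without login and that robot_cluster_name stands for the already existing per-cluster ZMON login user. Both are interpretations of the slide, not the actual Spilo setup scripts.

```sql
-- Assumption: robot_zmon is a NOLOGIN group role that only carries privileges.
CREATE ROLE robot_zmon NOLOGIN;

-- Assumption: robot_cluster_name is a placeholder for the existing
-- per-cluster ZMON login user; it inherits privileges via membership.
GRANT robot_zmon TO robot_cluster_name;

-- Explicit, read-only access to the application objects.
GRANT USAGE ON SCHEMA public TO robot_zmon;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO robot_zmon;

-- GRANT ... ON ALL TABLES covers only tables that already exist.
-- Default privileges cover tables created later by the role that
-- runs this statement.
ALTER DEFAULT PRIVILEGES IN SCHEMA public
    GRANT SELECT ON TABLES TO robot_zmon;
```

Keeping the privileges on a separate role means the monitoring user never needs superuser rights.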
17. Application metrics
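An application metric is typically a small aggregate over an application table or view. The example below is purely illustrative: the orders table and its status and created_at columns are hypothetical and do not come from the talk.

```sql
-- Hypothetical application table: orders(id, status, created_at).
-- Counts recent orders per status; the 5-minute window is an assumption.
SELECT status,
       count(*) AS orders_last_5_min
FROM orders
WHERE created_at >= now() - interval '5 minutes'
GROUP BY status;
```

A query like this runs on every check interval, so it should be cheap; here that would usually mean an index on created_at and a narrow time window.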
18. Key Takeaways
● Get useful metrics from your infrastructure (K8s, AWS, …)
● Collect PostgreSQL metrics from the pg_stat_… views
● Get your application metrics by querying your tables or views directly (but beware of the performance impact)
● Treat your monitoring tool as another application, not as the superuser
19. Feel free to reach me:
Uri Savelchev
<uri.savelchev@zalando.fi>
Find out more about our Culture, People & Jobs:
● ON SOCIAL MEDIA:
LinkedIn @Zalando SE
Facebook @ Inside Zalando
Instagram @insidezalando
Twitter @ZalandoTech
● CAREER WEBSITE: jobs.zalando.com
● CORPORATE WEBSITE: corporate.zalando.com