PostgreSQL Berlin March Meetup
1. PostgreSQL Berlin March
Meetup
Tuesday, March 5, 2024
2. Agenda
● 19:00 - 19:10 - "Introduction to PostgreSQL at Zalando"
by Matthias Adler, Team Lead Database as a Service,
Zalando SE
● 19:15 - 19:45 "PostgreSQL worst practices" by Ilya
Kosmodemiansky, CEO of dataegret.com
● 19:50 - 20:20 "Peculiarities of Logical Decoding in
PostgreSQL" by Polina Bungina, Senior Database
Engineer, Zalando SE
● 20:20 - Q&A, networking, and Pizza & Drinks
● 22:00 - The End
3. Introduction to PostgreSQL at
Zalando
Tuesday, March 5, 2024
Matthias Adler
Team Lead Database as a Service,
Zalando SE
4. PostgreSQL at Zalando
● More than 3000 database clusters on Kubernetes
● Support for PostgreSQL 12 - 16
● Backups: wal-e (archiving), wal-g (restoring)
● Notable extensions
○ Monitoring: bg_mon
○ Query statistics: pg_stat_statements
○ Partitioning: pg_partman vs. Timescale
○ LLM embeddings: pgvector
● Core Contributions
○ Submitting and reviewing patches for
PostgreSQL and maintenance of FOSS projects
5. Patroni
● The industry-standard high-availability solution for PostgreSQL
● Management of PostgreSQL fleets
○ Leader election
○ Manages PostgreSQL configuration
○ Bootstraps PostgreSQL instances
○ Handles K8s API server outages
● Recent improvements in code quality
○ Dropped Python 2.x support entirely
○ Added type hints
○ Attracted new contributors
https://github.com/zalando/patroni
6. Spilo
● The complete bundle: PostgreSQL and Patroni
● Important utility scripts
○ WAL and log shipping to AWS S3
○ User management for employees
○ Metric helpers for pgView and monitoring
○ Automatic major version upgrades
● Other container images
○ PostgreSQL Docker image for local development
○ Flyway migration template
○ SQL export to AWS S3
https://github.com/zalando/spilo
7. Postgres-Operator
● K8s operator that provisions PostgreSQL (“Spilo”) clusters
● Designed to help engineers and teams
○ Tuning capabilities (CPU, Memory, IOPS etc.)
○ Automation for database roles and privileges
○ Easy access for owning teams
○ UI to inspect manifest, backups & operator logs
● Clone functionality to easily test in production
○ Testing major version upgrade
○ Query performance analysis
https://github.com/zalando/postgres-operator
8. PG-View
● Live monitoring of PostgreSQL clusters
● Users can observe
○ High CPU load and OOM exceptions
○ Running out of free disk space
○ Replication delay and stale WAL consumers
○ Long running and/or blocking queries
● Database browser
○ Schema and table view (to check indexes)
○ Statement statistics (and its history)
○ User logins
9. Self-Service Database on K8s
apiVersion: acid.zalan.do/v1     # Kubernetes Custom Resource (CRD) definition
kind: postgresql
metadata:
  name: teapot-test              # pick a PostgreSQL cluster name
  namespace: default             # and namespace
spec:
  numberOfInstances: 2           # 2 = 1 primary, 1 streaming replica
  enableConnectionPooler: true   # enable connection pooler (using PgBouncer)
  postgresql:
    version: "15"                # major version, upgradable by you
  teamId: teapot                 # team members automatically added to database
  users:
    app_user: []                 # application user with privileges to own database
  databases:
    app_db: app_user
  volume:
    size: 50Gi                   # volume size, increasable
10. Continuous Delivery Platform
11. What you get?
● Manifest validation and delete protection
● In-place major version upgrade
● HA with automatic leader election and failover
● Continuous backups with point-in-time recovery (up to the last 5 days)
● User provisioning, OAuth2 access for employees with SSO integration
● Databases with pre-configured role setup
● And … built-in monitoring with PG-View & Grafana, etc.
12. Break
13. PostgreSQL worst practices
Tuesday, March 5, 2024
Ilya Kosmodemiansky
CEO of dataegret.com
14. Questions?
15. Break
16. Hazards of
logical decoding
Polina Bungina, PostgreSQL Berlin Meetup, 05-03-2024
17. Agenda
1. Intro
   a. CDC, approaches for PostgreSQL
2. Logical decoding
   a. The hazards
   b. Specific problems
18. CDC - Change Data Capture
● A set of design patterns used to track when and what changes occur in data, then notify other systems and services that must respond to those changes
● Maintains consistency across all systems that rely on the data
19. CDC in PostgreSQL
● Queries on Timestamp Column
● Triggers
● Logical decoding
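Of the three, triggers are the classic in-database approach. A minimal sketch of what that looks like (audit_log, audit_trg, and orders are illustrative names, not from the talk):

```sql
-- Minimal trigger-based CDC sketch: every change is copied into an audit table.
CREATE TABLE audit_log (
    id       bigserial PRIMARY KEY,
    tbl      text,
    op       text,          -- INSERT / UPDATE / DELETE
    row_data jsonb,
    at       timestamptz DEFAULT now()
);

CREATE FUNCTION audit_trg() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    INSERT INTO audit_log (tbl, op, row_data)
    VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(COALESCE(NEW, OLD)));
    RETURN COALESCE(NEW, OLD);
END $$;

CREATE TRIGGER orders_cdc AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE FUNCTION audit_trg();
```

The downside the talk moves past: the audit writes happen inside the same transaction, adding write amplification, which is what makes logical decoding attractive.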
20. Logical decoding
Process of extracting all persistent changes to a database’s tables into
a coherent, easy to understand format which can be interpreted
without detailed knowledge of the database’s internal state.
Decoding the contents of the WAL (changes on the storage level) into
the app-specific form (e.g. tuples, SQL statements)
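As a rough sketch of what this looks like in practice, using the built-in test_decoding output plugin ('demo_slot' is an illustrative name; requires wal_level = logical):

```sql
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- ...after some writes, peek at the decoded stream without consuming it:
SELECT lsn, xid, data FROM pg_logical_slot_peek_changes('demo_slot', NULL, NULL);

-- Clean up the slot once done:
SELECT pg_drop_replication_slot('demo_slot');
```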
21. The hazards of logical decoding
22. The hazards
Replication slots!
● All resources required by the consumer are defined by the slot’s position, expressed as an LSN (log sequence number)
● A resource is not removed while it is still needed by at least one consumer
  ○ WAL
  ○ System catalog rows
● Crash-safe - state is persisted to disk
  ○ but only on checkpoint
  ○ independently of the connection using it
23. The hazards - replication slots
● confirmed_flush_lsn - LSN up to which the logical slot’s consumer has confirmed receiving data
● restart_lsn - LSN of the oldest WAL that is still required by the consumer => it won’t be removed during checkpoints
● catalog_xmin - the oldest transaction affecting the system catalogs that this slot needs the database to retain => VACUUM cannot remove catalog tuples deleted by any later transaction
● All exposed in the pg_replication_slots view
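These columns can be inspected directly; a query along these lines also shows how much WAL each slot is currently pinning (the retained-WAL calculation is an assumption of this sketch, not from the slides):

```sql
SELECT slot_name,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       catalog_xmin,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots;
```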
24. (diagram)
25. The hazards - replication slots
26. The hazards - replication slots
27. The hazards
Important things to consider
● Single WAL for the whole cluster, not per-database
● Repeated message handling (after a crash/failover)
● Decoded transactions are sent on commit
  ○ [PG14] streaming of in-progress transactions
28. (diagram)
29. Logical decoding
30. Logical decoding
31. The hazards
“Big” transactions
● logical_decoding_work_mem - maximum amount of memory to be used by logical decoding before some of the decoded changes are spilled to local disk
● Spilled to: $PGDATA/pg_replslot/<slot_name>/*
● [PG14] pg_stat_replication_slots view
  ○ spill_txns, spill_count, spill_bytes
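A quick way to see whether a slot's transactions are spilling to disk (sketch; the view exists since PG14):

```sql
SELECT slot_name,
       spill_txns,   -- transactions spilled to disk
       spill_count,  -- number of spills (one transaction may spill repeatedly)
       pg_size_pretty(spill_bytes) AS spilled
FROM pg_stat_replication_slots;
```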
32. The hazards - “big” transactions
33. Specific problems
34. Specific problems
Slot is still there
● Consumer downscaling
  ○ testing a new approach: “we go live later and don’t expect new events”
  ○ misconfigured setup: “decoding doesn’t work as expected… Oh, but it is already 6pm, will handle it tomorrow.”
  ○ test environments over the weekends
35. Specific problems
But wait, I…
● do not produce writes for my publication
  ○ see “WAL is a single stream for the whole cluster” above
● do not produce any write activity in the db
  ○ scheduled vacuum/reindex/…?
  ○ autovacuum
  ○ WAL segment switches (archive_timeout) on any database activity
    ■ segment file size is always the same (16MB by default)
36. Specific problems
Slot is still there
● Cleanup
  ○ not just stopping consumption - the slot should be deleted once it is no longer needed
  ○ Patroni config: also delete it from the permanent slots section
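Cleanup itself is a single call; checking for forgotten slots first is a good habit ('my_slot' is an illustrative name):

```sql
-- Inactive slots are the usual suspects:
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots
WHERE NOT active;

-- Drop a slot that is no longer needed:
SELECT pg_drop_replication_slot('my_slot');
```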
37. Specific problems
Consumer performance
● A failing or underperforming consumer == risk of WAL files piling up
● Make use of publications (filtering):
○ FOR TABLE / FOR TABLES IN SCHEMA
○ with (publish = 'insert, update, delete, truncate')
○ unrelated events are not sent to the consumer
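A filtered publication along these lines keeps unrelated traffic away from the consumer (table and publication names are illustrative; FOR TABLES IN SCHEMA requires PG15+):

```sql
CREATE PUBLICATION outbox_pub
    FOR TABLE outbox
    WITH (publish = 'insert, update');
```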
38. Specific problems - pgjdbc oversight
Pgjdbc oversight
39. Specific problems - pgjdbc oversight
40. Specific problems - pgjdbc oversight
● Used by Debezium
● Keepalive messages were not handled properly
  ○ Keepalive message arrives with the requestReply flag
  ○ local flushPtr is not updated with the received sendPtr position
  ○ Standby Status Update message is sent with the old flushPtr
  ○ slot is not advanced unless there is a write produced for the respective publication
  ○ problem of endless shutdown
    ■ a clean shutdown == all changes are received by consumers
● Fixed in pgjdbc v42.7.0
41. Specific problems
“Big” transaction
“Transactional outbox” pattern:
● Atomically update the database and send messages to a message broker
● Messages emitted via an “outbox table” in the same transaction
● Improved way (without an additional table):
  pg_logical_emit_message(transactional boolean, prefix text, content text) → pg_lsn
  pg_logical_emit_message(transactional boolean, prefix text, content bytea) → pg_lsn
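Used inside the business transaction, the message travels through the WAL without any outbox table (table name and payload are illustrative):

```sql
BEGIN;
UPDATE orders SET status = 'paid' WHERE id = 42;
SELECT pg_logical_emit_message(
    true,                                -- transactional: decoded only on commit
    'outbox',                            -- prefix, used by the consumer to filter
    '{"order_id": 42, "event": "paid"}'  -- payload
);
COMMIT;
```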
42. Specific problems - “Big” transaction
43. Specific problems - “Big” transaction
Initial situation:
● Outbox table grew to 300GB (no immediate deletion)
● Deletion cron job had been running for several days, as the table was already too big
● Table and indexes became bloated
● Table is actively used for reads - latency spikes observed
44. Specific problems - “Big” transaction
Let’s run pg_repack!
● pg_repack builds a new table containing all the rows of the original one
  ○ INSERT INTO <temp_table> SELECT … FROM <origin_table>
● For 300GB this is a long, write-intensive transaction
● full_page_writes = 'on'
● Huge amount of WAL generated
45. 15:45
~/pgdata/pgroot/data$ du -sh pg_replslot/*
70GB    pg_replslot/my_slot
15:53
~/pgdata/pgroot/data$ du -sh pg_replslot/*
152GB   pg_replslot/my_slot
16:03
~/pgdata/pgroot/data$ du -sh pg_replslot/*
185GB   pg_replslot/my_slot
16:09 - slot dropped
~/pgdata/pgroot/data$ du -sh pg_replslot/*
8.0K    pg_replslot/my_slot
46. Thank you!
47. 🍕Pizza & Drinks 🍻
48. Thank You!
See you again soon …