An exploration of Hudi + Presto on the NewsBreak data platform
1. Fast Ingestion, Query upon Unified Schema
A modern data platform exploration at NewsBreak
Lisheng GUAN, March 2023
2.
3. NewsBreak
4. Data Arch at NewsBreak
5.
6. Pipeline at NewsBreak
7. Pipeline at NewsBreak
Legacy CDH to AWS
9s p95
Hours → < 15min
8. Hudi at NewsBreak
9. 1. Multi Sink
2. Join first then Sink
10. Hudi at NewsBreak
Performance
11. Hudi at NewsBreak
Refinement
Join for late data
Extra upsert
12. Hudi at NewsBreak
Metrics
50 BN. written/mo
30 TB written/mo
3-10 min sync interval
10GB source-limit
10m source-limit
13. Hudi at NewsBreak
Details
1. Hudi 0.10.1 on EMR 5.36, from 0.9 (2022) and 0.7 (2021)
2. Default gzip is sufficient, 30% better than SNAPPY
3. DeltaStreamer, low code
MoR
Backport features: Protobuf schema support
Customize payload class: partial update
Customize transformer class: filtering and basic metrics
FileBasedSchemaProvider + ProtoClassBasedSchemaProvider
JsonKafkaSource + JsonDFSSource + HoodieIncrSource
4. HMS, and Presto/Spark
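Details 1-3 could be expressed as a DeltaStreamer properties sketch; the stock Hudi keys below are real, but the topic and custom class names are placeholders, not NewsBreak's actual code:

```properties
# MoR table written by DeltaStreamer (custom class names are illustrative)
hoodie.datasource.write.table.type=MERGE_ON_READ
# Default gzip parquet compression, per detail 2
hoodie.parquet.compression.codec=gzip
# Kafka JSON source topic (placeholder name)
hoodie.deltastreamer.source.kafka.topic=events
# Custom payload class enabling partial updates (placeholder)
hoodie.datasource.write.payload.class=com.example.hudi.PartialUpdatePayload
# Filtering/metrics transformer and schema provider go on the CLI, e.g.:
#   --transformer-class com.example.hudi.FilterAndMetricsTransformer
#   --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```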
14. Hudi at NewsBreak
Tips
1. Set record.size.estimate explicitly (especially for records < 1KB)
2. Hourly partition: TimestampBasedKeyGenerator
Outputformat: yyyy/MM/dd/HH
Update hoodie.table.partition.fields to the real partition field, e.g. p_event_hour
Spark SQL needs the slash format, while Presto is fine with yyyy-MM-dd-HH
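Tips 1-2 map onto Hudi keys roughly as follows (p_event_hour follows the slide's example; the size estimate value is illustrative):

```properties
# Tip 1: explicit record size estimate for small (< 1KB) records
hoodie.copyonwrite.record.size.estimate=512
# Tip 2: hourly partitions with slash-separated output format
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd/HH
# Point the table's partition field at the real column
hoodie.table.partition.fields=p_event_hour
```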
3. Presto version = 0.275 (private codebase, better Hudi support)
4. Appcache is not deleted by long-running Spark in continuous mode;
a cronjob installed via EMR bootstrap deletes it
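The appcache cleanup in tip 4 might look like the following sketch; the function name and default age are assumptions, and the real path would come from yarn.nodemanager.local-dirs on the EMR nodes:

```shell
# Hypothetical cron-driven cleanup for a long-running DeltaStreamer host:
# remove appcache entries older than max_age_min minutes (default 1 day).
cleanup_appcache() {
  local cache_root="$1" max_age_min="${2:-1440}"
  # -mindepth/-maxdepth 1 keeps the cache root itself, removing only entries in it
  find "$cache_root" -mindepth 1 -maxdepth 1 -mmin +"$max_age_min" -exec rm -rf {} +
}
# Example cron entry (path is illustrative):
#   0 * * * * cleanup_appcache /mnt/yarn/usercache/hadoop/appcache
```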
5. No handy tool for data retention
15. Presto at NewsBreak
16.
17. Presto at NewsBreak
Queries
18. Presto at NewsBreak
Metrics
2 Clusters
1,600 Cores
9s P95
550k Queries/mo
6PB S3 bytes read/mo
160 Tri. Rows read/mo
19. Presto at NewsBreak
Tips
1. Presto 0.275 (inspired by Twilio, from 0.264)
With Hudi 0.11.0 compile time dependency, better performance for cross partition queries
Custom development: GDA skip, Alluxio local cache support (2.9.2)
Presto-event-stream plugin to emit all query events to Kafka with a schema
2. Sorting by commonly filtered column(s) is super helpful
3. File size or I/O matters
Hudi small-file management, and clustering with sort column(s), is a perfect fit here
Finely-schemed fields are mandatory for performance, resources, and storage
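Tips 2-3 correspond to Hudi clustering knobs roughly like these (the sort columns are made-up examples):

```properties
# Inline clustering sorted by commonly filtered columns (example names)
hoodie.clustering.inline=true
hoodie.clustering.plan.strategy.sort.columns=event_name,user_id
# Small-file handling: files below the limit are candidates for rewrite
hoodie.clustering.plan.strategy.small.file.limit=314572800
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
```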
4. CTAS is welcomed by Data Analytics
20. Schema at NewsBreak
21.
22. Schema at NewsBreak
Adoption
23. Schema at NewsBreak
Workflow
24. Schema at NewsBreak
Tips
1. PB3 (was Avro) as the major language to describe the Kafka schemas
2. Keep JSON content in Kafka/DFS to limit the impact; binary logs would be next
UseProtoNames: true
EmitUnpopulated: true
3. Each component performs serialization & deserialization based on generated code
4. A best practice is needed for schema evolution
5. A workflow is needed to guarantee the above, and to regularize development & release
6. Ideally, most schema updates would flow into the data platform automatically, and each engineer would be more focused
25. A try at a model training pipeline
26. PB Schema → Hudi
3 min latency
500GB+/day
27. A try at a model training pipeline
Tips
1. Put logic at the beginning or end of the pipeline instead of the middle.
2. Hive-registered tables are widely accessed.
Spark SQL needs to use yyyy/MM/dd/HH for the partition field
3. Hoodie cleaner
hoodie.cleaner.commits.retained = hoodie.keep.min.commits - 1
hoodie.cleaner.commits.retained * interval >= max query execution time
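The retention rule in tip 3 can be sketched as a tiny helper (the function name and units are assumptions; both arguments in minutes):

```shell
# Derive the minimum hoodie.cleaner.commits.retained so that
# retained * sync_interval >= max query execution time (ceiling division).
min_commits_retained() {
  local interval_min="$1" max_query_min="$2"
  echo $(( (max_query_min + interval_min - 1) / interval_min ))
}
```

For example, a 5-minute sync interval with a 30-minute worst-case query gives 6 retained commits, and hoodie.keep.min.commits would then be 7 per the first equation.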
4. Time-window schedule vs. fixed-size-window schedule
28. Fast Ingestion with Hudi
29. Fast Query with Presto
30. Unified Schema Registry
31.
32. Presto event stream
Another tiny example