Hudi and Presto: A Try on the NewsBreak Data Platform

1. Fast Ingestion, Query upon Unified Schema
    A modern data platform try at NewsBreak
    Lisheng GUAN, March 2023
2.
3. NewsBreak
4. Data Arch at NewsBreak
5.
6. Pipeline at NewsBreak
7. Pipeline at NewsBreak
    Legacy CDH to AWS
    9s p95
    Hours < 15min
8. Hudi at NewsBreak
9. 1. Multi Sink 2. Join first, then Sink
10. Hudi at NewsBreak Performance
11. Hudi at NewsBreak Refinement
    Join for late data
    Extra upsert
12. Hudi at NewsBreak Metrics
    50 BN written/mo
    30 TB written/mo
    3-10 min sync interval
    10GB source-limit
    10m source-limit
13. Hudi at NewsBreak Details
    1. Hudi 0.10.1 on EMR 5.36, from 0.9 (2022) and 0.7 (2021)
    2. Default gzip is sufficient, 30% better than SNAPPY
    3. DeltaStreamer, low code
       MoR
       Backport features: Protobuf schema support
       Custom payload class: partial update
       Custom transformer class: filtering and basic metrics
       FileBasedSchemaProvider + ProtoClassBasedSchemaProvider
       JsonKafkaSource + JsonDFSSource + HoodieIncrSource
    4. HMS, and Presto/Spark
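The "partial update" payload on slide 13 can be pictured as a field-wise merge of an incoming record into the stored one. The following is a minimal Python sketch of that idea only, assuming that a null (`None`) field in the incoming record means "not provided". The function name `merge_partial` and the sample fields are hypothetical, and this is not Hudi's payload class API, which is implemented in Java.

```python
from typing import Any, Dict, Optional

Record = Dict[str, Any]


def merge_partial(existing: Optional[Record], incoming: Record) -> Record:
    """Field-wise merge: non-null incoming fields overwrite stored ones.

    Assumption (not from the slides): None means "field not provided",
    so the stored value for that field is kept.
    """
    if existing is None:
        # First write for this key: behaves like a plain insert.
        return dict(incoming)
    merged = dict(existing)
    for field, value in incoming.items():
        if value is not None:
            merged[field] = value
    return merged


if __name__ == "__main__":
    stored = {"id": 1, "title": "a", "clicks": None}
    update = {"id": 1, "title": None, "clicks": 5}
    print(merge_partial(stored, update))
```

A full upsert would replace the whole stored record; the merge above keeps `title` from the stored record because the update did not carry it.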
14. Hudi at NewsBreak Tips
    1. Set record.size.estimate explicitly (especially < 1KB)
    2. Hourly partition: TimestampBasedKeyGenerator
       Output format: yyyy/MM/dd/HH
       Update hoodie.table.partition.fields to the real partition field, e.g. p_event_hour
       Spark SQL needs the slash format, while Presto is fine with yyyy-MM-dd-HH
    3. Presto version = 0.275 (private codebase, better Hudi support)
    4. Appcache is not deleted by long-running Spark in continuous mode; use a cron job to delete it, set up via EMR bootstrap
    5. No handy tool for data retention
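Tips 1 and 2 on slide 14 translate into a handful of writer options. The sketch below collects them in a Python dict and renders them as a properties file, and adds two small helpers for converting partition values between the slash and dash formats. The option keys are Hudi configuration names from the 0.10.x line; reading the slide's `record.size.estimate` as `hoodie.copyonwrite.record.size.estimate` is an interpretation, and the values (`512`, `event_time`, `EPOCHMILLISECONDS`) are illustrative assumptions, not NewsBreak's settings.

```python
from datetime import datetime
from typing import Dict

# Keys are Hudi config names (0.10.x); values are assumed for illustration.
HOURLY_PARTITION_OPTIONS: Dict[str, str] = {
    # Average record size in bytes, set explicitly instead of estimated.
    "hoodie.copyonwrite.record.size.estimate": "512",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.TimestampBasedKeyGenerator",
    # Hypothetical source column holding the event timestamp.
    "hoodie.datasource.write.partitionpath.field": "event_time",
    "hoodie.deltastreamer.keygen.timebased.timestamp.type": "EPOCHMILLISECONDS",
    "hoodie.deltastreamer.keygen.timebased.output.dateformat": "yyyy/MM/dd/HH",
}


def to_properties(options: Dict[str, str]) -> str:
    """Render options as key=value lines, sorted for stable output."""
    return "\n".join(f"{key}={value}" for key, value in sorted(options.items()))


def slash_to_dash(partition: str) -> str:
    """Convert a yyyy/MM/dd/HH value to yyyy-MM-dd-HH, validating it on the way."""
    return datetime.strptime(partition, "%Y/%m/%d/%H").strftime("%Y-%m-%d-%H")


def dash_to_slash(partition: str) -> str:
    """Convert a yyyy-MM-dd-HH value to yyyy/MM/dd/HH, validating it on the way."""
    return datetime.strptime(partition, "%Y-%m-%d-%H").strftime("%Y/%m/%d/%H")


if __name__ == "__main__":
    print(to_properties(HOURLY_PARTITION_OPTIONS))
    print(slash_to_dash("2023/03/01/07"))
```

The helpers are only useful when one query template has to be issued against both engines; they do not change how the table itself is partitioned.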
15. Presto at NewsBreak
16.
17. Presto at NewsBreak Queries
18. Presto at NewsBreak Metrics
    2 Clusters
    1,600 Cores
    9s P95
    550k Queries/mo
    6PB S3 bytes read/mo
    160 Tri Rows read/mo
19. Presto at NewsBreak Tips
    1. Presto 0.275 (inspired by Twilio, from 0.264)
       With Hudi 0.11.0 compile-time dependency, better performance for cross-partition queries
       Custom development: GDA skip, Alluxio local cache support (2.9.2)
       Presto-event-stream plugin to emit all query events to Kafka with a schema
    2. Sorting by commonly filtered column(s) is super helpful
    3. File size or I/O matters
       Hudi small file management, and clustering with sort column(s), fits perfectly here
       Finely schemed fields are mandatory for performance, resource and storage
    4. CTAS is welcomed by Data Analytics
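Tip 2 on slide 19 (sort by a commonly filtered column) can be illustrated with a toy model of min/max file skipping. The sketch below is not Presto's or Hudi's implementation: it assumes each file exposes only the minimum and maximum of one column, and that a reader skips any file whose range cannot overlap the filter. All names and sizes are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def file_ranges(values: Sequence[int], rows_per_file: int) -> List[Tuple[int, int]]:
    """Split values into files in the given order; return (min, max) per file."""
    ranges = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        ranges.append((min(chunk), max(chunk)))
    return ranges


def files_to_read(ranges: Sequence[Tuple[int, int]], low: int, high: int) -> int:
    """Count files whose [min, max] range overlaps the filter [low, high]."""
    return sum(1 for fmin, fmax in ranges if fmin <= high and fmax >= low)


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed so runs are repeatable
    values = list(range(10_000))
    rng.shuffle(values)             # arrival order: effectively random

    unsorted_files = file_ranges(values, 1_000)
    sorted_files = file_ranges(sorted(values), 1_000)

    # Filter: 2500 <= column <= 2600
    print("unsorted:", files_to_read(unsorted_files, 2_500, 2_600))
    print("sorted:  ", files_to_read(sorted_files, 2_500, 2_600))
```

With sorted data the filter range falls inside a single file, while in the shuffled layout most files have a wide min/max range and have to be opened. This is the effect that clustering with sort columns aims for.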
20. Schema at NewsBreak
21.
22. Schema at NewsBreak Adoption
23. Schema at NewsBreak Workflow
24. Schema at NewsBreak Tips
    1. PB3 (was Avro) as the major language to describe the Kafka schemas
    2. Keep JSON content in Kafka/DFS to limit the impact; binary log would be next
       UseProtoNames: true
       EmitUnpopulated: true
    3. Each component performs serialization and deserialization based on generated code
    4. A best practice is needed for schema evolution
    5. A workflow is needed to guarantee the above, and also for regular development & release
    6. Ideally, most schema updates would flow into the data platform automatically, and each engineer would be more focused
25. A try on model training pipe
26. PB Schema Schema Hudi 3 min latency 500GB+/day
27. A try on model training pipe Tips
    1. Put logic at the beginning or end of the pipeline instead of the middle
    2. A Hive-registered table will be widely accessed
       Spark SQL needs to use yyyy/MM/dd/HH for the partition field
    3. Hoodie cleaner
       hoodie.cleaner.commits.retained = hoodie.keep.min.commits - 1
       hoodie.cleaner.commits.retained * interval >= max query execution time
    4. Time window schedule vs fixed size window schedule
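The two cleaner rules in tip 3 on slide 27 can be turned into a small calculation. The sketch below is a minimal Python helper, assuming a fixed commit (sync) interval; the function name `cleaner_settings` and the 60-minute query time in the example are hypothetical, while the 3-minute interval is taken from the sync interval range quoted in the Hudi metrics slide.

```python
import math
from typing import Dict


def cleaner_settings(sync_interval_min: float, max_query_min: float) -> Dict[str, int]:
    """Derive cleaner settings from the two rules on the slide.

    Rule 1: retained * interval >= max query execution time
    Rule 2: hoodie.cleaner.commits.retained = hoodie.keep.min.commits - 1
    """
    if sync_interval_min <= 0:
        raise ValueError("sync interval must be positive")
    # Smallest number of retained commits that still covers the longest query.
    retained = max(1, math.ceil(max_query_min / sync_interval_min))
    return {
        "hoodie.cleaner.commits.retained": retained,
        "hoodie.keep.min.commits": retained + 1,
    }


if __name__ == "__main__":
    # 3-minute sync interval, longest query assumed to run 60 minutes.
    for key, value in cleaner_settings(3, 60).items():
        print(f"{key}={value}")
```

The result is a lower bound: retaining more commits is safe for long-running queries but keeps more old file versions on storage.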
28. Fast Ingestion with Hudi
29. Fast Query with Presto
30. Unified Schema Registry
31.
32. Presto event stream Another tiny example
