An exploration of Hudi + Presto on the NewsBreak data platform
1. Fast Ingestion, Query upon Unified Schema
A modern data platform exploration at NewsBreak
Lisheng GUAN, March 2023
2.
3. NewsBreak
4. Data Arch at NewsBreak
5.
6. Pipeline at NewsBreak
7. Pipeline at NewsBreak
Legacy CDH to AWS
9s p95
Hours → < 15min
8. Hudi at NewsBreak
9. 1. Multi Sink
2. Join first then Sink
10. Hudi at NewsBreak
Performance
11. Hudi at NewsBreak
Refinement
Join for late data
Extra upsert
12. Hudi at NewsBreak
Metrics
50 BN. written/mo
30 TB written/mo
3-10 min sync interval
10GB source-limit
10m source-limit
13. Hudi at NewsBreak
Details
1. Hudi 0.10.1 on EMR 5.36, from 0.9 (2022) and 0.7 (2021)
2. Default gzip is sufficient, 30% better than SNAPPY
3. DeltaStreamer, low code
MoR
Backport features: Protobuf schema support
Customize payload class: partial update
Customize transformer class: filtering and basic metrics
FileBasedSchemaProvider + ProtoClassBasedSchemaProvider
JsonKafkaSource + JsonDFSSource + HoodieIncrSource
4. HMS, and Presto/Spark
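Details 1-3 could be expressed as a DeltaStreamer properties sketch; the stock Hudi keys below are real, but the topic and custom class names are placeholders, not NewsBreak's actual code:

```properties
# MoR table written by DeltaStreamer (custom class names are illustrative)
hoodie.datasource.write.table.type=MERGE_ON_READ
# Default gzip parquet compression, per detail 2
hoodie.parquet.compression.codec=gzip
# Kafka JSON source topic (placeholder name)
hoodie.deltastreamer.source.kafka.topic=events
# Custom payload class enabling partial updates (placeholder)
hoodie.datasource.write.payload.class=com.example.hudi.PartialUpdatePayload
# Filtering/metrics transformer and schema provider go on the CLI, e.g.:
#   --transformer-class com.example.hudi.FilterAndMetricsTransformer
#   --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```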
14. Hudi at NewsBreak
Tips
1. Set record.size.estimate explicitly (especially for records < 1KB)
2. Hourly partition: TimestampBasedKeyGenerator
Outputformat: yyyy/MM/dd/HH
Update hoodie.table.partition.fields to the real partition field, e.g. p_event_hour
Spark SQL needs the slash format, while Presto is fine with yyyy-MM-dd-HH
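Tips 1-2 map onto Hudi keys roughly as follows (p_event_hour follows the slide's example; the size estimate value is illustrative):

```properties
# Tip 1: explicit record size estimate for small (< 1KB) records
hoodie.copyonwrite.record.size.estimate=512
# Tip 2: hourly partitions with slash-separated output format
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd/HH
# Point the table's partition field at the real column
hoodie.table.partition.fields=p_event_hour
```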
3. Presto version = 0.275 (private codebase, better Hudi support)
4. Appcache is not deleted by long-running Spark in continuous mode;
a cronjob installed via EMR bootstrap deletes it
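The appcache cleanup in tip 4 might look like the following sketch; the function name and default age are assumptions, and the real path would come from yarn.nodemanager.local-dirs on the EMR nodes:

```shell
# Hypothetical cron-driven cleanup for a long-running DeltaStreamer host:
# remove appcache entries older than max_age_min minutes (default 1 day).
cleanup_appcache() {
  local cache_root="$1" max_age_min="${2:-1440}"
  # -mindepth/-maxdepth 1 keeps the cache root itself, removing only entries in it
  find "$cache_root" -mindepth 1 -maxdepth 1 -mmin +"$max_age_min" -exec rm -rf {} +
}
# Example cron entry (path is illustrative):
#   0 * * * * cleanup_appcache /mnt/yarn/usercache/hadoop/appcache
```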
5. No handy tool for data retention
15. Presto at NewsBreak
16.
17. Presto at NewsBreak
Queries
18. Presto at NewsBreak
Metrics
2 Clusters
1,600 Cores
9s P95
550k Queries/mo
6PB S3 bytes read/mo
160 Tri. Rows read/mo
19. Presto at NewsBreak
Tips
1. Presto 0.275 (inspired by Twilio, from 0.264)
With Hudi 0.11.0 compile time dependency, better performance for cross partition queries
Custom development: GDA skip, Alluxio local cache support (2.9.2)
Presto-event-stream plugin to emit all query events to Kafka with a schema
2. Sorting by commonly filtered column(s) is super helpful
3. File size or I/O matters
Hudi small-file management, and clustering with sort column(s), is a perfect fit here
Finely-schemed fields are mandatory for performance, resources, and storage
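Tips 2-3 correspond to Hudi clustering knobs roughly like these (the sort columns are made-up examples):

```properties
# Inline clustering sorted by commonly filtered columns (example names)
hoodie.clustering.inline=true
hoodie.clustering.plan.strategy.sort.columns=event_name,user_id
# Small-file handling: files below the limit are candidates for rewrite
hoodie.clustering.plan.strategy.small.file.limit=314572800
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
```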
4. CTAS is welcomed by Data Analytics
20. Schema at NewsBreak
21.
22. Schema at NewsBreak
Adoption
23. Schema at NewsBreak
Workflow
24. Schema at NewsBreak
Tips
1. PB3 (was Avro) as the major language to describe the Kafka schemas
2. Keep JSON content in Kafka/DFS to limit the impact; binary logs would be next
UseProtoNames: true
EmitUnpopulated: true
3. Each component performs serialization & deserialization based on generated code
4. A best practice is needed for schema evolution
5. A workflow is needed to guarantee the above, and to regularize development & release
6. Ideally, most schema updates would flow into the data platform automatically, and each engineer would be more focused
25. A try at a model training pipeline
26. PB Schema → Hudi
3 min latency
500GB+/day
27. A try at a model training pipeline
Tips
1. Put logic at the beginning or end of the pipeline instead of the middle.
2. Hive-registered tables are widely accessed.
Spark SQL needs to use yyyy/MM/dd/HH for the partition field
3. Hoodie cleaner
hoodie.cleaner.commits.retained = hoodie.keep.min.commits - 1
hoodie.cleaner.commits.retained * interval >= max query execution time
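The retention rule in tip 3 can be sketched as a tiny helper (the function name and units are assumptions; both arguments in minutes):

```shell
# Derive the minimum hoodie.cleaner.commits.retained so that
# retained * sync_interval >= max query execution time (ceiling division).
min_commits_retained() {
  local interval_min="$1" max_query_min="$2"
  echo $(( (max_query_min + interval_min - 1) / interval_min ))
}
```

For example, a 5-minute sync interval with a 30-minute worst-case query gives 6 retained commits, and hoodie.keep.min.commits would then be 7 per the first equation.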
4. Time-window schedule vs. fixed-size-window schedule
28. Fast Ingestion with Hudi
29. Fast Query with Presto
30. Unified Schema Registry
31.
32. Presto event stream
Another tiny example