The Paimon Streaming Lakehouse Architecture in Practice: ByteDance's Large-Scale Business Scenarios

1. 闵文俊 (Min Wenjun)

3. Contents

5. Data silos between different storage engines keep the value of the data from being fully exploited

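6. • Unified stream and batch processing • Lower cost, higher efficiency • Near-real-time processing • Open format that breaks down data silos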

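9. • 2022.01: Incubated as Flink Table Store, a subproject of Flink • 2023.03: Donated to the Apache Software Foundation and became an Apache incubating project, opening up to a broader open-source community • 2024.03: Versions 0.4 ~ 0.7 released; graduated to an Apache Top-Level Project • 2024.12: Milestone 1.0 stable release, marking the point at which streaming lakehouse technology entered a mature stage of development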

10. • Ad conversion system • Real-time data warehouse dimension-table scenarios

15. [Chart: CPU share by operation, y-axis 0% to 35%. Categories: key value copy, build lookup file, Parquet write, Others]

16. https://github.com/apache/paimon/issues/3827

17. Partition level Compaction Strategy

21. Join performance data for Paimon dimension tables

22. Full Cache Partial Cache
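
The difference between the two caching modes named on this slide can be illustrated with a minimal sketch. It assumes that "full cache" means loading the whole dimension table into local memory up front, and that "partial cache" means fetching rows on demand and keeping only recently used ones. The class names, the LRU eviction policy, and the sample data are all hypothetical and are not taken from the slides or from Paimon's implementation.

```python
from collections import OrderedDict


class FullCache:
    """Hypothetical full cache: copy every dimension row up front."""

    def __init__(self, store):
        # One full scan at startup; later lookups never touch the store.
        self.rows = dict(store)

    def get(self, key):
        return self.rows.get(key)


class PartialCache:
    """Hypothetical partial cache: load on demand, evict least recently used."""

    def __init__(self, store, capacity):
        self.store = store
        self.capacity = capacity
        self.rows = OrderedDict()
        self.misses = 0  # number of lookups that had to go to the store

    def get(self, key):
        if key in self.rows:
            self.rows.move_to_end(key)  # mark as most recently used
            return self.rows[key]
        self.misses += 1
        value = self.store.get(key)
        self.rows[key] = value
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # drop the least recently used row
        return value


# Illustrative dimension table: key -> attribute (made-up data).
dim_table = {i: f"campaign-{i}" for i in range(1000)}

full = FullCache(dim_table)
partial = PartialCache(dim_table, capacity=2)
joined = [partial.get(k) for k in (1, 2, 1, 3, 2)]
```

In this model the full cache pays a startup and memory cost proportional to the table size, while the partial cache starts immediately but pays a store lookup on every miss, which is the trade-off a comparison of the two modes would typically measure.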

26. • HDFS slow-node optimization • Sink Reuse optimization

27. Little improvement; job stability was still insufficient

29. FLINK-37375: Checkpoint supports the Operator to customize asynchronous operation

32. Writing multiple streams with UNION ALL

33. Writing with Partial Insert
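
The two write patterns on slides 32 and 33 can be compared with a small model. This sketch assumes the target table merges records per key by keeping the latest non-null value of each field (the behavior of a partial-update style merge); the slides do not state which merge engine is used. All function names, column names, and data are hypothetical.

```python
COLUMNS = ("clicks", "conversions")  # hypothetical wide-table schema


def partial_update(table, key, fields):
    """Merge one record into the row for `key`; None leaves the old value."""
    row = table.setdefault(key, dict.fromkeys(COLUMNS))
    for name, value in fields.items():
        if value is not None:
            row[name] = value


def write_union_all(table, streams):
    """UNION ALL pattern: pad every record to the full schema with None,
    then send all streams through one shared write path."""
    unioned = []
    for column, records in streams.items():
        for key, value in records:
            padded = dict.fromkeys(COLUMNS)  # all columns start as None
            padded[column] = value
            unioned.append((key, padded))
    for key, padded in unioned:
        partial_update(table, key, padded)


def write_partial_insert(table, streams):
    """Partial insert pattern: each stream writes only the columns it owns."""
    for column, records in streams.items():
        for key, value in records:
            partial_update(table, key, {column: value})


# Two made-up source streams, each owning one column of the wide table.
streams = {
    "clicks": [("ad-1", 10), ("ad-2", 5)],
    "conversions": [("ad-1", 2)],
}

via_union, via_partial = {}, {}
write_union_all(via_union, streams)
write_partial_insert(via_partial, streams)
```

Under these assumptions both patterns produce the same merged rows; the practical difference is that the UNION ALL form has to spell out NULL padding for every column a stream does not own, while the partial insert form names only the columns each stream writes.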

34. FLIP-506: Support Reuse Multiple Table Sinks in Planner

36. From Community To Community

39. Large Language Models Are Redefining Software