ACID ORC, Iceberg and Delta Lake

1. ACID ORC, Iceberg and Delta Lake: an overview of table formats for large-scale storage and analytics. Michal Gancarski, michal.gancarski@zalando.de, wssbck, 17-10-2019
2. Table of Contents
    All Is Not Well In The Land Of Big Data
    There Is Hope, However
    This Is How We Do It
    Moving Forward
3. All Is Not Well In The Land Of Big Data
4. ACID Properties
5. Single Node Database
6. Distributed Database
7. Distributed Data Infrastructure
8. Lost ACID
9. There Is Hope, However
10. A Table Format?
11. ACID ORC
12. ACID ORC

    CREATE TABLE d_manufacturers (id int, name string)
    PARTITIONED BY (country string)
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    ./d_manufacturers/country=de/base_00000002/
      -- bucket_00000
      -- bucket_00001
    ./d_manufacturers/country=de/delta_0000003_0000003_0000/
      -- bucket_00000
      -- bucket_00001
    ./d_manufacturers/country=de/delta_0000004_0000004_0000/
      -- bucket_00001
    ./d_manufacturers/country=de/delete_delta_0000004_0000004_0000/
      -- bucket_00001
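The directory names in the listing above encode write ids: a base directory holds compacted data up to a write id, and delta / delete_delta directories hold later inserts and deletes. The following is a minimal Scala sketch of how a reader might choose which directories to scan. The object and method names (AcidDirs, parse, readSet) are hypothetical, and the logic deliberately ignores open or aborted transactions, which a real reader also has to account for.

```scala
object AcidDirs {
  // Simplified model of the directory kinds shown in the listing (hypothetical types).
  sealed trait AcidDir { def name: String }
  final case class Base(name: String, writeId: Long) extends AcidDir
  final case class Delta(name: String, minWriteId: Long, maxWriteId: Long, isDelete: Boolean)
      extends AcidDir

  private val BasePattern  = """base_(\d+)""".r
  private val DeltaPattern = """(delete_)?delta_(\d+)_(\d+)(?:_\d+)?""".r

  // Parse a directory name; anything unrecognised yields None.
  def parse(name: String): Option[AcidDir] = name match {
    case BasePattern(id) =>
      Some(Base(name, id.toLong))
    case DeltaPattern(del, min, max) =>
      // The optional "delete_" group is null when the prefix is absent.
      Some(Delta(name, min.toLong, max.toLong, del != null))
    case _ => None
  }

  // Keep the newest base plus every delta whose write ids come after it,
  // ordered by write id with inserts before deletes.
  def readSet(names: Seq[String]): Seq[String] = {
    val dirs       = names.flatMap(parse)
    val latestBase = dirs.collect { case b: Base => b }.sortBy(_.writeId).lastOption
    val floor      = latestBase.map(_.writeId).getOrElse(0L)
    val deltas = dirs
      .collect { case d: Delta if d.minWriteId > floor => d }
      .sortBy(d => (d.minWriteId, d.isDelete))
    latestBase.map(_.name).toSeq ++ deltas.map(_.name)
  }
}
```

Applied to the partition shown on the slide, readSet would return the base directory followed by the three delta directories. A delta older than the base (for example a hypothetical delta_0000001_0000001_0000) would be skipped, because compaction has already folded it into the base.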
13. ACID ORC
    Pros (+)
    ❖ Native compatibility with Hive
    ❖ Fast updates / upserts (no file rewrite)
    ❖ Hive 2.x ACID ORC tables can be converted to Hive 3.x ACID ORC tables
    ❖ Commercial support (Cloudera)
    Cons (-)
    ❖ Limited support for Spark (being worked on by Qubole)
    ❖ Slow listing and metadata discovery
    ❖ Potentially slower reads due to ad-hoc compaction
    ❖ ORC only
    ❖ Mandatory S3Guard or EMR with consistent view enabled
14. Apache Iceberg
15. Apache Iceberg

    val df = spark.read
      .format("iceberg")
      .load("s3://datalake/d_manufacturers")
16. Apache Iceberg
    Pros (+)
    ❖ Parquet, Avro, ORC supported as file formats
    ❖ Robust schema and partitioning changes
    ❖ Fast query planning
    ❖ Presto connector
    ❖ Time travel with snapshot id listing
    ❖ No dependency on Spark

    public List<Snapshot> snapshots() { return snapshots; }

    Cons (-)
    ❖ Spark support
    ❖ Sparse documentation
    ❖ No commercial support
    ❖ Not as mature as other formats
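To illustrate what "time travel with snapshot id listing" amounts to, here is a small Scala model: given a list of snapshots, pick the one that was current at a requested time. The Snapshot case class and the asOf helper are hypothetical stand-ins written for this sketch, not Iceberg's own classes, and the model assumes a linear history without rollbacks.

```scala
object SnapshotSketch {
  // Hypothetical stand-in for a table snapshot: an id plus its commit time.
  final case class Snapshot(id: Long, timestampMillis: Long)

  // Return the most recent snapshot committed at or before the given time,
  // or None if the table had no snapshot yet at that point.
  def asOf(snapshots: Seq[Snapshot], timestampMillis: Long): Option[Snapshot] =
    snapshots
      .filter(_.timestampMillis <= timestampMillis)
      .sortBy(_.timestampMillis)
      .lastOption
}
```

A reader that resolves a timestamp to a snapshot id this way can then load exactly the data files that snapshot references, which is what makes reads repeatable.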
17. Delta Lake
18. Delta Lake

    val df = spark.read
      .format("delta")
      .load("s3://datalake/d_manufacturers")

    CONVERT TO DELTA parquet.`s3://datalake/d_manufacturers`

    ./d_manufacturers/_delta_log/
      -- 000000.json
      -- ...
      -- 000010.checkpoint.parquet
      -- _last_checkpoint
    ./d_manufacturers/country=de/
      -- file_1.parquet
      -- file_2.parquet
    ./d_manufacturers/country=fr/
      -- file_3.parquet
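The _delta_log directory on this slide is an ordered series of commits, each recording which data files were added or removed. The current table state is obtained by replaying those commits, and a checkpoint is a saved copy of that state. Below is a minimal in-memory Scala sketch of the replay idea. The types (AddFile, RemoveFile, Commit) and the example log contents are assumptions made for illustration. They are not read from a real log, and the real protocol carries more action types and metadata.

```scala
object DeltaLogSketch {
  // Hypothetical, simplified versions of the log's add/remove actions.
  sealed trait Action
  final case class AddFile(path: String)    extends Action
  final case class RemoveFile(path: String) extends Action

  // One commit corresponds to one numbered JSON file in _delta_log
  // (modelled in memory here rather than parsed from disk).
  final case class Commit(version: Long, actions: Seq[Action])

  // Replay commits in version order up to asOfVersion and return the
  // set of data files that are live at that version.
  def activeFiles(log: Seq[Commit], asOfVersion: Long = Long.MaxValue): Set[String] =
    log
      .filter(_.version <= asOfVersion)
      .sortBy(_.version)
      .flatMap(_.actions)
      .foldLeft(Set.empty[String]) {
        case (files, AddFile(p))    => files + p
        case (files, RemoveFile(p)) => files - p
      }

  // Invented example history using the file names from the slide.
  val exampleLog: Seq[Commit] = Seq(
    Commit(0, Seq(AddFile("country=de/file_1.parquet"))),
    Commit(1, Seq(AddFile("country=de/file_2.parquet"), AddFile("country=fr/file_3.parquet"))),
    Commit(2, Seq(RemoveFile("country=de/file_1.parquet")))
  )
}
```

Calling activeFiles(exampleLog) gives the latest state, while activeFiles(exampleLog, 1) replays only up to version 1. That second form is the essence of time travel: older versions stay readable as long as the removed data files have not yet been physically deleted.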
19. Delta Lake
    Pros (+)
    ❖ Great integration with Spark, including Structured Streaming
    ❖ Merge syntax in Spark SQL
    ❖ Time travel
    ❖ Comprehensive, well-written documentation
    ❖ Fast development backed by a commercial entity
    ❖ VACUUM + OPTIMIZE
    ❖ Incoming Presto reader (Starburst)
    Cons (-)
    ❖ Parquet only
    ❖ Multicluster writes outside of Databricks only on HDFS
20. This Is How We Do It
21. Delta Lake @Zalando
22. Moving Forward
23. The Future is Bright
24. Further Reading
    ACID ORC
    ❖ https://orc.apache.org/docs/acid.html
    ❖ https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
    ❖ http://shzhangji.com/blog/2019/06/10/understanding-hive-acid-transactional-table/
    ❖ https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.0/using-hiveql/content/hive_3_internals.html
    Iceberg
    ❖ https://iceberg.apache.org/
    ❖ https://iceberg.apache.org/spec/
    ❖ https://github.com/apache/incubator-iceberg
    ❖ https://www.youtube.com/watch?v=z7p_m17BXs8
    ❖ https://www.youtube.com/watch?v=nWwQMlrjhy0
    Delta Lake
    ❖ https://delta.io/
    ❖ https://github.com/delta-io
    ❖ https://github.com/delta-io/delta/blob/master/PROTOCOL.md
    ❖ https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
    ❖ https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
25. Further Reading
    Engine Support
    ❖ https://github.com/prestosql/presto/issues/576
    ❖ https://github.com/prestosql/presto/issues/1324
    ❖ https://github.com/prestosql/presto/pull/1067
    ❖ https://docs.databricks.com/delta/presto-compatibility.html
    ❖ https://www.starburstdata.com/technical-blog/starburst-presto-databricks-delta-lake-support/
    ❖ https://www.qubole.com/blog/qubole-open-sources-multi-engine-support-for-updates-and-deletes-in-data-lakes/
    ❖ https://github.com/qubole/spark-acid
    S3 Consistency
    ❖ https://issues.apache.org/jira/browse/HADOOP-13345
    ❖ https://hadoop.apache.org/docs/r3.0.3/hadoop-aws/tools/hadoop-aws/s3guard.html
    Other
    ❖ https://www.postgresql.org/docs/current/storage.html
    ❖ https://www.postgresql.org/docs/current/routine-vacuuming.html
    ❖ https://dev.mysql.com/doc/refman/8.0/en/optimize-table.html
    ❖ https://medium.com/@brunocrt/the-distributed-architecture-behind-cassandra-database-fba8b5cc4785
    ❖ https://github.com/delta-io/delta/issues/41
26. ACID ORC, Iceberg and Delta Lake. Michal Gancarski, michal.gancarski@zalando.de, wssbck
