ACID ORC, Iceberg and Delta Lake
1. ACID ORC, Iceberg and Delta Lake
an overview of table formats for large scale storage and analytics
Michal Gancarski
michal.gancarski@zalando.de
wssbck
17-10-2019
2. TABLE OF CONTENTS
All Is Not Well In The Land Of Big Data
There Is Hope, However
This Is How We Do It
Moving Forward
3. All Is Not Well In The Land Of Big Data
4. ACID Properties
5. Single Node Database
6. Distributed Database
7. Distributed Data Infrastructure
8. Lost ACID
9. There Is Hope, However
10. A Table Format?
11. ACID ORC
12. ACID ORC
CREATE TABLE d_manufacturers (id int, name string)
PARTITIONED BY (country string)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
./d_manufacturers/country=de/base_00000002/
-- bucket_00000
-- bucket_00001
./d_manufacturers/country=de/delta_0000003_0000003_0000/
-- bucket_00000
-- bucket_00001
./d_manufacturers/country=de/delta_0000004_0000004_0000/
-- bucket_00001
./d_manufacturers/country=de/delete_delta_0000004_0000004_0000/
-- bucket_00001
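The layout above can be read as one base directory plus the deltas written after it. The following is a minimal sketch of that selection step in plain Scala, assuming the directory naming shown on the slide. `AcidDir`, `AcidLayout`, `parse` and `readSet` are hypothetical names invented for illustration, not Hive APIs, and the real reader does more (for example, it also checks which write IDs are visible to the reading transaction).

```scala
// Simplified model of the directory kinds found in one ACID ORC partition.
// All type and method names are hypothetical and for illustration only.
sealed trait AcidDir { def name: String }
final case class Base(name: String, writeId: Long) extends AcidDir
final case class Delta(name: String, minWriteId: Long, maxWriteId: Long, isDelete: Boolean)
    extends AcidDir

object AcidLayout {
  private val BasePattern  = """base_(\d+)""".r
  private val DeltaPattern = """(delete_delta|delta)_(\d+)_(\d+)(?:_\d+)?""".r

  // Classify a directory name; names that match neither pattern are ignored.
  def parse(name: String): Option[AcidDir] = name match {
    case BasePattern(id) => Some(Base(name, id.toLong))
    case DeltaPattern(kind, min, max) =>
      Some(Delta(name, min.toLong, max.toLong, isDelete = kind == "delete_delta"))
    case _ => None
  }

  // Keep the newest base, then every delta written after it,
  // ordered by write id with insert deltas before delete deltas.
  def readSet(names: Seq[String]): Seq[AcidDir] = {
    val dirs  = names.flatMap(parse)
    val base  = dirs.collect { case b: Base => b }.sortBy(_.writeId).lastOption
    val floor = base.map(_.writeId).getOrElse(0L)
    val deltas = dirs
      .collect { case d: Delta if d.minWriteId > floor => d }
      .sortBy(d => (d.minWriteId, d.isDelete))
    base.toSeq ++ deltas
  }
}

object AcidLayoutDemo {
  def main(args: Array[String]): Unit = {
    // Directory names copied from the country=de partition on the slide.
    val partition = Seq(
      "base_00000002",
      "delta_0000003_0000003_0000",
      "delta_0000004_0000004_0000",
      "delete_delta_0000004_0000004_0000"
    )
    AcidLayout.readSet(partition).foreach(dir => println(dir.name))
  }
}
```

Because updates and deletes only add new delta directories, no existing file is rewritten at write time; the cost moves to readers, which must merge these directories until compaction folds them into a new base.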
13. ACID ORC
+
❖ Native compatibility with Hive
❖ Fast updates / upserts (no file rewrite)
❖ Hive 2.x ACID ORC tables can be converted to Hive 3.x ACID ORC tables
❖ Commercial Support (Cloudera)
-
❖ Limited support for Spark (being worked on by Qubole)
❖ Slow listing and metadata discovery
❖ Potentially slower read due to ad-hoc compaction
❖ ORC only
❖ Mandatory S3Guard or EMR with consistent view enabled
14. Apache Iceberg
15. Apache Iceberg
val df = spark.read
  .format("iceberg")
  .load("s3://datalake/d_manufacturers")
16. Apache Iceberg
+
❖ Parquet, Avro, ORC supported as file formats
❖ Robust schema and partitioning changes
❖ Fast query planning
❖ Presto connector
❖ Time travel with snapshot id listing
❖ No dependency on Spark
public List<Snapshot> snapshots() {
  return snapshots;
}
-
❖ Spark support
❖ Sparse documentation
❖ No commercial support
❖ Not as mature as other formats
17. Delta Lake
18. Delta Lake
val df = spark.read
  .format("delta")
  .load("s3://datalake/d_manufacturers")
CONVERT TO DELTA parquet.`s3://datalake/d_manufacturers`
./d_manufacturers/_delta_log/
-- 000000.json
-- ...
-- 000010.checkpoint.parquet
-- _latest_checkpoint
./d_manufacturers/country=de/
-- file_1.parquet
-- file_2.parquet
./d_manufacturers/country=fr/
-- file_3.parquet
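The `_delta_log` directory is what turns the plain Parquet files into a table: each commit records which data files were added and which were removed, and a reader replays those commits to find the live files at a given version. The following is a toy sketch of that replay in plain Scala. `Commit`, `LogReplay` and `filesAt` are hypothetical names, `file_4.parquet` is an invented file, and the real log stores richer JSON actions and uses checkpoints to shorten the replay.

```scala
// Toy model of a transaction log: each commit adds and/or removes data files.
// The types below are hypothetical; they do not mirror the real log schema.
final case class Commit(version: Long, added: Set[String], removed: Set[String])

object LogReplay {
  // Replay commits in version order, up to and including `asOf`,
  // to obtain the set of data files that make up the table at that version.
  def filesAt(log: Seq[Commit], asOf: Long): Set[String] =
    log
      .filter(_.version <= asOf)
      .sortBy(_.version)
      .foldLeft(Set.empty[String]) { (live, commit) =>
        (live -- commit.removed) ++ commit.added
      }
}

object LogReplayDemo {
  def main(args: Array[String]): Unit = {
    val log = Seq(
      Commit(0, added = Set("country=de/file_1.parquet"), removed = Set.empty),
      Commit(1,
        added = Set("country=de/file_2.parquet", "country=fr/file_3.parquet"),
        removed = Set.empty),
      // An update that rewrites file_1 into a new (invented) file_4.
      Commit(2,
        added = Set("country=de/file_4.parquet"),
        removed = Set("country=de/file_1.parquet"))
    )
    // Reading "as of" an older version is the idea behind time travel.
    println(LogReplay.filesAt(log, asOf = 1).toSeq.sorted.mkString(", "))
    println(LogReplay.filesAt(log, asOf = 2).toSeq.sorted.mkString(", "))
  }
}
```

Removed files stay on storage until they are cleaned up, which is why older versions remain readable for a while.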
19. Delta Lake
+
❖ Great integration with Spark, including Structured Streaming
❖ Merge syntax in Spark SQL
❖ Time travel
❖ Comprehensive, well written documentation
❖ Fast development backed by a commercial entity
❖ VACUUM + OPTIMIZE
❖ Incoming Presto reader (Starburst)
-
❖ Parquet only
❖ Multicluster writes outside of Databricks only on HDFS
20. This Is How We Do It
21. Delta Lake @Zalando
22. Moving Forward
23. The Future is Bright
24. Further Reading
ACID ORC
https://orc.apache.org/docs/acid.html
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
http://shzhangji.com/blog/2019/06/10/understanding-hive-acid-transactional-table/
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.0/using-hiveql/content/hive_3_internals.html
Iceberg
https://iceberg.apache.org/
https://iceberg.apache.org/spec/
https://github.com/apache/incubator-iceberg
https://www.youtube.com/watch?v=z7p_m17BXs8
https://www.youtube.com/watch?v=nWwQMlrjhy0
Delta Lake
https://delta.io/
https://github.com/delta-io
https://github.com/delta-io/delta/blob/master/PROTOCOL.md
https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
https://databricks.com/blog/2019/09/24/diving-into-delta-lake-schema-enforcement-evolution.html
25. Further Reading
Engine Support
https://github.com/prestosql/presto/issues/576
https://github.com/prestosql/presto/issues/1324
https://github.com/prestosql/presto/pull/1067
https://docs.databricks.com/delta/presto-compatibility.html
https://www.starburstdata.com/technical-blog/starburst-presto-databricks-delta-lake-support/
https://www.qubole.com/blog/qubole-open-sources-multi-engine-support-for-updates-and-deletes-in-data-lakes/
https://github.com/qubole/spark-acid
S3 Consistency
https://issues.apache.org/jira/browse/HADOOP-13345
https://hadoop.apache.org/docs/r3.0.3/hadoop-aws/tools/hadoop-aws/s3guard.html
Other
https://www.postgresql.org/docs/current/storage.html
https://www.postgresql.org/docs/current/routine-vacuuming.html
https://dev.mysql.com/doc/refman/8.0/en/optimize-table.html
https://medium.com/@brunocrt/the-distributed-architecture-behind-cassandra-database-fba8b5cc4785
https://github.com/delta-io/delta/issues/41
26. ACID ORC, Iceberg and Delta Lake
Michal Gancarski
michal.gancarski@zalando.de
wssbck