CPU Vectorization in Data Analytics: Exploration and Practice

1. CPU Vectorization in Data Analytics: Exploration and Practice. Ma Hongbin (马洪宾), Technical Partner, Kyligence

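2. CONTENTS • 01 WHY: Why we care about this technology • 02 ACADEMIA: Academic background of vectorized engines (data domain) • 03 INDUSTRY: Industry explorations • 04 ROADMAP: Our progress and plans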

3. 01 WHY: Why we care about this technology

4. Apache Kylin 4.0

5. Enterprise Product based on Kylin Open Core

6. Spark is great, but not (so) good at: • Interactive query scenarios • High-concurrency workloads • Elasticity and cost-effectiveness • Making full use of modern hardware

7. 02 ACADEMIA: Academic background of vectorized engines (data domain)

8. Hardware Changes since 2015
             2010            2015             2020
   Storage   50 MB/s (HDD)   500 MB/s (SSD)   16 GB/s (NVMe)   10X
   Network   1 Gbps          10 Gbps          100 Gbps         10X
   CPU       ~3 GHz          ~3 GHz           ~3 GHz           ☹
   CPUs continue to be the bottleneck. How do we achieve next-level performance?
   © Kyligence Inc. 2021, Confidential.

9. How to compute faster? • Thread-level parallelism • Instruction-level parallelism • Data-level parallelism

10. Instruction-level parallelism

11. What hurts instruction-level parallelism • Branch mispredictions • Dependencies between consecutive instructions • Virtual function calls
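
A minimal C++ sketch of the first two items on this slide. The function names are hypothetical, and an optimizing compiler may already rewrite the slower variants, so treat it as an illustration of the idea rather than a benchmark.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Branchy: as written, every iteration has a data-dependent branch.
// On unpredictable data the CPU mispredicts often and flushes its pipeline.
int64_t sum_below_branchy(const std::vector<int32_t>& v, int32_t limit) {
    int64_t sum = 0;
    for (int32_t x : v) {
        if (x < limit) sum += x;
    }
    return sum;
}

// Branch-free: the comparison yields 0 or 1 and is used as a multiplier,
// so the loop body contains no data-dependent jump.
int64_t sum_below_branchless(const std::vector<int32_t>& v, int32_t limit) {
    int64_t sum = 0;
    for (int32_t x : v) {
        sum += static_cast<int64_t>(x) * (x < limit);
    }
    return sum;
}

// One accumulator: each addition depends on the result of the previous one.
int64_t sum_single_chain(const std::vector<int32_t>& v) {
    int64_t s = 0;
    for (int32_t x : v) s += x;
    return s;
}

// Four independent accumulators: additions that belong to different
// chains do not depend on each other and can overlap in the pipeline.
int64_t sum_four_chains(const std::vector<int32_t>& v) {
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // leftover elements
    return s0 + s1 + s2 + s3;
}
```

Both pairs return the same result; only the shape of the work handed to the CPU differs.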

12. Instruction-level parallelism: "Efficiently Compiling Efficient Query Plans for Modern Hardware" (2011): CODEGEN
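
A rough sketch of the code-generation idea for one hypothetical query, SELECT SUM(a * b) WHERE a > 10. The Row layout, the expression classes, and the function names are made up for illustration and are not taken from the paper. The fused function is hand-written here to resemble what a query compiler might emit.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct Row { int32_t a; int32_t b; };

// Interpreted plan: every expression node is evaluated through a virtual call.
struct Expr {
    virtual ~Expr() = default;
    virtual int64_t eval(const Row& row) const = 0;
};
struct ColA : Expr {
    int64_t eval(const Row& row) const override { return row.a; }
};
struct ColB : Expr {
    int64_t eval(const Row& row) const override { return row.b; }
};
struct Const : Expr {
    int64_t value;
    explicit Const(int64_t v) : value(v) {}
    int64_t eval(const Row&) const override { return value; }
};
struct Mul : Expr {
    std::unique_ptr<Expr> lhs, rhs;
    Mul(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    int64_t eval(const Row& row) const override {
        return lhs->eval(row) * rhs->eval(row);
    }
};
struct Gt : Expr {
    std::unique_ptr<Expr> lhs, rhs;
    Gt(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    int64_t eval(const Row& row) const override {
        return lhs->eval(row) > rhs->eval(row) ? 1 : 0;
    }
};

// Tuple-at-a-time interpretation: several virtual calls per row.
int64_t run_interpreted(const std::vector<Row>& rows,
                        const Expr& predicate, const Expr& projection) {
    int64_t sum = 0;
    for (const Row& row : rows) {
        if (predicate.eval(row) != 0) sum += projection.eval(row);
    }
    return sum;
}

// Fused loop for this specific query: filter, projection, and aggregation
// are collapsed into one loop with no virtual calls.
int64_t run_fused(const std::vector<Row>& rows) {
    int64_t sum = 0;
    for (const Row& row : rows) {
        if (row.a > 10) sum += static_cast<int64_t>(row.a) * row.b;
    }
    return sum;
}
```

The interpreted version works for any expression tree; the fused version only answers this one query, which is why it has to be generated per query.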

13. Instruction-level parallelism: "MonetDB/X100: Hyper-Pipelining Query Execution" (2005): VECTORIZATION
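
A rough sketch of vector-at-a-time execution in the X100 style, for the hypothetical query SELECT SUM(a * b) WHERE a > 10. The primitive names, the batch size of 1024, and the hard-coded predicate are assumptions for illustration, not details from the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBatchSize = 1024;  // assumed batch ("vector") size

// Primitive 1: write the positions where col[i] > bound into sel and
// return how many positions qualified. The loop body is branch-free.
std::size_t select_greater_than(const int32_t* col, std::size_t n,
                                int32_t bound, uint32_t* sel) {
    std::size_t k = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sel[k] = static_cast<uint32_t>(i);
        k += static_cast<std::size_t>(col[i] > bound);
    }
    return k;
}

// Primitive 2: multiply only the selected positions of two columns.
void multiply_selected(const int32_t* a, const int32_t* b,
                       const uint32_t* sel, std::size_t k, int64_t* out) {
    for (std::size_t j = 0; j < k; ++j) {
        out[j] = static_cast<int64_t>(a[sel[j]]) * b[sel[j]];
    }
}

// Primitive 3: sum a dense array of values.
int64_t sum_values(const int64_t* values, std::size_t k) {
    int64_t s = 0;
    for (std::size_t j = 0; j < k; ++j) s += values[j];
    return s;
}

// Driver: one call per primitive per batch instead of one call per row.
// The per-batch buffers are reused so they can stay in the CPU cache.
int64_t run_vectorized(const std::vector<int32_t>& a,
                       const std::vector<int32_t>& b) {
    assert(a.size() == b.size());
    std::vector<uint32_t> sel(kBatchSize);
    std::vector<int64_t> products(kBatchSize);
    int64_t total = 0;
    for (std::size_t off = 0; off < a.size(); off += kBatchSize) {
        const std::size_t n = std::min(kBatchSize, a.size() - off);
        const std::size_t k =
            select_greater_than(a.data() + off, n, 10, sel.data());
        multiply_selected(a.data() + off, b.data() + off,
                          sel.data(), k, products.data());
        total += sum_values(products.data(), k);
    }
    return total;
}
```

Each primitive is a tight loop over plain arrays, which keeps interpretation overhead per value low and leaves the compiler free to unroll or auto-vectorize it.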

14. Data-level parallelism: "Implementing Database Operations Using SIMD Instructions" (2002) • Filter/Project • Aggregation • Index Search
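
A small sketch of a SIMD filter that counts the values below a limit, four 32-bit integers per instruction. It assumes an x86 target where the compiler defines `__SSE2__` (GCC and Clang do on x86-64) and otherwise falls back to a scalar loop. The function name is hypothetical, and the paper's own code may look different.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Counts the elements of data[0..n) that are strictly less than limit.
std::size_t simd_count_less_than(const int32_t* data, std::size_t n,
                                 int32_t limit) {
    std::size_t count = 0;
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i vlimit = _mm_set1_epi32(limit);  // limit in all 4 lanes
    for (; i + 4 <= n; i += 4) {
        // Unaligned load of four consecutive 32-bit integers.
        const __m128i v =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        // Signed compare: a lane becomes all ones where v < limit.
        const __m128i lt = _mm_cmplt_epi32(v, vlimit);
        // Collapse the four lanes into a 4-bit mask, one bit per lane.
        const int mask = _mm_movemask_ps(_mm_castsi128_ps(lt));
        // Add the number of set bits in the mask.
        count += static_cast<std::size_t>((mask & 1) + ((mask >> 1) & 1) +
                                          ((mask >> 2) & 1) + ((mask >> 3) & 1));
    }
#endif
    // Scalar tail (or the whole input when SSE2 is not available).
    for (; i < n; ++i) count += static_cast<std::size_t>(data[i] < limit);
    return count;
}
```

The same compare-then-mask pattern extends to wider registers (AVX2, AVX-512), where each instruction covers more values.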

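15. Benefits of vectorization • Fewer virtual function calls, which helps the CPU execute instructions out of order and in parallel (IPC > 1) • SIMD and related techniques exploit data-level parallelism • Processing data block by block raises the CPU cache hit rate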

16. Other papers • Vectorization vs. Compilation in Query Execution • Vectorwise: a Vectorized Analytical DBMS • Breaking the Memory Wall in MonetDB • Rethinking SIMD Vectorization for In-Memory Databases • Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last

17. 03 INDUSTRY: Industry explorations

18. Photon: Databricks' closed-source vectorized engine

19. [Diagram: Delta Engine. Labels: SQL, Spark DataFrame, Koalas, Query Optimizer, Photon Execution Engine, Caching]

20.

21. Why go to the trouble? [Chart: 30TB TPC-DS, Queries/Hour (higher is better). Bar values: 110 and 32; 3.3x speedup]

22.

23. Alibaba: EMR SparkSQL Native Codegen Framework

24.

25. Intel: Gazelle Engine [Architecture diagram. Labels: DataFrame; Catalyst; Query Plan Optimization; Column Rule For Native Op; Tungsten; Physical Plan Execution; Fallback (C2R, R2C); JVM SQL Engine; Operators; Expression Code Gen; Whole Stage Code Gen; Node Tree; UDF; Columnar WSCG; RDD Cache; Scheduler; Columnar Plugin; Native Arrow Data Source; Arrow Dataset (parquet, csv); ColumnBatch; Memory Management; Arrow Compute Engine; Gandiva JIT; Columnar Shuffle; Exchange Operator; Shuffle Manager; UnsafeRow]

26. Working Model – in Task Thread [Diagram. Labels: Driver Node; Worker Node; Executor; Block manager; Task; Task Thread; Operator 1, Operator 2, Operator 3; decision "op.Columnar?" (Y/N); Fallback; JVM SQL Engine; Columnar SQL Engine Plugin; Native Operator Wrapper; JVM Operator; Expression JIT; Whole Stage Code Gen; JNI Bindings; Native Library; Operators; Gandiva Expression JIT; Columnar Native Whole Stage Code Gen; Apache Arrow Library]

27. Performance [Chart: per-query elapsed time in TPC-H SF1.5T on 3 x Xeon 6252 nodes, Vanilla Spark vs. Gazelle (lower is better); q17 is labeled. Chart: power test performance of TPC-H SF1.5TB on 3 x Xeon 6252 nodes, Vanilla Spark vs. Gazelle (lower is better); 1.68x] • 19 of 22 queries are boosted • Operators not fully optimized

28. Facebook: Velox • Velox is a C++ database acceleration library which provides reusable, extensible, and high-performance data processing components. These components can be reused to build compute engines focused on different analytical workloads, including batch, interactive, stream processing, and AI/ML… In common usage scenarios, Velox takes a fully optimized query plan as input and performs the described computation. Considering Velox does not provide a SQL parser, a dataframe layer, or a query optimizer, it is usually not meant to be used directly by end-users; rather, it is mostly used by developers integrating and optimizing their compute engines.

29. Facebook: Velox high-level components
   • Type: a generic typing system that supports scalar, complex, and nested types, such as structs, maps, arrays, tensors, etc.
   • Vector: an Arrow-compatible columnar memory layout module, which provides multiple encodings, such as Flat, Dictionary, Constant, Sequence/RLE, and Bias, in addition to a lazy materialization pattern and support for out-of-order writes.
   • Expression Eval: a fully vectorized expression evaluation engine that allows expressions to be efficiently executed on top of Vector/Arrow encoded data.
   • Function Packages: sets of vectorized function implementations following the Presto and Spark semantics.
   • Operators: implementation of common data processing operators such as scans, projection, filtering, groupBy, orderBy, shuffle, hash join, unnest, and more.
   • I/O: a generic connector interface that allows different file formats (ORC/DWRF and Parquet) and storage adapters (S3, HDFS, local files) to be used.
   • Network Serializers: an interface where different wire protocols can be implemented, used for network communication, supporting PrestoPage and Spark's UnsafeRow.
   • Resource Management: a collection of primitives for handling computational resources, such as memory arenas and buffer management, tasks, drivers, and thread pools for CPU and thread execution, spilling, and caching.

30. 04 ROADMAP: Our progress and outlook

31. OVERALL ARCHITECTURE

32. Next Steps • Micro-benchmarks based on early integrations • Further comparison with the Spark-Velox integration • Incremental rollout of vectorized operators and expressions • Unified columnar batch format • Other optimizations (e.g., shuffle)

33. Stay tuned Data & AI Meetup

34. Thank you very much for watching