Exploration and Practice of CPU Vectorization in Data Analytics
1. Exploration and Practice of CPU Vectorization in Data Analytics
Hongbin Ma (马洪宾)
Technical Partner, Kyligence
2. Contents
01 WHY: Why we care about this technology
02 ACADEMIA: Academic background of vectorized engines (in the data field)
03 INDUSTRY: Explorations in industry
04 ROADMAP: Our progress and plans
3. 01
WHY
Why we care about this technology
4. Apache Kylin 4.0
5. Enterprise Product based on Kylin Open Core
6. Spark is great, but not (so) good at
• Interactive query scenarios
• High-concurrency workloads
• Being elastic and cost-effective
• Making full use of modern hardware
7. 02
ACADEMIA
Academic background of vectorized engines (in the data field)
8. Hardware Changes since 2015
           2010            2015             2020
Storage    50 MB/s (HDD)   500 MB/s (SSD)   16 GB/s (NVMe)   10X
Network    1 Gbps          10 Gbps          100 Gbps         10X
CPU        ~3 GHz          ~3 GHz           ~3 GHz           ☹
CPUs continue to be the bottleneck.
How do we achieve next level performance?
© Kyligence Inc. 2021, Confidential.
9. How to compute faster?
• Thread-level parallelism
• Instruction-level parallelism
• Data-level parallelism
10. Instruction-level parallelism
11. What hurts instruction-level parallelism
• Branch mispredictions
• Dependencies between consecutive instructions
• Virtual function calls
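The first of these items can be made concrete with a filter loop. The following is a minimal C++ sketch, not code from the talk: the function names `filter_branchy` and `filter_branch_free` are hypothetical, and the output is a selection vector of matching row positions. The branch-free variant turns the comparison into data (0 or 1 added to the output cursor) instead of control flow, so there is nothing for the branch predictor to miss. The dependency on the cursor `n` from one iteration to the next remains, which is an instance of the second item.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Branchy filter: the `if` depends on the data, so on unsorted input the
// branch predictor may miss often, and each miss flushes the pipeline.
std::size_t filter_branchy(const std::vector<int32_t>& col, int32_t threshold,
                           std::vector<uint32_t>& sel) {
    assert(sel.size() >= col.size());  // caller provides room for every row
    std::size_t n = 0;
    for (std::size_t i = 0; i < col.size(); ++i) {
        if (col[i] > threshold) {
            sel[n++] = static_cast<uint32_t>(i);
        }
    }
    return n;
}

// Branch-free filter: always store the candidate index, then advance the
// output cursor by 0 or 1 depending on the comparison result.
std::size_t filter_branch_free(const std::vector<int32_t>& col, int32_t threshold,
                               std::vector<uint32_t>& sel) {
    assert(sel.size() >= col.size());
    std::size_t n = 0;
    for (std::size_t i = 0; i < col.size(); ++i) {
        sel[n] = static_cast<uint32_t>(i);  // may be overwritten by the next row
        n += static_cast<std::size_t>(col[i] > threshold);
    }
    return n;
}
```

Both functions return the same selection. Which one is faster depends on selectivity, data order, and what the compiler does with the loop, so this should be measured rather than assumed.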
12. Instruction-level parallelism: "Efficiently Compiling Efficient Query Plans for Modern Hardware" (2011)
CODEGEN
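To illustrate what CODEGEN aims for, the sketch below hand-writes the kind of fused loop a compiling engine might emit for a query such as `SELECT SUM(price * qty) FROM t WHERE qty > 10`, next to an interpreted version that calls an opaque function per row. The `Table` layout, the query, and all function names are assumptions made for this example. A real engine would generate such code at query time instead of having it written by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical two-column table, stored column-wise.
struct Table {
    std::vector<int64_t> price;
    std::vector<int64_t> qty;
};

using Predicate = std::function<bool(const Table&, std::size_t)>;
using Projection = std::function<int64_t(const Table&, std::size_t)>;

// Interpreted execution: the plan is data, so every row pays for two
// indirect calls that the compiler cannot inline or reorder across.
int64_t sum_interpreted(const Table& t, const Predicate& pred, const Projection& proj) {
    assert(t.price.size() == t.qty.size());
    int64_t sum = 0;
    for (std::size_t row = 0; row < t.qty.size(); ++row) {
        if (pred(t, row)) sum += proj(t, row);
    }
    return sum;
}

// What generated code for this one query could look like: scan, filter,
// projection and aggregation fused into a single loop with no calls.
int64_t sum_fused(const Table& t) {
    assert(t.price.size() == t.qty.size());
    int64_t sum = 0;
    for (std::size_t row = 0; row < t.qty.size(); ++row) {
        const int64_t q = t.qty[row];  // value can stay in a register
        if (q > 10) sum += t.price[row] * q;
    }
    return sum;
}
```

The two functions compute the same result. The fused version is specific to one query, which is why engines in this family generate it at runtime.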
13. Instruction-level parallelism: "MonetDB/X100: Hyper-Pipelining Query Execution" (2005)
VECTORIZATION
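As a rough illustration of the VECTORIZATION model, the sketch below evaluates `SUM(price * qty) WHERE qty > bound` one batch at a time, using three small primitives that each run a tight loop over arrays. The primitive names (`select_gt`, `map_mul_sel`, `aggr_sum`), the batch size of 1024, and the query are assumptions for this example. They follow the general style of vector-at-a-time engines and are not taken from the paper's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBatchSize = 1024;  // assumption: small enough to stay in cache

// Primitive 1: write positions where in[i] > bound into sel, return the count.
std::size_t select_gt(const int64_t* in, std::size_t n, int64_t bound, uint32_t* sel) {
    std::size_t k = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sel[k] = static_cast<uint32_t>(i);
        k += (in[i] > bound);
    }
    return k;
}

// Primitive 2: out[j] = a[sel[j]] * b[sel[j]] for the selected positions only.
void map_mul_sel(const int64_t* a, const int64_t* b, const uint32_t* sel,
                 std::size_t k, int64_t* out) {
    for (std::size_t j = 0; j < k; ++j) out[j] = a[sel[j]] * b[sel[j]];
}

// Primitive 3: sum of a dense array.
int64_t aggr_sum(const int64_t* in, std::size_t k) {
    int64_t s = 0;
    for (std::size_t j = 0; j < k; ++j) s += in[j];
    return s;
}

// Driver: the per-batch overhead (three calls) is amortized over up to
// kBatchSize rows, while the work itself happens in the simple loops above.
int64_t sum_vectorized(const std::vector<int64_t>& price,
                       const std::vector<int64_t>& qty, int64_t bound) {
    assert(price.size() == qty.size());
    std::vector<uint32_t> sel(kBatchSize);
    std::vector<int64_t> tmp(kBatchSize);
    int64_t total = 0;
    for (std::size_t off = 0; off < qty.size(); off += kBatchSize) {
        const std::size_t n = std::min(kBatchSize, qty.size() - off);
        const std::size_t k = select_gt(qty.data() + off, n, bound, sel.data());
        map_mul_sel(price.data() + off, qty.data() + off, sel.data(), k, tmp.data());
        total += aggr_sum(tmp.data(), k);
    }
    return total;
}
```

Intermediate results (`sel`, `tmp`) are materialized per batch instead of per table, which is the trade-off that distinguishes this model from both row-at-a-time and full-column processing.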
14. Data-level parallelism: "Implementing Database Operations Using SIMD Instructions" (2002)
• Filter/Project
• Aggregation
• Index Search
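The filter and aggregation cases can be sketched without intrinsics by writing the loop lane-wise, the way a SIMD unit would process it. The example below computes `SUM(val) WHERE key > bound`. The lane count of 8, the function name `masked_sum`, and the query are assumptions, and whether a given compiler actually emits SIMD instructions for the inner loop depends on the target and flags. Index search is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 8;  // assumption: eight 32-bit lanes per register

// SUM(val) WHERE key > bound, written so each lane is independent.
int64_t masked_sum(const std::vector<int32_t>& key, const std::vector<int32_t>& val,
                   int32_t bound) {
    assert(key.size() == val.size());
    int64_t acc[kLanes] = {};  // one partial sum per lane
    std::size_t i = 0;
    for (; i + kLanes <= key.size(); i += kLanes) {
        for (std::size_t l = 0; l < kLanes; ++l) {
            // Filter: the comparison yields an all-ones or all-zeros mask,
            // which is how a SIMD compare reports its result.
            const int32_t mask = -static_cast<int32_t>(key[i + l] > bound);
            // Aggregation: rejected rows contribute 0 instead of being skipped.
            acc[l] += val[i + l] & mask;
        }
    }
    int64_t total = 0;
    for (std::size_t l = 0; l < kLanes; ++l) total += acc[l];  // horizontal reduction
    for (; i < key.size(); ++i) {                               // scalar tail
        if (key[i] > bound) total += val[i];
    }
    return total;
}
```

The separate per-lane accumulators also remove the single loop-carried dependency that a scalar running sum would have.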
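15. Benefits of vectorization
• Fewer virtual function calls, which helps the CPU execute out of order and in parallel (IPC > 1)
• Data-level parallelism optimized with SIMD and similar techniques
• Block-at-a-time processing, which raises the CPU cache hit rate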
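The first benefit can be shown by counting calls. The sketch below defines two hypothetical operator interfaces, one returning a row per `next()` call and one returning a batch, and counts how many virtual calls a consumer makes while summing a column. All class and function names are assumptions for this example. With 10,000 rows and a batch size of 1,024, the row interface needs 10,001 calls (including the final end-of-stream call) and the batch interface needs 11.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Tuple-at-a-time interface: one virtual call per row.
struct RowOperator {
    virtual ~RowOperator() = default;
    virtual bool next(int64_t& out) = 0;
};

// Vector-at-a-time interface: one virtual call per batch.
// Returns how many values were written; 0 signals end of stream.
struct BatchOperator {
    virtual ~BatchOperator() = default;
    virtual std::size_t next(int64_t* out, std::size_t capacity) = 0;
};

class RowScan final : public RowOperator {
public:
    explicit RowScan(const std::vector<int64_t>& data) : data_(data) {}
    bool next(int64_t& out) override {
        if (pos_ == data_.size()) return false;
        out = data_[pos_++];
        return true;
    }
private:
    const std::vector<int64_t>& data_;
    std::size_t pos_ = 0;
};

class BatchScan final : public BatchOperator {
public:
    explicit BatchScan(const std::vector<int64_t>& data) : data_(data) {}
    std::size_t next(int64_t* out, std::size_t capacity) override {
        assert(capacity > 0);
        const std::size_t n = std::min(capacity, data_.size() - pos_);
        std::copy_n(data_.begin() + pos_, n, out);
        pos_ += n;
        return n;
    }
private:
    const std::vector<int64_t>& data_;
    std::size_t pos_ = 0;
};

struct Result {
    int64_t sum;
    std::size_t virtual_calls;
};

Result sum_rows(RowOperator& op) {
    Result r{0, 0};
    int64_t v = 0;
    while (true) {
        ++r.virtual_calls;
        if (!op.next(v)) break;
        r.sum += v;
    }
    return r;
}

Result sum_batches(BatchOperator& op, std::size_t batch_size) {
    Result r{0, 0};
    std::vector<int64_t> buf(batch_size);  // reused block, intended to stay in cache
    while (true) {
        ++r.virtual_calls;
        const std::size_t n = op.next(buf.data(), buf.size());
        if (n == 0) break;
        for (std::size_t i = 0; i < n; ++i) r.sum += buf[i];  // tight inner loop
    }
    return r;
}
```

The inner loop in `sum_batches` contains no calls at all, which is also what gives the compiler room to apply SIMD to it.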
16. Other papers
• Vectorization vs. Compilation in Query Execution
• Vectorwise: a Vectorized Analytical DBMS
• Breaking the Memory Wall in MonetDB
• Rethinking SIMD Vectorization for In-Memory Databases
• Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last
17. 03
INDUSTRY
Explorations in industry
18. Photon: Databricks' closed-source vectorized engine
19. Delta Engine
Diagram labels: SQL, Spark DataFrame, Koalas; Query Optimizer; Photon Execution Engine; Caching.
20.
21. Why go to the trouble?
Chart: 30TB TPC-DS, Queries/Hour (higher is better); bars at 110 and 32, labeled "3.3x speedup".
22.
23. Alibaba: EMR SparkSQL Native Codegen Framework
24.
25. Intel: Gazelle Engine
Architecture diagram; labels in the figure: DataFrame; Catalyst Query Plan Optimization; Column Rule For Native Op; Tungsten Physical Plan Execution; Fallback (C2R, R2C); JVM SQL Engine; Operators; Expression Code Gen; Whole Stage Code Gen; Node Tree; UDF; Columnar WSCG; RDD Cache; Scheduler; Columnar Plugin; Native Arrow Data Source; Arrow Dataset (parquet, csv); ColumnBatch; Memory Management; Arrow Compute Engine; Operators; Gandiva JIT; Columnar Shuffle; Exchange Operator; Shuffle Manager; UnsafeRow.
26. Working Model – in Task Thread
Diagram labels: Driver Node; Worker Node (Executor, Block manager, Tasks); Task Thread running Operator 1, Operator 2, Operator 3.
Per-operator check "op.Columnar?": Y goes to the Columnar SQL Engine Plugin (Native Operator Wrapper); N falls back to the JVM SQL Engine (Expression JIT, Whole Stage Code Gen, JVM Operator).
Native path: Operator Wrapper; JNI Bindings; Native Library (Operators, Gandiva Expression JIT, Columnar Native Whole Stage Code Gen); Apache Arrow Library.
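The "op.Columnar?" check with fallback can be sketched as a small planning function. This is an illustration only, written in C++ for consistency, and it is not Gazelle's implementation: the `Op` struct, the `plan_with_fallback` function, and the step labels are hypothetical. It assumes data enters the pipeline in columnar form, and it inserts a C2R (columnar-to-row) or R2C (row-to-columnar) conversion whenever consecutive operators disagree on the format.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical operator descriptor.
struct Op {
    std::string name;
    bool columnar;  // true if a native columnar implementation exists
};

// Walks the operator chain and decides, per operator, between the native
// columnar path and the JVM fallback, adding format conversions as needed.
std::vector<std::string> plan_with_fallback(const std::vector<Op>& ops) {
    std::vector<std::string> steps;
    bool in_columnar = true;  // assumption: the scan hands over columnar batches
    for (const Op& op : ops) {
        assert(!op.name.empty());
        if (op.columnar != in_columnar) {
            steps.push_back(op.columnar ? "R2C" : "C2R");
            in_columnar = op.columnar;
        }
        steps.push_back((op.columnar ? "native:" : "jvm:") + op.name);
    }
    return steps;
}
```

For the chain scan, filter, udf, aggregate, where only `udf` lacks a columnar implementation, the plan becomes native:scan, native:filter, C2R, jvm:udf, R2C, native:aggregate. Each conversion has a cost, so a single unsupported operator in the middle of a pipeline can offset part of the gain.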
27. Performance
Charts (Vanilla Spark vs. Gazelle, lower is better):
• Per Query Elapsed time in TPCH SF1.5T on 3 x Xeon 6252 nodes
• Power Test performance of TPCH SF1.5TB on 3x Xeon 6252 nodes
Chart annotation: 1.68x
• 19 of 22 queries are boosted
• Operators not fully optimized
28. Facebook: Velox
Velox is a C++ database acceleration library which provides reusable, extensible, and high-performance data processing components. These components can be reused to build compute engines focused on different analytical workloads, including batch, interactive, stream processing, and AI/ML… In common usage scenarios, Velox takes a fully optimized query plan as input and performs the described computation. Considering Velox does not provide a SQL parser, a dataframe layer, or a query optimizer, it is usually not meant to be used directly by end-users; rather, it is mostly used by developers integrating and optimizing their compute engines.
29. Facebook: Velox high-level components
• Type: a generic typing system that supports scalar, complex, and nested types, such as structs, maps, arrays, tensors, etc.
• Vector: an Arrow-compatible columnar memory layout module, which provides multiple encodings, such as Flat, Dictionary, Constant, Sequence/RLE, and Bias, in addition to a lazy materialization pattern and support for out-of-order writes.
• Expression Eval: a fully vectorized expression evaluation engine that allows expressions to be efficiently executed on top of Vector/Arrow encoded data.
• Function Packages: sets of vectorized function implementations following the Presto and Spark semantic.
• Operators: implementation of common data processing operators such as scans, projection, filtering, groupBy, orderBy, shuffle, hash join, unnest, and more.
• I/O: a generic connector interface that allows different file formats (ORC/DWRF and Parquet) and storage adapters (S3, HDFS, local files) to be used.
• Network Serializers: an interface where different wire protocols can be implemented, used for network communication, supporting PrestoPage and Spark's UnsafeRow.
• Resource Management: a collection of primitives for handling computational resources, such as memory arenas and buffer management, tasks, drivers, and thread pools for CPU and thread execution, spilling, and caching.
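To give a feel for what multiple vector encodings mean, the sketch below models three of the encodings named above (Flat, Dictionary, Constant) behind one read interface. These classes are hypothetical and deliberately named differently from Velox's own types; they do not reproduce Velox's API, memory layout, or null handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical read-only column of 64-bit integers.
class Ints {
public:
    virtual ~Ints() = default;
    virtual std::size_t size() const = 0;
    virtual int64_t valueAt(std::size_t i) const = 0;
};

// Flat: one stored value per row.
class FlatInts final : public Ints {
public:
    explicit FlatInts(std::vector<int64_t> values) : values_(std::move(values)) {}
    std::size_t size() const override { return values_.size(); }
    int64_t valueAt(std::size_t i) const override {
        assert(i < values_.size());
        return values_[i];
    }
private:
    std::vector<int64_t> values_;
};

// Constant: a single value repeated for every row.
class ConstantInts final : public Ints {
public:
    ConstantInts(int64_t value, std::size_t size) : value_(value), size_(size) {}
    std::size_t size() const override { return size_; }
    int64_t valueAt(std::size_t i) const override {
        assert(i < size_);
        return value_;
    }
private:
    int64_t value_;
    std::size_t size_;
};

// Dictionary: per-row indices into a shared vector of distinct values.
class DictionaryInts final : public Ints {
public:
    DictionaryInts(std::shared_ptr<const Ints> dictionary, std::vector<uint32_t> indices)
        : dictionary_(std::move(dictionary)), indices_(std::move(indices)) {
        assert(dictionary_ != nullptr);
        for (uint32_t idx : indices_) {
            assert(idx < dictionary_->size());
            (void)idx;  // silence unused warning when asserts are disabled
        }
    }
    std::size_t size() const override { return indices_.size(); }
    int64_t valueAt(std::size_t i) const override {
        assert(i < indices_.size());
        return dictionary_->valueAt(indices_[i]);
    }
private:
    std::shared_ptr<const Ints> dictionary_;
    std::vector<uint32_t> indices_;
};

// An operator written against the interface works on any encoding.
int64_t sum_all(const Ints& v) {
    int64_t s = 0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v.valueAt(i);
    return s;
}
```

A dictionary over {100, 200, 300} with indices {2, 0, 0, 1} reads back as 300, 100, 100, 200. A production engine would avoid the per-value virtual call in `sum_all` by specializing hot loops per encoding; the interface here only shows the logical model.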
30. 04
ROADMAP
Our progress and outlook
31. OVERALL ARCHITECTURE
32. Next Steps
• Micro benchmarks based on early integrations
• Further comparison with Spark-Velox integration
• Incrementally roll out vectorized operators & expressions
• Unified columnar batch format
• Other optimizations (e.g., shuffle)
33. Stay tuned
Data & AI Meetup
34. Thank you very much for watching