Accelerating AI Inference and Retrieval-Augmented Generation: a 1000x Parquet Query Performance Boost on Petabyte-Scale Data Lakes
1. Accelerating AI Inference and Retrieval-Augmented Generation: a 1000x Parquet Query Performance Boost on Petabyte-Scale Data Lakes
Bin Fan,
VP of Technology @Alluxio
binfan@alluxio.com
2.
3. 01
The Hook
4. The Challenge: Sub-Millisecond Point Lookups on Petabyte Data Lakes?
Executing point lookup queries like
“SELECT ID, DATA FROM TABLE WHERE ID = 123”
over:
• a partitioned Iceberg data lake (Parquet)
• tens or even hundreds of PB
• on object stores (e.g., S3)
• within sub-millisecond latency
(Diagram: Apps → Data Lake)
5. Why This Matters
- Agentic Memory: AI agents require instant recall of vast historical knowledge and context.
- Online Feature Store: real-time inference demands immediate access to fresh, relevant features.
- Real-Time Personalization & Recommendation: delivering personalized experiences in milliseconds is key to user engagement and conversion.
These use cases are driving the need for extremely low-latency access to large-scale data.
6. Common Approaches & Their Limitations: OLAP Engines
How it works: execute point lookup queries directly against S3 Parquet via an OLAP engine.
Pros:
- Mature ecosystem, well supported.
- Handles complex analytics.
Cons:
- Overkill: heavyweight for simple key-value lookups.
- High latency & limited concurrency: query planning, scheduling, and full Parquet scan overheads make sub-millisecond latency unachievable.
(Diagram: Agentic Apps → Query Engine → Data Lake)
7. Common Approaches & Their Limitations: In-Memory KV Stores
How it works: export tables or the relevant data portions into an in-memory KV store.
Pros:
- Low latency: fast key-value access.
Cons:
- Prohibitive cost at scale: extremely expensive to fit petabytes of data into memory.
- Data sync complexity & staleness: requires ETL pipelines, leading to data lag and consistency issues (the "Dual-Store Problem").
- Operational overhead: managing two separate data systems.
(Diagram: Agentic Apps → In-Memory KV Store, populated by a data copy/import from the Data Lake)
8. "Parquet on S3": Why Is It Not Natively Sub-Millisecond?
How it works: Parquet files are self-indexed, so can we point-query them directly?
Potential latency bottlenecks:
● Multiple S3 object store round-trip times (RTTs).
● Row-group/page decoding overhead.
Result: p95 latency is typically > 300ms, far from sub-millisecond.
(Diagram: Apps → Data Lake)
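To see why the round trips dominate, the dependency chain of a naive point lookup can be sketched schematically. This is a simplified stand-in, not real Parquet parsing: `RangeReadCounter` is a hypothetical in-memory object that just serves range GETs and counts them. The point is that each read needs the result of the previous one, so the round trips cannot be pipelined.

```python
class RangeReadCounter:
    """Hypothetical in-memory stand-in for an S3 object: serves range
    GETs from a byte string and counts the round trips."""
    def __init__(self, blob: bytes):
        self.blob = blob
        self.rtts = 0

    def read(self, offset: int, length: int) -> bytes:
        self.rtts += 1  # each range GET is one network round trip to S3
        return self.blob[offset:offset + length]


def naive_point_lookup(obj: RangeReadCounter) -> bytes:
    # RTT 1: read the footer at the end of the file to locate the metadata.
    footer = obj.read(len(obj.blob) - 8, 8)
    # RTT 2: read the column/offset indexes that the footer points at.
    index = obj.read(0, 8)
    # RTT 3: read the data page the index points at, then decode it.
    page = obj.read(8, 8)
    return page
```

With tens to hundreds of milliseconds per S3 round trip, three sequential dependent reads alone explain a p95 well past 300ms, before any decoding cost.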
9. 02
Alluxio's Philosophy
& Core Tech
10. Introducing Alluxio: Open-Sourced from UC Berkeley AMPLab in 2014
● 1,200+ contributors & growing
● 10,000+ Slack community members
● Top 10 most critical Java-based open source projects
● Top 100 most valuable repositories out of 96 million on GitHub
11. Key Feature: Unified Interface to Storage
(Diagram: the Alluxio namespace "/" mounts an on-prem data warehouse at hdfs://service/salesdata and AWS us-east-1 at s3://bucket/Users, exposing Data/Sales/Reports and Users/Alice/Bob under one tree.)
● Alluxio can be viewed as a logical file system.
○ Multiple different storage services can be mounted into the same logical Alluxio namespace.
● An Alluxio path is mapped to a persistent storage address.
○ alluxio:///Data/Sales ⇔ hdfs://service/salesdata/Sales
● Users can pick either the logical Alluxio path or the original storage address.
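The path mapping above can be sketched as a longest-prefix match against a mount table. This is an illustrative sketch, not Alluxio's implementation; the mount entries come from the slide's example (hdfs://service/salesdata and s3://bucket/Users):

```python
# Mount table: logical Alluxio prefix -> underlying storage address.
MOUNT_TABLE = {
    "/Data": "hdfs://service/salesdata",
    "/Users": "s3://bucket/Users",
}

def resolve(alluxio_path: str) -> str:
    """Map a logical Alluxio path to its persistent storage address
    via longest-prefix match, so nested mounts win over shallower ones."""
    for mount, ufs in sorted(MOUNT_TABLE.items(),
                             key=lambda kv: len(kv[0]), reverse=True):
        if alluxio_path == mount or alluxio_path.startswith(mount + "/"):
            return ufs + alluxio_path[len(mount):]
    raise KeyError(f"no mount covers {alluxio_path}")
```

For example, `resolve("/Data/Sales")` yields `"hdfs://service/salesdata/Sales"`, matching the ⇔ mapping on the slide.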
12. Key Feature: Scalable Distributed Caching
(Diagram: clients for big data ETL, big data queries, and model training select a worker based on consistent hashing; cached pages A, B, C of s3://bucket/file1 and s3://bucket/file2 are distributed across Alluxio Workers 1 … n.)
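The worker-selection step can be sketched as a classic consistent-hash ring with virtual nodes. This is an assumed textbook design for illustration, not Alluxio's exact code; names like `ConsistentHashRing` and the vnode count are made up:

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Map each file path to a cache worker on a hash ring. Virtual nodes
    smooth the load; adding or removing a worker only remaps the keys
    adjacent to its ring positions."""
    def __init__(self, workers, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        # Stable 64-bit hash so every client picks the same worker.
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def worker_for(self, path: str) -> str:
        # Walk clockwise to the first vnode at or after the path's hash.
        idx = bisect_right(self.keys, self._hash(path)) % len(self.keys)
        return self.ring[idx][1]
```

Because the hash is deterministic, every client independently resolves `s3://bucket/file1` to the same worker, so cached pages are found without any central lookup.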
13. Alluxio's Evolution: Adapting to the Data-Intensive AI Era
(Timeline figure, 2014 → 2019 → 2023 → 2024, across the Big Data Analytics, Cloud Adoption, Data Explosion, and Generative AI eras.)
● 2014: Alluxio open source project founded at UC Berkeley AMPLab.
● Milestones along the way: Baidu deploys a 1000+ node cluster; Meta accelerates Presto workloads; Alluxio scales to 1 billion files, then 10+ billion files; 1000+ OSS contributors; 7/10 and later 9/10 of the top internet brands accelerated by Alluxio.
● Generative AI era (2023-2024): Zhihu accelerates LLM model training; a leading e-commerce brand, a Fortune 5 brand, and AliPay accelerate model training.
14. Trusted by Leading Enterprises Worldwide to Accelerate Data and AI
(Customer logos grouped by vertical: Tech & Internet, Financial Services, E-Commerce (including Zhihu), Telco & Media, and others.)
15. Meet-in-the-Middle Philosophy
● The old debate: "move data to compute" OR "move compute to data"?
● Alluxio's answer: why not both? The caching layer is where they meet.
● Key differentiator: Alluxio provides a data-specific cache shared by many applications, NOT an application-specific cache.
16. Holy Grail of Storage Systems
Source: https://jack-vanlightly.com/blog/2023/11/29/s3-express-one-zone-not-quite-what-i-hoped-for
● Cheap:
○ S3 Express One Zone costs about 5x S3 Standard, even with the recent price reduction.
○ An Alluxio cache spends money only on the hot data, leaving the rest at S3 Standard cost.
● Low latency:
○ Achieve sub-millisecond or single-digit-millisecond latency for fast responses.
● Scaling linearly in capacity:
○ Seamlessly scale to support tens of billions of objects and files.
● High availability:
○ No centralized metadata service, no single point of failure.
○ Caching in multiple AZs and regions, always backed by S3.
17. 03
The "How"
18. Foundation: Low Latency File Access with Alluxio
To Achieve Low Latency Data Access, we need Low Latency File Access first
● Asynchronous Event Loop:
Each Alluxio worker is built on a high-performance, asynchronous I/O framework. This enables non-blocking I/O with minimal
context switching and thread contention—two major contributors to latency in traditional blocking I/O systems. Its event-driven
model allows one worker instance to scale to thousands of concurrent connections while maintaining sub-millisecond
responsiveness.
● Off-Heap Page Storage on NVMe:
Alluxio leverages NVMe SSDs to store cached pages off-heap. This design allows for significantly higher storage density without overwhelming memory resources, offering a favorable balance between cost and access latency.
● Zero-Copy I/O:
To avoid unnecessary memory copies and to reduce CPU load, Alluxio employs zero-copy I/O techniques using sendfile()
and mmap(). These allow cached pages to be read directly from NVMe and transmitted over the network stack without
copying through user space, enhancing both throughput and latency.
Result: file access for ~1KB random reads from cache takes about 1ms.
19. Next: File Access → Parquet Query
● Builds on the fast file access work above: sub-millisecond 1KB random reads from a cached file in Alluxio.
● Add a ParquetReader-like API for single-field, single-row point lookups on Parquet files (stored in S3, cached in Alluxio).
○ Result: 46ms latency, between S3 Express (<10ms) and S3 (300-400ms).
● To achieve sub-millisecond query latency, further optimization is needed.
● Key assumptions for a point query:
○ Point query: SELECT col1, col2 WHERE id = x;
○ id is the primary key; small returned payload (<20KB).
○ The id column is sorted, OR min/max stats & indexes are available.
Hypothesis: a standard ParquetReader is too heavyweight for this specific task.
20. Take A Flamegraph and Verify Hypothesis
21. Leveraging Parquet's Structure for Speed
● Footer: holds the min and max of id for each row group, so we can quickly binary-search for the right row group.
● Column Index: within each row group, locates the page containing the right id, and gives the row number in that page and in the row group.
● Offset Index: quickly finds the other columns at the same row number.
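The first step, binary search over the footer's per-row-group statistics, can be sketched as follows. The `(min, max)` pairs stand in for the id column's statistics read from the Parquet footer; this relies on the slide's assumption that id is sorted, so the ranges are non-overlapping and ordered:

```python
def find_row_group(rg_stats, key):
    """Binary-search row groups by the id column's per-row-group
    (min, max) statistics. Returns the row-group index holding `key`,
    or None if the key falls outside every row group's range."""
    lo, hi = 0, len(rg_stats) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        rg_min, rg_max = rg_stats[mid]
        if key < rg_min:
            hi = mid - 1
        elif key > rg_max:
            lo = mid + 1
        else:
            return mid
    return None

# e.g. three row groups of 100 sorted ids each:
# find_row_group([(0, 99), (100, 199), (200, 299)], 123)  # → 1
```

The Column Index and Offset Index lookups that follow are the same idea one level down: binary search over per-page min/max to pick a page, then a direct offset to the sibling columns at the same row number.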
22. Key Ideas: Optimizing Parquet Point Lookups
● Cache Parquet metadata in Alluxio (reduce pointer chasing and lookups):
○ Cache the Parquet footer (file path -> footer).
○ Cache the Column Index & Offset Index ((file path, column) -> index data).
○ Why? Direct access to metadata avoids multiple S3 reads and complex parsing.
● Offload processing to the client (reduce CPU workload on the caching node):
○ Send back entire (small) compressed pages with offsets, rather than decoding on the Alluxio worker.
○ Return raw Protobuf bytes.
○ Why? Shifts CPU work to clients; good for read-heavy point lookups, trading some network for CPU savings on the cache node.
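The metadata-caching idea can be sketched as a small LRU keyed exactly as the slide describes: file path for footers, (file path, column) for indexes. An illustrative sketch, not Alluxio's code; `MetadataCache` and its `loader` callback are assumed names:

```python
from collections import OrderedDict

class MetadataCache:
    """LRU cache for Parquet metadata. On a hit, a point lookup skips
    the S3 reads and parsing needed to recover the footer or indexes."""
    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key, loader):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark most-recently-used
            return self._entries[key]
        value = loader()                    # e.g. S3 range read + parse
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least-recently-used
        return value
```

A worker would then call `cache.get(path, load_footer)` and `cache.get((path, "id"), load_index)`, paying the S3 and parsing cost only on the first lookup per file.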
23. Take A Flamegraph Again
24. Journey to Sub-MS: Iterative Optimizations
● We brought latency from 46ms on a cached Alluxio file down to 0.4ms using a specialized interface.
● Throughput: 20K QPS per 8-core storage worker node (i4i.2xlarge).
25. Cost Analysis of Alluxio vs S3 Express One Zone

|                           | S3 Express One Zone | EC2: i3en.metal | S3 Standard |
|---------------------------|---------------------|-----------------|-------------|
| List Price/TB/Month       | $110*               | $132**          | $23***      |
| Example Data Set Size     | 500 TB              | 500 TB          | 500 TB      |
| % of Data Set Stored      | 100%                | 20%             | 100%        |
| Actual Monthly Cost       | $55,000             | $13,200         | $11,500     |
| Latency                   | <1 ms               | <1 ms           | 100+ ms     |

* At the time of writing, S3 Express One Zone has a list price of $110/TB/month.
** At the time of writing, on-demand pricing for EC2 i3en.12xlarge instances with 30TB of NVMe capacity was $5.42/hour, which works out to $132/TB/month.
*** At the time of writing, S3 Standard has a list price of $23/TB/month.
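The footnoted EC2 figure can be checked with quick arithmetic (assuming roughly 730 hours per month, i.e. 24 x 365 / 12):

```python
hourly_rate = 5.42      # i3en.12xlarge on-demand, $/hour (footnote **)
nvme_tb = 30            # NVMe capacity of the instance, in TB
hours_per_month = 730   # ~24 * 365 / 12

per_tb_month = hourly_rate * hours_per_month / nvme_tb
print(round(per_tb_month))  # → 132
```

So the $132/TB/month list price in the table follows directly from the on-demand hourly rate, and the 20% hot-data footprint is what drops the actual monthly cost to $13,200.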
26. Summary: Alluxio Delivers a 1,000x Performance Boost
● Key results:
○ Reduced latency for Parquet point lookups from 411ms to ~0.4ms.
○ Achieved ~20K QPS per 8-core Alluxio worker.
● How we did it (key technical pillars):
○ Meet-in-the-middle philosophy: intelligently caching data where compute meets storage.
○ Low-latency file access foundation: async I/O, off-heap NVMe cache, zero-copy.
○ Targeted Parquet optimizations: metadata caching, client-offloaded processing, predicate/projection pushdown to cache nodes.
● Value proposition:
○ Sub-millisecond latency on PB-scale S3 data lakes, at a compelling cost point.
27.
28. A joint engineering collaboration between Alluxio and Salesforce. For more details:
https://www.alluxio.io/whitepaper/meet-in-the-middle-for-a-1-000x-performance-boost-querying-parquet-files-on-petabyte-scale-data-lakes
29. THANKS
Technical exchange welcome.