Accelerating AI Inference and Retrieval-Augmented Generation: a 1000x Parquet Query Performance Boost on Petabyte-Scale Data Lakes
1. Accelerating AI Inference and Retrieval-Augmented Generation: a 1000x Parquet Query Performance Boost on Petabyte-Scale Data Lakes
Bin Fan,
VP of Technology @Alluxio
binfan@alluxio.com
2.
3. 01
The Hook
4. The Challenge: Sub-Millisecond Point Lookups on Petabyte Data Lakes?
Executing point lookup queries like
“SELECT ID, DATA FROM TABLE WHERE ID = 123”
over:
• a partitioned Iceberg data lake (Parquet)
• tens or even hundreds of PB
• on object stores (e.g., S3)
• within sub-millisecond latency
(Diagram: Apps → Data Lake)
5. Why This Matters
- Agentic Memory: AI agents require instant recall of vast historical knowledge and context.
- Online Feature Store: real-time inference demands immediate access to fresh, relevant features.
- Real-Time Personalization & Recommendation: delivering personalized experiences in milliseconds is key to user engagement and conversion.
These use cases are driving the need for extremely low-latency access to large-scale data.
6. Common Approaches & Their Limitations: OLAP Engines
How it works: execute point lookup queries directly against S3 Parquet via an OLAP engine.
Pros:
- Mature ecosystem, well supported.
- Handles complex analytics.
Cons:
- Overkill: heavyweight for simple key-value lookups.
- High latency & limited concurrency: query planning, scheduling, and full Parquet scan overheads make sub-millisecond latency unachievable.
(Diagram: Agentic Apps → Query Engine → Data Lake)
7. Common Approaches & Their Limitations: In-Memory KV Stores
How it works: export tables or the relevant data portions into an in-memory KV store.
Pros:
- Low latency: fast key-value access.
Cons:
- Prohibitive cost at scale: extremely expensive to fit petabytes of data into memory.
- Data sync complexity & staleness: requires ETL pipelines, leading to data lag and consistency issues (the "Dual-Store Problem").
- Operational overhead: managing two separate data systems.
(Diagram: Agentic Apps → In-Memory KV Store, populated by a data copy/import from the Data Lake)
8. "Parquet on S3": Why Is It Not Natively Sub-Millisecond?
How it works: Parquet files are self-indexed, so can we point-query them directly?
Potential latency bottlenecks:
● Multiple S3 object store round-trip times (RTTs).
● Row-group/page decoding overhead.
Result: p95 latency is typically > 300ms, far from sub-millisecond.
(Diagram: Apps → Data Lake)
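To see why the round trips dominate, the dependency chain of a naive point lookup can be sketched schematically. This is a simplified stand-in, not real Parquet parsing: `RangeReadCounter` is a hypothetical in-memory object that just serves range GETs and counts them. The point is that each read needs the result of the previous one, so the round trips cannot be pipelined.

```python
class RangeReadCounter:
    """Hypothetical in-memory stand-in for an S3 object: serves range
    GETs from a byte string and counts the round trips."""
    def __init__(self, blob: bytes):
        self.blob = blob
        self.rtts = 0

    def read(self, offset: int, length: int) -> bytes:
        self.rtts += 1  # each range GET is one network round trip to S3
        return self.blob[offset:offset + length]


def naive_point_lookup(obj: RangeReadCounter) -> bytes:
    # RTT 1: read the footer at the end of the file to locate the metadata.
    footer = obj.read(len(obj.blob) - 8, 8)
    # RTT 2: read the column/offset indexes that the footer points at.
    index = obj.read(0, 8)
    # RTT 3: read the data page the index points at, then decode it.
    page = obj.read(8, 8)
    return page
```

With tens to hundreds of milliseconds per S3 round trip, three sequential dependent reads alone explain a p95 well past 300ms, before any decoding cost.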
9. 02
Alluxio's Philosophy
& Core Tech
10. Introducing Alluxio: Open-Sourced from UC Berkeley AMPLab in 2014
● 1,200+ contributors & growing
● 10,000+ Slack community members
● Top 10 most critical Java-based open source projects
● Top 100 most valuable repositories out of 96 million on GitHub
11. Key Feature: Unified Interface to Storage
(Diagram: the Alluxio namespace "/" mounts an on-prem data warehouse at hdfs://service/salesdata and AWS us-east-1 at s3://bucket/Users, exposing Data/Sales/Reports and Users/Alice/Bob under one tree.)
● Alluxio can be viewed as a logical file system.
○ Multiple different storage services can be mounted into the same logical Alluxio namespace.
● An Alluxio path is mapped to a persistent storage address.
○ alluxio:///Data/Sales ⇔ hdfs://service/salesdata/Sales
● Users can pick either the logical Alluxio path or the original storage address.
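The path mapping above can be sketched as a longest-prefix match against a mount table. This is an illustrative sketch, not Alluxio's implementation; the mount entries come from the slide's example (hdfs://service/salesdata and s3://bucket/Users):

```python
# Mount table: logical Alluxio prefix -> underlying storage address.
MOUNT_TABLE = {
    "/Data": "hdfs://service/salesdata",
    "/Users": "s3://bucket/Users",
}

def resolve(alluxio_path: str) -> str:
    """Map a logical Alluxio path to its persistent storage address
    via longest-prefix match, so nested mounts win over shallower ones."""
    for mount, ufs in sorted(MOUNT_TABLE.items(),
                             key=lambda kv: len(kv[0]), reverse=True):
        if alluxio_path == mount or alluxio_path.startswith(mount + "/"):
            return ufs + alluxio_path[len(mount):]
    raise KeyError(f"no mount covers {alluxio_path}")
```

For example, `resolve("/Data/Sales")` yields `"hdfs://service/salesdata/Sales"`, matching the ⇔ mapping on the slide.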
12. Key Feature: Scalable Distributed Caching
(Diagram: clients for big data ETL, big data queries, and model training select a worker based on consistent hashing; cached pages A, B, C of s3://bucket/file1 and s3://bucket/file2 are distributed across Alluxio Workers 1 … n.)
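The worker-selection step can be sketched as a classic consistent-hash ring with virtual nodes. This is an assumed textbook design for illustration, not Alluxio's exact code; names like `ConsistentHashRing` and the vnode count are made up:

```python
import hashlib
from bisect import bisect_right

class ConsistentHashRing:
    """Map each file path to a cache worker on a hash ring. Virtual nodes
    smooth the load; adding or removing a worker only remaps the keys
    adjacent to its ring positions."""
    def __init__(self, workers, vnodes: int = 100):
        self.ring = sorted(
            (self._hash(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        # Stable 64-bit hash so every client picks the same worker.
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def worker_for(self, path: str) -> str:
        # Walk clockwise to the first vnode at or after the path's hash.
        idx = bisect_right(self.keys, self._hash(path)) % len(self.keys)
        return self.ring[idx][1]
```

Because the hash is deterministic, every client independently resolves `s3://bucket/file1` to the same worker, so cached pages are found without any central lookup.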
13. Alluxio's Evolution: Adapting to the Data-Intensive AI Era
(Timeline figure, 2014 → 2019 → 2023 → 2024, across the Big Data Analytics, Cloud Adoption, Data Explosion, and Generative AI eras.)
● 2014: Alluxio open source project founded at UC Berkeley AMPLab.
● Milestones along the way: Baidu deploys a 1000+ node cluster; Meta accelerates Presto workloads; Alluxio scales to 1 billion files, then 10+ billion files; 1000+ OSS contributors; 7/10 and later 9/10 of the top internet brands accelerated by Alluxio.
● Generative AI era (2023-2024): Zhihu accelerates LLM model training; a leading e-commerce brand, a Fortune 5 brand, and AliPay accelerate model training.
14. Trusted by Leading Enterprises Worldwide to Accelerate Data and AI
(Customer logos grouped by vertical: Tech & Internet, Financial Services, E-Commerce (including Zhihu), Telco & Media, and others.)
15. Meet-in-the-Middle Philosophy
● The old debate: "move data to compute" OR "move compute to data"?
● Alluxio's answer: why not both? The caching layer is where they meet.
● Key differentiator: Alluxio provides a data-specific cache shared by many applications, NOT an application-specific cache.
16. Holy Grail of Storage Systems
Source: https://jack-vanlightly.com/blog/2023/11/29/s3-express-one-zone-not-quite-what-i-hoped-for
● Cheap:
○ S3 Express One Zone costs about 5x S3 Standard, even with the recent price reduction.
○ An Alluxio cache spends money only on the hot data, leaving the rest at S3 Standard cost.
● Low latency:
○ Achieve sub-millisecond or single-digit-millisecond latency for fast responses.
● Scaling linearly in capacity:
○ Seamlessly scale to support tens of billions of objects and files.
● High availability:
○ No centralized metadata service, no single point of failure.
○ Caching in multiple AZs and regions, always backed by S3.
17. 03
The "How"
18. Foundation: Low Latency File Access with Alluxio
To Achieve Low Latency Data Access, we need Low Latency File Access first
● Asynchronous Event Loop:
Each Alluxio worker is built on a high-performance, asynchronous I/O framework. This enables non-blocking I/O with minimal
context switching and thread contention—two major contributors to latency in traditional blocking I/O systems. Its event-driven
model allows one worker instance to scale to thousands of concurrent connections while maintaining sub-millisecond
responsiveness.
● Off-Heap Page Storage on NVMe:
Alluxio leverages NVMe SSDs to store cached pages off-heap. This design allows for significantly higher storage density without overwhelming memory resources, offering a favorable balance between cost and access latency.
● Zero-Copy I/O:
To avoid unnecessary memory copies and to reduce CPU load, Alluxio employs zero-copy I/O techniques using sendfile()
and mmap(). These allow cached pages to be read directly from NVMe and transmitted over the network stack without
copying through user space, enhancing both throughput and latency.
Result: file access for ~1KB random reads from cache takes about 1ms.
19. Next: File Access → Parquet Query
● Builds on the fast file access work above: sub-millisecond 1KB random reads from a cached file in Alluxio.
● Add a ParquetReader-like API for single-field, single-row point lookups on Parquet files (stored in S3, cached in Alluxio).
○ Result: 46ms latency, between S3 Express (<10ms) and S3 (300-400ms).
● To achieve sub-millisecond query latency, further optimization is needed.
● Key assumptions for a point query:
○ Point query: SELECT col1, col2 WHERE id = x;
○ id is the primary key; small returned payload (<20KB).
○ The id column is sorted, OR min/max stats & indexes are available.
Hypothesis: a standard ParquetReader is too heavyweight for this specific task.
20. Take A Flamegraph and Verify Hypothesis
21. Leveraging Parquet's Structure for Speed
● Footer: holds the min and max of id for each row group, so we can quickly binary-search for the right row group.
● Column Index: within each row group, locates the page containing the right id, and gives the row number in that page and in the row group.
● Offset Index: quickly finds the other columns at the same row number.
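The first step, binary search over the footer's per-row-group statistics, can be sketched as follows. The `(min, max)` pairs stand in for the id column's statistics read from the Parquet footer; this relies on the slide's assumption that id is sorted, so the ranges are non-overlapping and ordered:

```python
def find_row_group(rg_stats, key):
    """Binary-search row groups by the id column's per-row-group
    (min, max) statistics. Returns the row-group index holding `key`,
    or None if the key falls outside every row group's range."""
    lo, hi = 0, len(rg_stats) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        rg_min, rg_max = rg_stats[mid]
        if key < rg_min:
            hi = mid - 1
        elif key > rg_max:
            lo = mid + 1
        else:
            return mid
    return None

# e.g. three row groups of 100 sorted ids each:
# find_row_group([(0, 99), (100, 199), (200, 299)], 123)  # → 1
```

The Column Index and Offset Index lookups that follow are the same idea one level down: binary search over per-page min/max to pick a page, then a direct offset to the sibling columns at the same row number.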
22. Key Ideas: Optimizing Parquet Point Lookups
● Cache Parquet metadata in Alluxio (reduce pointer chasing and lookups):
○ Cache the Parquet footer (file path -> footer).
○ Cache the Column Index & Offset Index ((file path, column) -> index data).
○ Why? Direct access to metadata avoids multiple S3 reads and complex parsing.
● Offload processing to the client (reduce CPU workload on the caching node):
○ Send back entire (small) compressed pages with offsets, rather than decoding on the Alluxio worker.
○ Return raw Protobuf bytes.
○ Why? Shifts CPU work to clients; good for read-heavy point lookups, trading some network for CPU savings on the cache node.
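The metadata-caching idea can be sketched as a small LRU keyed exactly as the slide describes: file path for footers, (file path, column) for indexes. An illustrative sketch, not Alluxio's code; `MetadataCache` and its `loader` callback are assumed names:

```python
from collections import OrderedDict

class MetadataCache:
    """LRU cache for Parquet metadata. On a hit, a point lookup skips
    the S3 reads and parsing needed to recover the footer or indexes."""
    def __init__(self, capacity: int = 10_000):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key, loader):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark most-recently-used
            return self._entries[key]
        value = loader()                    # e.g. S3 range read + parse
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least-recently-used
        return value
```

A worker would then call `cache.get(path, load_footer)` and `cache.get((path, "id"), load_index)`, paying the S3 and parsing cost only on the first lookup per file.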
23. Take A Flamegraph Again
24. Journey to Sub-MS: Iterative Optimizations
● We brought latency from 46ms on a cached Alluxio file down to 0.4ms using a specialized interface.
● Throughput: 20K QPS per 8-core storage worker node (i4i.2xlarge).
25. Cost Analysis of Alluxio vs S3 Express One Zone

|                           | S3 Express One Zone | EC2: i3en.metal | S3 Standard |
|---------------------------|---------------------|-----------------|-------------|
| List Price/TB/Month       | $110*               | $132**          | $23***      |
| Example Data Set Size     | 500 TB              | 500 TB          | 500 TB      |
| % of Data Set Stored      | 100%                | 20%             | 100%        |
| Actual Monthly Cost       | $55,000             | $13,200         | $11,500     |
| Latency                   | <1 ms               | <1 ms           | 100+ ms     |

* At the time of writing, S3 Express One Zone has a list price of $110/TB/month.
** At the time of writing, on-demand pricing for EC2 i3en.12xlarge instances with 30TB of NVMe capacity was $5.42/hour, which works out to $132/TB/month.
*** At the time of writing, S3 Standard has a list price of $23/TB/month.
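The footnoted EC2 figure can be checked with quick arithmetic (assuming roughly 730 hours per month, i.e. 24 x 365 / 12):

```python
hourly_rate = 5.42      # i3en.12xlarge on-demand, $/hour (footnote **)
nvme_tb = 30            # NVMe capacity of the instance, in TB
hours_per_month = 730   # ~24 * 365 / 12

per_tb_month = hourly_rate * hours_per_month / nvme_tb
print(round(per_tb_month))  # → 132
```

So the $132/TB/month list price in the table follows directly from the on-demand hourly rate, and the 20% hot-data footprint is what drops the actual monthly cost to $13,200.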
26. Summary: Alluxio Delivers a 1,000x Performance Boost
● Key results:
○ Reduced latency for Parquet point lookups from 411ms to ~0.4ms.
○ Achieved ~20K QPS per 8-core Alluxio worker.
● How we did it (key technical pillars):
○ Meet-in-the-middle philosophy: intelligently caching data where compute meets storage.
○ Low-latency file access foundation: async I/O, off-heap NVMe cache, zero-copy.
○ Targeted Parquet optimizations: metadata caching, client-offloaded processing, predicate/projection pushdown to cache nodes.
● Value proposition:
○ Sub-millisecond latency on PB-scale S3 data lakes, at a compelling cost point.
27.
28. A joint engineering collaboration between Alluxio and Salesforce. For more details:
https://www.alluxio.io/whitepaper/meet-in-the-middle-for-a-1-000x-performance-boost-querying-parquet-files-on-petabyte-scale-data-lakes
29. THANKS
Technical exchange welcome.