PayPal Introduces Dione, an Open-Source Spark Indexing Library
Summary
The Spark, Hive and HDFS (Hadoop Distributed File System) ecosystems are online analytical processing (OLAP)-oriented technologies, designed to process huge amounts of data with full scans. From time to time, however, users want to use the same data for more ad-hoc tasks:
- Multi-row load — explore small subsets (typically ~1%) of the data by specific IDs (not random samples).
- Single-row fetch — for example, building a serving layer to fetch a specific row upon a REST API request.
These kinds of tasks are traditionally solved using dedicated storage and technology stacks (HBase, Cassandra, etc.) which require data duplication and add significant operational costs.
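The gap between these access patterns can be sketched in plain Python (a toy illustration of the concept, not Dione's API): a full scan touches every row no matter how few IDs are requested, while an index answers in time proportional to the request.

```python
# Toy illustration (not Dione): full-scan filtering vs. index-based lookup.
# OLAP engines answer "give me these 3 IDs" by scanning all rows;
# a key-value index answers with one lookup per requested ID.
rows = [{"id": i, "payload": f"row-{i}"} for i in range(100_000)]
wanted = {42, 999, 50_000}

# Full scan: cost is O(len(rows)), regardless of how few IDs we want.
scan_result = [r for r in rows if r["id"] in wanted]

# Index-based fetch: one-time index build, then O(len(wanted)) per query.
index = {r["id"]: r for r in rows}
fetch_result = [index[i] for i in wanted]
```

Dedicated stores like HBase earn their keep by maintaining such an index persistently; the cost is a second copy of the data and a second system to operate.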
In this post, we describe our journey toward solving this challenge using only Spark and HDFS. We start by introducing an example use case, generalize it and define the requirements, outline some candidate solutions, and finally dive into our final solution.


