PayPal Introduces Dione, an Open-Source Spark Indexing Library

摘要

Spark, Hive and HDFS (Hadoop Distributed File Systems) ecosystems are online analytical processing (OLAP)-oriented technologies. They are designed to process huge amounts of data with full scans. From time to time, users want to use the same data for more ad-hoc oriented tasks:

  • Multi-row load— explore small sets (typically 1%) of the data by specific IDs (not random).
  • Single-row fetch — for example, building a serving layer to fetch a specific row upon a REST-API request.

These kinds of tasks are traditionally solved using dedicated storage and technology stacks (HBase, Cassandra, etc.) which require data duplication and add significant operational costs.

In this post, we describe our journey for solving this challenge by using only Spark and HDFS. We will start by introducing an example use case, generalize and define the requirements, suggest some optional solutions, and finally dive into our final solution.

欢迎在评论区写下你对这篇文章的看法。

评论

首页 - Wiki
Copyright © 2011-2026 iteam. Current version is 2.154.0. UTC+08:00, 2026-02-24 10:50
浙ICP备14020137号-1 $访客地图$