PayPal Introduces Dione, an Open-Source Spark Indexing Library
Summary
The Spark, Hive and HDFS (Hadoop Distributed File System) ecosystems are online analytical processing (OLAP)-oriented technologies, designed to process huge amounts of data with full scans. From time to time, however, users want to use the same data for more ad-hoc tasks:
- Multi-row load — explore small subsets (typically ~1%) of the data by specific IDs (not random samples).
- Single-row fetch — for example, building a serving layer to fetch a specific row upon a REST API request.
These kinds of tasks are traditionally solved using dedicated storage and technology stacks (HBase, Cassandra, etc.) which require data duplication and add significant operational costs.
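The gap between these access patterns can be sketched in plain Python (a toy illustration of the concept, not Dione's API): a full scan touches every row no matter how few IDs are requested, while an index answers in time proportional to the request.

```python
# Toy illustration (not Dione): full-scan filtering vs. index-based lookup.
# OLAP engines answer "give me these 3 IDs" by scanning all rows;
# a key-value index answers with one lookup per requested ID.
rows = [{"id": i, "payload": f"row-{i}"} for i in range(100_000)]
wanted = {42, 999, 50_000}

# Full scan: cost is O(len(rows)), regardless of how few IDs we want.
scan_result = [r for r in rows if r["id"] in wanted]

# Index-based fetch: one-time index build, then O(len(wanted)) per query.
index = {r["id"]: r for r in rows}
fetch_result = [index[i] for i in wanted]
```

Dedicated stores like HBase earn their keep by maintaining such an index persistently; the cost is a second copy of the data and a second system to operate.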
In this post, we describe our journey toward solving this challenge using only Spark and HDFS. We start by introducing an example use case, generalize it and define the requirements, outline some candidate solutions, and finally dive into our final solution.


