PayPal Introduces Dione, an Open-Source Spark Indexing Library

By Ohad Raviv and Shay Elbaz

Photo by Maksym Kaharlytsky on Unsplash

PayPal’s products, both public and internal, rely heavily on data processing that uses a wide variety of techniques and technologies. We, the engineering team in PayPal’s global data science group, are responsible for providing the underlying solutions for these data products. In this post we share an interesting use case we encountered and how we solved it.

Intro

The Spark, Hive, and HDFS (Hadoop Distributed File System) ecosystems are oriented toward online analytical processing (OLAP): they are designed to process huge amounts of data with full scans. From time to time, however, users want to use the same data for more ad hoc tasks:

  • Multi-row load: explore a small subset of the data (typically 1%) selected by specific IDs rather than by random sampling.
  • Single-row fetch: for example, building a serving layer that fetches a specific row in response to a REST API request.
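
To make the two access patterns concrete, here is a minimal sketch in plain Python of what a key-to-offset index buys over a full scan. It is an illustration only, not Dione's API: a local tab-separated file stands in for a data file on HDFS, and every name in it (`write_rows`, `build_index`, `fetch`, `multi_fetch`) is hypothetical.

```python
def write_rows(path, rows):
    """Write (key, value) rows as 'key<TAB>value' lines (stand-in for a data file)."""
    with open(path, "wb") as f:
        for key, value in rows:
            f.write(f"{key}\t{value}\n".encode("utf-8"))


def build_index(path):
    """Scan the file once and record the byte offset where each key's row starts."""
    index = {}
    with open(path, "rb") as f:
        offset = f.tell()
        line = f.readline()
        while line:
            key = line.split(b"\t", 1)[0].decode("utf-8")
            index[key] = offset
            offset = f.tell()
            line = f.readline()
    return index


def _read_value_at(f, offset):
    """Seek to a row start and return the value part of that row."""
    f.seek(offset)
    line = f.readline().decode("utf-8").rstrip("\n")
    return line.split("\t", 1)[1]


def fetch(path, index, key):
    """Single-row fetch: jump straight to the row instead of scanning the file."""
    offset = index.get(key)
    if offset is None:
        return None
    with open(path, "rb") as f:
        return _read_value_at(f, offset)


def multi_fetch(path, index, keys):
    """Multi-row load: read only the requested IDs, visiting them in offset order."""
    wanted = sorted((index[k], k) for k in keys if k in index)
    with open(path, "rb") as f:
        return {key: _read_value_at(f, offset) for offset, key in wanted}


if __name__ == "__main__":
    import os
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "rows.tsv")
    write_rows(path, [("m1", "alpha"), ("m2", "beta"), ("m3", "gamma")])
    index = build_index(path)
    print(fetch(path, index, "m2"))
    print(multi_fetch(path, index, ["m3", "m1"]))
```

The index costs one full scan to build, after which each lookup reads a single row. Whether that trade pays off depends on how often the data changes relative to how often it is queried by ID.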

These kinds of tasks are traditionally solved with dedicated storage and technology stacks (HBase, Cassandra, etc.), which require duplicating the data and add significant operational costs.

In this post, we describe our journey to solve this challenge using only Spark and HDFS. We start by introducing an example use case, then generalize it and define the requirements, consider some possible solutions, and finally dive into the solution we chose.

At PayPal, we have more than 30 million merchants on our platform. To help us detect potentially fraudulent sellers or violations of PayPal’s acceptable use policies, we periodically use automated...
