Middleware and Databases: Spark

iQIYI Big Data Acceleration: From Hive to Spark SQL

Moving from Hive to Spark SQL brought a 67% speedup, helping iQIYI's big data business improve efficiency and grow revenue.

Spark Analysers: Catching Anti-Patterns In Spark Apps

Apache Spark™ is a widely used open source distributed computing engine. It is one of the main components of Uber’s data stack.

Spark is the primary batch compute engine at Uber. Like any other framework, Spark comes with its own set of tradeoffs.

An Analysis of Hive and Spark Partitioning Strategies

Hive and Spark are the most representative distributed processing engines in the offline data processing ecosystem; their partitioning strategies share some similarities but also differ in notable ways.
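One concrete point of overlap: Spark can write Hive-style partitioned tables directly, producing the same one-directory-per-value layout as Hive's dynamic partition insert. A minimal Scala sketch, with hypothetical table names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-style-partitioning")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical source table.
val orders = spark.table("ods.orders")

// Storage-level partitioning: one directory per distinct dt value,
// matching the layout Hive produces with dynamic partition inserts.
orders.write
  .partitionBy("dt")
  .mode("overwrite")
  .saveAsTable("dw.orders_partitioned") // hypothetical target table
```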

Building a Distributed HA Environment for Hadoop and Spark

To do a good job, one must first sharpen one's tools: before diving deeper into big data technologies, building your own local Hadoop and Spark environment from scratch is a valuable foundation for studying the rest of the big data ecosystem. Written from a developer's perspective, this article skips lengthy coverage of the basics and, using the latest Hadoop and Spark releases, walks you through the environment setup step by step.

Spark AQE SkewedJoin: Practice and Optimization at ByteDance

One article to understand how to use Spark AQE SkewedJoin.
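For context, skew-join handling in open-source Spark 3.x is controlled by a small set of AQE settings. A minimal sketch of enabling them; the values shown are the community defaults:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-skew-join")
  // AQE must be on for skew handling to take effect (Spark 3.0+).
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  // A partition counts as skewed if it exceeds this multiple of the
  // median partition size...
  .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
  // ...and is also larger than this absolute threshold.
  .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
  .getOrCreate()
```

With these set, AQE splits oversized shuffle partitions at runtime and joins each split against a copy of the matching partition on the other side.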

A Lineage Analysis Solution for Spark Applications

Based on a study of the open-source spline project, this article provides a solution for enriching lineage analysis of Spark applications, together with an in-depth look at the underlying principles.

Recommender Systems: Implementing Collaborative Filtering in Spark

The best way to fully understand a paper is to reproduce it; along the way you will run into all sorts of questions and theoretical details.
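As a concrete starting point for such a reproduction, Spark MLlib ships ALS, a standard matrix-factorization approach to collaborative filtering. A minimal sketch over made-up ratings:

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("als-cf").getOrCreate()
import spark.implicits._

// Toy (userId, itemId, rating) triples.
val ratings = Seq(
  (1, 10, 4.0f), (1, 20, 2.0f),
  (2, 10, 5.0f), (2, 30, 3.0f)
).toDF("userId", "itemId", "rating")

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setRank(8)       // latent-factor dimension
  .setMaxIter(10)
  .setRegParam(0.1)

val model = als.fit(ratings)
model.recommendForAllUsers(3).show(truncate = false) // top-3 items per user
```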

Design and Implementation of an Offline Spark Development Framework

This article presents the design and implementation of an offline Spark development framework that makes development simple and easy to pick up, while also solving the everyday pain point of data backfilling.

How to Optimize Your Apache Spark Application with Partitions

We can control the way Spark partitions our data and use it to parallelize computations on our dataset.
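A minimal Scala sketch of the main knobs, assuming hypothetical input and output paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-tuning").getOrCreate()

// Default partition count for Spark SQL shuffles (200 out of the box).
spark.conf.set("spark.sql.shuffle.partitions", "400")

val df = spark.read.parquet("/data/events") // hypothetical input

// Full shuffle into 400 partitions keyed by userId, so downstream
// per-user work is spread evenly across the cluster.
val repartitioned = df.repartition(400, df("userId"))

// Reduce the partition count without a full shuffle, e.g. to avoid
// writing hundreds of tiny files.
repartitioned.coalesce(50)
  .write.mode("overwrite").parquet("/data/events_out") // hypothetical output
```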

Shuttle: A Highly Available, High-Performance Spark Remote Shuffle Service

Shuttle is a highly available, high-performance Spark Remote Shuffle Service. It supports AQE and gives the Spark engine more stable and more efficient computation.

Spark SQL Column Lineage in Practice at vivo Internet

Column-level lineage helps us understand how data is produced. While exploring the problem, we found that this capability can be implemented elegantly through Spark's extension mechanism.
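The article builds on Spark's extension mechanism; as a simpler illustration of the hooks Spark exposes for this, the sketch below (not vivo's implementation) registers a QueryExecutionListener and prints the output columns of each analyzed plan, which is the natural starting point for tracing column lineage:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Hypothetical listener: a real lineage tool would recurse through the
// plan's expression trees instead of just printing the top-level output.
class LineageListener extends QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    // Each output attribute carries a unique exprId that lineage
    // extraction can follow back through child plans.
    qe.analyzed.output.foreach { attr =>
      println(s"output column: ${attr.name}#${attr.exprId.id}")
    }
  }
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}

val spark = SparkSession.builder().appName("lineage-demo").getOrCreate()
spark.listenerManager.register(new LineageListener)
```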

How LyftLearn Democratizes Distributed Compute through Kubernetes Spark and Fugue

In a previous blog post, we discussed LyftLearn’s infrastructure built on top of Kubernetes. In this post, we will focus on the compute layer of LyftLearn, and will discuss how LyftLearn solves some of the major pain points faced by Lyft’s machine learning practitioners.

Applying Spark to Supply Chain Accounting: A Summary

This article summarizes practical experience applying Spark to supply chain accounting.

Spark SQL Optimization in ByteDance's EMR Product

Data lake engines such as Hudi and Iceberg are seeing increasingly wide use, and many B2B customers need them when working with Spark SQL. ByteDance's EMR product therefore had to integrate these data lake engines into Spark SQL, and ran into a great many problems along the way.
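For reference, wiring Iceberg into open-source Spark SQL follows Iceberg's documented configuration; in this sketch the catalog name and warehouse path are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("iceberg-on-spark-sql")
  // Iceberg's SQL extension rules.
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  // A named Iceberg catalog backed by an HDFS warehouse directory.
  .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.demo.type", "hadoop")
  .config("spark.sql.catalog.demo.warehouse", "hdfs:///warehouse/iceberg")
  .getOrCreate()

spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("CREATE TABLE IF NOT EXISTS demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
spark.sql("SELECT count(*) FROM demo.db.events").show()
```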

JD.com Spark's Runtime Filter Join Optimization Based on Bloom Filters

This article discusses the Runtime Filter Join optimization that JD.com's Spark compute engine team built on the Bloom filter algorithm, along with its exploration and practice in JD's major promotion scenarios.
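JD's mechanism is in-house, but community Spark (3.3 and later) ships a comparable Bloom-filter runtime filter that can be toggled through configuration. A minimal sketch; the fact and dim tables are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("runtime-bloom-filter")
  .config("spark.sql.optimizer.runtime.bloomFilter.enabled", "true")
  // Upper bound on distinct values the Bloom filter is built over.
  .config("spark.sql.optimizer.runtime.bloomFilter.maxNumItems", "4000000")
  .getOrCreate()

// With the filter enabled, a selective predicate on the dimension side can
// prune fact-table rows before the join's shuffle:
spark.sql(
  """SELECT f.*
    |FROM fact f JOIN dim d ON f.k = d.k
    |WHERE d.category = 'promo'""".stripMargin).show()
```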

PayPal Introduces Dione, an Open-Source Spark Indexing Library

Spark, Hive and HDFS (Hadoop Distributed File System) ecosystems are online analytical processing (OLAP)-oriented technologies. They are designed to process huge amounts of data with full scans. From time to time, users want to use the same data for more ad-hoc oriented tasks:

  • Multi-row load — explore small sets (typically 1%) of the data by specific IDs (not random).
  • Single-row fetch — for example, building a serving layer to fetch a specific row upon a REST-API request.

These kinds of tasks are traditionally solved using dedicated storage and technology stacks (HBase, Cassandra, etc.) which require data duplication and add significant operational costs.

In this post, we describe our journey for solving this challenge by using only Spark and HDFS. We will start by introducing an example use case, generalize and define the requirements, suggest some optional solutions, and finally dive into our final solution.
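To preview the core idea only (this is deliberately not Dione's API): persist a small index mapping each key to the file that contains it, then serve point lookups by scanning just that one file. A rough Scala sketch with hypothetical paths:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

val spark = SparkSession.builder().appName("index-sketch").getOrCreate()
import spark.implicits._

// Build the index once, next to the data.
spark.read.parquet("/data/events")
  .select($"id", input_file_name().as("file"))
  .write.mode("overwrite").parquet("/data/events_index")

// Single-row fetch: resolve the file via the index, then scan only it.
val index = spark.read.parquet("/data/events_index")
val file  = index.filter($"id" === 42L).select("file").head.getString(0)
spark.read.parquet(file).filter($"id" === 42L).show()
```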
