Middleware and Databases: Spark
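Spark SQL Field-Level Lineage in Practice at vivo Internet
Field-level lineage helps us understand how data is produced and processed. While exploring the problem, we found that it can be implemented cleanly through Spark's extension mechanism.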
How LyftLearn Democratizes Distributed Compute through Kubernetes Spark and Fugue
In a previous blog post, we discussed LyftLearn’s infrastructure built on top of Kubernetes. In this post, we will focus on the compute layer of LyftLearn, and will discuss how LyftLearn solves some of the major pain points faced by Lyft’s machine learning practitioners.
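A Summary of Spark Applications in Supply Chain Accounting
This article summarizes how Spark has been applied to supply chain accounting in the author's work.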
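Optimizing Spark SQL in ByteDance's EMR Product
Data lake engines such as Hudi and Iceberg are increasingly widely used, and many enterprise customers need to use them from Spark SQL. ByteDance's EMR product therefore had to integrate these data lake engines into Spark SQL, and ran into a large number of problems along the way.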
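JD Spark's Runtime Filter Join Optimization Based on the Bloom Filter Algorithm
This article discusses the Bloom Filter-based Runtime Filter Join optimization developed by JD's Spark compute engine R&D team, and the exploration and practice behind it in support of JD's major sales promotions.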
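As a rough illustration of the general technique named in this entry, the following is a minimal sketch of a Bloom-filter runtime filter in plain Python. It is not JD's implementation, which the summary does not describe. The `BloomFilter` class, the `runtime_filter_join` function, the filter sizes, and the sample rows are all hypothetical. The idea shown is only this: build a Bloom filter from the join keys of the small side, then use it to discard rows from the large side that cannot match before the join itself runs.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter over string keys (illustrative, not tuned)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(key))


def runtime_filter_join(small_rows, large_rows):
    """Inner-join two lists of (key, value) rows, pre-filtering the large side."""
    bloom = BloomFilter()
    small_by_key = {}
    for key, value in small_rows:
        bloom.add(key)
        small_by_key.setdefault(key, []).append(value)

    # Runtime filter: drop large-side rows whose key cannot possibly match.
    survivors = [row for row in large_rows if bloom.might_contain(row[0])]

    # The exact join still checks keys, so false positives are harmless.
    joined = [
        (key, small_value, large_value)
        for key, large_value in survivors
        for small_value in small_by_key.get(key, [])
    ]
    return joined, len(survivors)


# Hypothetical sample data: a tiny dimension side and a larger fact side.
small = [("sku-1", "phone"), ("sku-2", "laptop")]
large = [(f"sku-{i}", i * 10) for i in range(1, 101)]
joined, scanned = runtime_filter_join(small, large)
print(joined)
print(f"rows surviving the filter: {scanned} of {len(large)}")
```

A Bloom filter never produces false negatives, so the join result is the same as without the filter. It can produce false positives, so a few non-matching rows may survive and are removed by the exact key lookup.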
PayPal Introduces Dione, an Open-Source Spark Indexing Library
Spark, Hive and HDFS (Hadoop Distributed File System) ecosystems are online analytical processing (OLAP)-oriented technologies. They are designed to process huge amounts of data with full scans. From time to time, users want to use the same data for more ad-hoc oriented tasks:
- Multi-row load: explore small sets (typically 1%) of the data by specific IDs (not random).
- Single-row fetch: for example, building a serving layer to fetch a specific row upon a REST-API request.
These kinds of tasks are traditionally solved using dedicated storage and technology stacks (HBase, Cassandra, etc.) which require data duplication and add significant operational costs.
In this post, we describe our journey for solving this challenge by using only Spark and HDFS. We will start by introducing an example use case, generalize and define the requirements, suggest some optional solutions, and finally dive into our final solution.
Interactive Querying with Apache Spark SQL at Pinterest
To achieve our mission of bringing everyone inspiration through our visual discovery engine, Pinterest relies heavily on making data-driven decisions to improve the Pinner experience for over 475 million monthly active users. Reliable, fast, and scalable interactive querying is essential to make those data-driven decisions possible. In the past, we published how Presto at Pinterest serves this function. Here, we’ll share how we built a scalable, reliable, and efficient interactive querying platform that processes hundreds of petabytes of data daily with Apache Spark SQL. Through an elaborate discussion on various architecture choices, challenges along the way, and our solutions for those challenges, we share how we made interactive querying with Spark SQL a success.
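TensorFlow for Java + Spark-Scala: Applying a Distributed Machine Learning Compute Framework in Practice
In Qunar's intelligent risk control scenarios, the risk control engineering team often applies algorithmic models, typically neural networks, decision trees, and the like, to solve problems in complex scenarios. Taking a model through the whole process from training to deployed prediction requires not only the model algorithm but also a supporting technical framework. This article shares practical experience, from the prediction-service deployment stage, in building a distributed machine learning compute framework based on TensorFlow for Java and Spark-Scala.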
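Spark on K8S in Practice at Youzan
As the business has grown and iterated rapidly in recent years, big data costs have risen with it. Optimizing costs and building low-cost, high-efficiency underlying services became the main theme for Youzan's data infrastructure platform in 2020. This article describes how Youzan's offline computing platform, after seven years of development, embraced cloud native: through containerization, elastic scaling, and off-peak co-location of big data components, it achieved falling costs while the business grew severalfold.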
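Migrating Hive SQL to Spark SQL in Practice at DiDi
After DiDi's SQL jobs were migrated from Hive to Spark, the share of Spark SQL jobs rose to 85%, job run time fell by 40%, the compute resources needed to run jobs fell by 21%, and memory resources fell by 49%. During the migration we developed a repeatable migration process, and we found and resolved differences between the two engines in syntax, UDFs, performance, and functionality.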
The Spark SQL Parsing Process and an Introduction to Antlr4
1. The Spark SQL parsing process: since Spark 2.0, Spark SQL has used Antlr 4 to parse SQL expressions
A Spark Data Loss Issue Caused by Decimal Arithmetic
The eBay Hadoop Team shares a data quality issue and its solution.
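From Spark Streaming to Apache Flink: The Evolution of Real-Time Data Streams at iQIYI
How should a real-time data platform be selected? How is it put into production? How does it perform? These are questions of broad interest across the industry. This article describes how Apache Flink has been used in production at iQIYI, and the evolution from Spark Streaming to Apache Flink.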
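Implementing Second-Degree Relationship Recommendations on Weibo with Spark GraphX
A second-degree relationship is a relationship between users that is discovered by using a followed account as the bridge between them. See how Weibo uses second-degree relationships to recommend potential users.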
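The definition in this entry can be illustrated with a small sketch in plain Python. It is not Weibo's GraphX implementation. It assumes that a second-degree relationship means "an account followed by an account the user follows", and that candidates are ranked by how many bridge accounts lead to them. The `second_degree_candidates` function and the sample follow graph are hypothetical.

```python
from collections import Counter


def second_degree_candidates(follows, user):
    """Rank accounts followed by the accounts that `user` follows."""
    first_degree = follows.get(user, set())
    counts = Counter()
    for bridge in first_degree:
        for candidate in follows.get(bridge, set()):
            # Skip the user and accounts the user already follows.
            if candidate != user and candidate not in first_degree:
                counts[candidate] += 1
    # Sort by number of bridges (descending), then by name for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical follow graph: each key follows the accounts in its set.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"dave", "erin"},
    "carol": {"dave", "alice"},
}
print(second_degree_candidates(follows, "alice"))
# → [('dave', 2), ('erin', 1)]
```

Here `dave` is reachable from `alice` through two bridges (`bob` and `carol`), so it ranks above `erin`, which is reachable through one.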
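Spark Performance Optimization Guide: Basics
To use Spark well, you have to tune its performance properly so that its strengths can be fully realized. This article covers the basic part of the Spark performance optimization approaches the author has accumulated in day-to-day work, including development tuning and resource tuning.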