Middleware & Databases: Spark

How Uber Uses Ray® to Optimize the Rides Business

Computational efficiency is a significant challenge when scaling solutions to a marketplace as large and as complex as Uber. The running and tuning of the Uber rides business relies on substantial…

Resource Management with Apache YuniKorn™ for Apache Spark™ on AWS EKS at Pinterest

Monarch, Pinterest’s batch processing platform, was initially designed to support Pinterest’s ever-growing number of Apache Spark and MapReduce workloads at scale. When Monarch was conceived in 2016, the dominant batch processing technology for building such a platform was Apache Hadoop YARN. Now, eight years later, we have decided to move off Apache Hadoop and onto our next-generation Kubernetes (K8s) based platform.

Pinot for Low-Latency Offline Table Analytics

Learn how Uber uses Apache Pinot for serving over 100 low-latency offline analytics use cases.

Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack

Slack Data Engineering recently migrated its data workloads from EMR 5 to EMR 6, with Spark 3 as the processing engine. The migration aimed to improve performance, strengthen security, and reduce costs. Challenges included supporting the same Hive catalog, provisioning different EMR clusters, controlling costs, and supporting different versions of job libraries; the team addressed these with tools such as the Hive Schema Tool, Bazel, and the Airflow Spark operator. The migration let them leverage the benefits of Spark 3 and improve their data processing capabilities. Post-migration, they validated an exact data match between tables, using Trino and an in-house Python framework for detailed analysis, and they continuously monitored pipeline runtimes, making adjustments as needed.

Spark in Practice for Anti-Spam Clustering

Zhihu has recently begun using clustering to discover and mine spam users. The goal of clustering is to group similar content and behavior together. Common clustering methods include k-means, hierarchical clustering, and density- and graph-based approaches. Measuring similarity is one of the keys to clustering; common similarity measures include edit distance, cosine similarity, Jaccard similarity, and the Pearson correlation coefficient. This work used Jaccard and sim-hash, with sim-hash suited to scenarios with large data volumes. Computing a sim-hash involves hashing each term, weighting, merging, dimensionality reduction, and similarity comparison, with similarity measured by Hamming distance.
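The sim-hash pipeline described above (per-term hashing, weighting, merging, dimensionality reduction, and Hamming comparison) can be sketched in plain Python. This is an illustrative sketch, not Zhihu's implementation: the MD5-based token hash and the 64-bit fingerprint width are assumptions.

```python
import hashlib


def token_hash(token, bits=64):
    """Stable fixed-width hash of a token, via truncated MD5 (illustrative choice)."""
    return int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)


def simhash(weighted_tokens, bits=64):
    """Compute a sim-hash fingerprint from a {token: weight} mapping."""
    v = [0] * bits
    for token, weight in weighted_tokens.items():
        h = token_hash(token, bits)
        for i in range(bits):
            # Merge step: add the weight where bit i is set, subtract where it is not.
            v[i] += weight if (h >> i) & 1 else -weight
    # Dimensionality reduction: keep a 1 wherever the accumulated sum is positive.
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Documents with mostly overlapping weighted terms end up with fingerprints a few bits apart, so near-duplicate detection reduces to a Hamming-distance threshold over 64-bit integers.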

Zhihu's Practice of Optimizing Spark Shuffle with Celeborn

Zhihu runs a large volume of jobs on its Hadoop and Spark clusters, shuffling more than 3 PB per day, with the largest single job shuffling close to 100 TB. For stability, Zhihu uses ESS (External Shuffle Service) as Spark's shuffle service. ESS has limitations, however: heavy random IO creates disk IOPS bottlenecks that degrade job performance and stability, and nodes under high IO load frequently cause unstable job runtimes and failures. The way to address these problems is to reduce the number and size of shuffle read blocks.
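Celeborn reduces random IO by aggregating shuffle data on dedicated workers, but the same symptom can also be eased from the Spark side: adaptive query execution can coalesce small post-shuffle partitions so reducers issue fewer, larger reads. A generic configuration sketch using standard Spark 3 settings (not Zhihu's Celeborn setup; the advisory size is an illustrative value):

```properties
# Enable adaptive query execution (Spark 3+).
spark.sql.adaptive.enabled                        true
# Merge small post-shuffle partitions into larger ones,
# reducing the number of shuffle read blocks per reducer.
spark.sql.adaptive.coalescePartitions.enabled     true
# Approximate target size for each coalesced partition.
spark.sql.adaptive.advisoryPartitionSizeInBytes   128m
```

Coalescing helps with block count, but it cannot fix the underlying many-small-files layout of ESS, which is what a remote shuffle service like Celeborn is designed to replace.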

Accelerating the Spark Compute Engine with Native Techniques

This article describes how switching Spark to columnar execution and rewriting the logic in C++ can improve the performance of the Spark compute engine. It discusses in detail the effort required to rewrite the Spark SQL core, as well as the closed-source C++ SQL engine Databricks has already built. It also proposes an alternative that reduces engineering cost: adopt a high-performance open-source engine and adapt it into a SQL core that meets the requirements. Finally, the article presents a diagram of using ClickHouse as the engine behind Spark SQL; by modifying the Spark engine and leveraging ClickHouse's strengths, performance can be improved significantly.

Vipshop's Path to Upgrading to Spark 3.0

This article walks through the challenges we encountered and the thinking behind our Spark upgrade, in the hope that it offers useful insights.

Vectorized Spark Execution in Meituan's Production Environment

Apache Spark is an excellent compute engine, widely used in data engineering, machine learning, and other fields. Vectorized execution delivers both resource savings and faster job execution without upgrading hardware.

Building Ctrip's Data Infrastructure Platform 2.0: Evolution under a Multi-Datacenter Architecture

Tiered storage, priority-based scheduling, smooth upgrades.

A Large-Scale Network Embedding Algorithm on Spark and Its Application in Tencent Games

Tencent Games' social algorithms team developed a distributed network embedding algorithm for processing large-scale graph data. They propose an approach based on recursive graph partitioning: the graph is split into multiple subgraphs, a network embedding algorithm runs on each subgraph, and the subgraph embeddings are then fused. The algorithm has been deployed across multiple business scenarios in more than five games, delivering significant gains.

Large-Scale Recommender System Feature Engineering on Spark

Feature engineering plays a pivotal role in recommender systems, and the efficiency of large-scale feature processing has a major impact on a recommender system's online performance. To tackle large-scale feature engineering, 4Paradigm built FESQL, a next-generation feature extraction engine with offline-online consistency.

Making BIGO's Big Data Compute Engine Native: The Apache Spark Chapter

The application of Gluten, and the challenges and development work of native ETL batch processing.

Apache Spark in Practice at iQIYI

During the architecture upgrade of iQIYI's big data platform, the Spark service was overhauled, dramatically improving efficiency and saving tens of millions of CNY.

Accelerating iQIYI's Big Data: From Hive to Spark SQL

Moving from Hive to Spark SQL delivered a 67% speedup, helping iQIYI's big data business improve efficiency and grow revenue.

Spark Analysers: Catching Anti-Patterns In Spark Apps

Apache Spark™ is a widely used open source distributed computing engine. It is one of the main components of Uber’s data stack.

Spark is the primary batch compute engine at Uber. Like any other framework, Spark comes with its own set of tradeoffs.

Copyright © 2011-2026 iteam. Current version is 2.155.0.