Middleware and Databases: Spark Resources

Next Gen Data Processing at Massive Scale At Pinterest With Moka (Part 1 of 2)

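Pinterest's data engineering team built Moka, a next-generation big data processing platform on Kubernetes, to replace its aging Hadoop system. Moka uses a Spark on EKS architecture that integrates Spark Operator, the YuniKorn scheduler, and the Celeborn remote shuffle service, and it supports ARM/Graviton instances and containerized deployment. The platform integrates with existing workflows through the Archer job submission system and adds an automated validation process to keep the migration stable. So far 70% of batch jobs have been migrated with a 5% performance improvement, and resource isolation and queue management have improved cost efficiency.

Pinterest Tech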

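Zuoyebang's Practice of Fully Replacing Hive with Spark

Zuoyebang replaced Hive as its compute engine with Spark SQL to address Hive's limitations in resource utilization and stability. Through tool-assisted migration and optimization, Spark task coverage reached 80% and resource usage dropped by 54%. The optimizations cover memory control, concurrent submission, result set return, vectorized reads, and JVM GC tuning. Together they significantly improved performance and stability and laid the groundwork for future technical evolution.

Zuoyebang Tech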

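Powered by the Fusion Engine: How Liulishuo Uses Alibaba Cloud Serverless Spark to Accelerate Data Warehouse Computing

By adopting Alibaba Cloud EMR Serverless Spark, Liulishuo resolved pain points in its previous architecture around elastic resource management, cost, performance, operations, monitoring, and scaling. The new solution uses the Fusion engine to accelerate task execution, which improves efficiency and lowers cost, and it adopts pay-as-you-go billing, significantly improving task stability and resource utilization. Going forward, Liulishuo plans to work with Alibaba Cloud to further optimize solutions for lakehouse scenarios.

Liulishuo Tech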

How Uber Migrated from Hive to Spark SQL for ETL Workloads

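Uber migrated from Hive to Spark SQL to improve compute efficiency and performance. Its Automated Migration Service (AMS) ran parallel shadow tests and data validation for Hive queries to ensure data consistency and performance optimization. The migration overcame challenges such as syntax differences and running safely, and Uber developed a Query Translation Service (QTS) to support the conversion of complex queries. In the end, the migration cut runtime and resource usage by 50%.

Uber Tech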

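Spark on K8s: Co-location in Practice on vivo's Big Data Platform

vivo used the Spark Operator approach to containerize offline Spark tasks on co-located (mixed-deployment) clusters, optimizing K8s resource scheduling and the task submission flow. An elastic scheduling system dynamically manages resource watermarks and distributes tasks across multiple clusters, significantly raising CPU utilization, which reaches 30% at peak. vivo also plans to cover more task types and refine its scheduling strategies to further improve co-location gains and resource fill efficiency.

vivo Tech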

Building a Spark observability product with StarRocks: Real-time and historical performance analysis

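Iris, Grab's Spark observability tool, adopted the StarRocks database to address the challenges of managing real-time and historical data. The new architecture simplifies the data flow, supports complex queries and real-time monitoring, and improves query performance and user experience. With direct ingestion from Kafka, materialized views, and dynamic partitioning optimizations, Iris stores and analyzes data efficiently, gives Spark jobs stronger monitoring and debugging capabilities, and improves resource management and decision-making efficiency.

Grab Tech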

How Uber Uses Ray® to Optimize the Rides Business

Computational efficiency is a significant challenge when scaling solutions to a marketplace as large and as complex as Uber. The running and tuning of the Uber rides business relies on substantial…

Uber Tech

Resource Management with Apache YuniKorn™ for Apache Spark™ on AWS EKS at Pinterest

Monarch, Pinterest’s Batch Processing Platform, was initially designed to support Pinterest’s ever-growing number of Apache Spark and MapReduce workloads at scale. During Monarch’s inception in 2016, the most dominant batch processing technology around to build the platform was Apache Hadoop YARN. Now, eight years later, we have made the decision to move off of Apache Hadoop and onto our next generation Kubernetes (K8s) based platform.

Pinterest Tech

Pinot for Low-Latency Offline Table Analytics

Learn how Uber uses Apache Pinot for serving over 100 low-latency offline analytics use cases.

Uber Tech

Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack

Slack Data Engineering recently migrated their data workload from EMR 5 to EMR 6, using Spark 3 as the processing engine. The migration aimed to improve performance, enhance security, and achieve cost savings. They faced challenges related to supporting the same Hive catalog, provisioning different EMR clusters, controlling costs, and supporting different versions of job libraries. They used various tools and techniques like the Hive Schema Tool, Bazel, and the Airflow Spark operator to address these challenges. The migration allowed them to leverage the benefits of Spark 3 and improve their data processing capabilities. They also performed post-migration data validation to ensure an exact data match between the tables and made use of Trino and their in-house Python framework for detailed analysis. They continuously monitored the runtime of their pipelines and made necessary adjustments.

Slack Tech

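Spark in Practice for Anti-Cheating Clustering

Zhihu recently began using clustering to discover and mine spam users. The goal of clustering is to group similar content and behavior together. Common clustering methods include k-means, hierarchical clustering, and density-based and graph-based approaches. The similarity measure is one of the keys to clustering; common choices include edit distance, cosine similarity, Jaccard similarity, and the Pearson correlation coefficient. This effort used the Jaccard and sim-hash algorithms, with sim-hash suited to larger data volumes. Computing a sim-hash involves hashing each word, weighting, merging, dimensionality reduction, and similarity comparison. The comparison uses Hamming distance as its measure.

Zhihu Tech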
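
The sim-hash steps named in the Zhihu summary above (hash each word, weight, merge, reduce, compare by Hamming distance) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Zhihu's implementation: the 64-bit fingerprint width, the use of MD5 as the word hash, term frequency as the weight, and the function names `simhash`, `hamming_distance`, and `jaccard` are all choices made here for the example.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width (assumption; not stated in the source)


def simhash(tokens):
    """Return a 64-bit sim-hash fingerprint for a list of word tokens."""
    # Weighting: term frequency stands in for the word weight (assumption).
    weights = Counter(tokens)
    acc = [0] * BITS
    for token, weight in weights.items():
        # Hashing: take the first 8 bytes of MD5 as a 64-bit word hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Merging: add the weight where the bit is 1, subtract where it is 0.
        for i in range(BITS):
            if (h >> i) & 1:
                acc[i] += weight
            else:
                acc[i] -= weight
    # Dimensionality reduction: keep only the sign of each accumulated position.
    fingerprint = 0
    for i in range(BITS):
        if acc[i] > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)
```

Two texts would then be treated as near-duplicates when `hamming_distance(simhash(a), simhash(b))` falls below a threshold chosen for the data set; the source does not state which threshold or tokenizer was used.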

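Zhihu's Practice of Optimizing Spark Shuffle with Celeborn

Accelerating the Spark Compute Engine with Native Technology

Vipshop's Road to Upgrading to Spark 3.0

Spark Vectorized Computing in Practice in Meituan's Production Environment

Building Ctrip's Data Infrastructure Platform 2.0: Evolution Under a Multi-Data-Center Architecture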