Middleware and Databases: Resources on Spark

Zuoyebang's Practice of Fully Replacing Hive with Spark

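Zuoyebang replaced Hive with Spark SQL as its compute engine to address Hive's limitations in resource utilization and stability. Through tool-assisted migration and optimization, Spark now covers 80% of tasks and resource usage has dropped by 54%. The optimizations include memory control, concurrent submission, result set return, vectorized reads, and JVM GC tuning, which together markedly improved performance and stability and laid the groundwork for future technical evolution.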

Zuoyebang Tech

Powered by the Fusion Engine: How Liulishuo Uses Alibaba Cloud Serverless Spark to Accelerate Data Warehouse Computing

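By adopting Alibaba Cloud EMR Serverless Spark, Liulishuo resolved the pain points of its previous architecture in elastic resource management, cost, performance, operations, monitoring, and scaling. The new solution uses the Fusion engine to speed up task execution, improves efficiency, lowers cost, and bills on a pay-as-you-go basis, markedly improving task stability and resource utilization. Going forward, Liulishuo plans to work with Alibaba Cloud to further optimize its solution for lakehouse scenarios.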

Liulishuo Tech

How Uber Migrated from Hive to Spark SQL for ETL Workloads

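Uber migrated from Hive to Spark SQL to improve compute efficiency and performance. Its Automated Migration Service (AMS) runs parallel shadow tests of Hive queries and validates the resulting data, ensuring data consistency and optimized performance. The migration overcame challenges such as syntax differences and running safely in production, and Uber built a Query Translation Service (QTS) to support the conversion of complex queries. In the end, the migration cut both runtime and resource usage by 50%.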

Uber Tech
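The summary above mentions validating data after shadow runs but does not say how AMS does it. As a minimal sketch of one common approach, and not Uber's actual method, the outputs of the two engines can be compared by row count plus an order-independent checksum. The function names `table_fingerprint` and `outputs_match` are hypothetical, and rows are modeled as plain Python tuples rather than real Hive or Spark tables.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    Hypothetical helper: the checksum is the sum of per-row hashes modulo
    2**64, so it does not depend on row order. A sum is used instead of
    XOR so that duplicated rows do not cancel each other out.
    """
    count = 0
    checksum = 0
    for row in rows:
        # Encode each row deterministically; None is rendered as "NULL",
        # so a literal "NULL" string is indistinguishable from a null here.
        encoded = "\x1f".join("NULL" if v is None else str(v) for v in row)
        digest = hashlib.sha256(encoded.encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return count, checksum


def outputs_match(hive_rows, spark_rows):
    """True when both outputs have the same row count and checksum."""
    return table_fingerprint(hive_rows) == table_fingerprint(spark_rows)


if __name__ == "__main__":
    hive_out = [(1, "a", None), (2, "b", 3.5)]
    spark_out = [(2, "b", 3.5), (1, "a", None)]  # same rows, different order
    print(outputs_match(hive_out, spark_out))
```

A fingerprint like this only signals that two outputs differ; locating the differing rows would need a follow-up join or diff on the primary key.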

Spark on K8s: Co-location in Practice on vivo's Big Data Platform

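Using the Spark Operator approach, vivo containerized its offline Spark tasks to run on co-located clusters and optimized K8s resource scheduling and the task submission flow. An elastic scheduling system dynamically manages resource watermarks and distributes tasks sensibly across multiple clusters, markedly raising CPU utilization, which reaches 30% at peak periods. vivo also plans to cover more task types and refine its scheduling strategy to further improve co-location gains and resource fill efficiency.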

vivo Tech

Building a Spark observability product with StarRocks: Real-time and historical performance analysis

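Iris, Grab's Spark observability tool, adopted the StarRocks database to address the challenges of managing real-time and historical data. The new architecture simplifies the data flow, supports complex queries and real-time monitoring, and improves query performance and user experience. With direct ingestion from Kafka, materialized views, and dynamic partitioning, Iris stores and analyzes data efficiently, gives Spark jobs stronger monitoring and debugging capabilities, and improves resource management and decision-making.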

Grab Tech

Resource Management with Apache YuniKorn™ for Apache Spark™ on AWS EKS at Pinterest

Monarch, Pinterest’s Batch Processing Platform, was initially designed to support Pinterest’s ever-growing number of Apache Spark and MapReduce workloads at scale. During Monarch’s inception in 2016, the most dominant batch processing technology around to build the platform was Apache Hadoop YARN. Now, eight years later, we have made the decision to move off of Apache Hadoop and onto our next generation Kubernetes (K8s) based platform.

Pinterest Tech

Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack

Slack Data Engineering recently migrated their data workload from EMR 5 to EMR 6, using Spark 3 as the processing engine. The migration aimed to improve performance, enhance security, and achieve cost savings. They faced challenges related to supporting the same Hive catalog, provisioning different EMR clusters, controlling costs, and supporting different versions of job libraries. They used various tools and techniques like the Hive Schema Tool, Bazel, and the Airflow Spark operator to address these challenges. The migration allowed them to leverage the benefits of Spark 3 and improve their data processing capabilities. They also performed post-migration data validation to ensure an exact data match between the tables and made use of Trino and their in-house Python framework for detailed analysis. They continuously monitored the runtime of their pipelines and made necessary adjustments.

Slack Tech

Spark in Practice for Anti-Cheating Clustering

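Zhihu recently began using clustering to discover and mine spam users. The goal of clustering is to group similar content and behavior together. Common clustering methods include k-means, hierarchical clustering, and density-based and graph-based approaches. The similarity measure is one of the keys to clustering; commonly used measures include edit distance, cosine similarity, Jaccard similarity, and the Pearson correlation coefficient. This work used Jaccard similarity and the sim-hash algorithm, with sim-hash suited to larger data volumes. Computing a sim-hash involves hashing each word, weighting, merging, dimensionality reduction, and similarity comparison, where the comparison uses Hamming distance.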

Zhihu Tech
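The sim-hash steps listed in the summary above (hash each word, weight, merge, reduce, compare by Hamming distance) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not Zhihu's implementation: the weight is assumed to be term frequency, the word hash is assumed to be the first 64 bits of MD5, and all function names are hypothetical. A Jaccard helper is included for comparison.

```python
import hashlib
from collections import Counter


def word_hash(word, bits=64):
    """Hash a word to a `bits`-bit integer (assumption: truncated MD5)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(tokens, bits=64):
    """Compute a sim-hash fingerprint for a list of tokens."""
    weights = Counter(tokens)  # assumption: weight = term frequency
    acc = [0] * bits
    for word, weight in weights.items():
        h = word_hash(word, bits)
        # Weighting and merging: add the weight where the bit is 1,
        # subtract it where the bit is 0.
        for i in range(bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    # Dimensionality reduction: keep only the sign of each position.
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc1 = "buy cheap followers now click here".split()
    doc2 = "buy cheap followers today click here".split()
    print(hamming_distance(simhash(doc1), simhash(doc2)))
    print(jaccard(doc1, doc2))
```

Two texts are typically treated as near-duplicates when the Hamming distance between their fingerprints falls below a chosen threshold; the threshold is a tuning decision that the summary does not specify.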

Zhihu's Practice of Optimizing Spark Shuffle with Celeborn

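Zhihu runs a large volume of jobs on Hadoop and Spark clusters. Daily shuffle volume exceeds 3 PB, and the shuffle for a single job can approach 100 TB. To ensure stability, Zhihu used ESS (External Shuffle Service) as Spark's shuffle service. ESS has limitations, however: its heavy random I/O creates a disk IOPS bottleneck that degrades job performance and stability. Zhihu frequently saw nodes under high I/O load cause unstable job durations and failures. The remedy is to reduce the number and size of Shuffle Read Blocks.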

Zhihu Tech

Accelerating the Spark Compute Engine with Native Technology

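This article describes how to improve the performance of the Spark compute engine by switching Spark to columnar execution and rewriting the logic in C++. It discusses in detail the effort required to rewrite the Spark SQL kernel, as well as the closed-source C++ SQL kernel that Databricks has already built. It also proposes an alternative that reduces engineering cost: pick a high-performance open-source engine and adapt it into a SQL kernel that meets the requirements. Finally, the article presents a diagram of ClickHouse serving as the Spark SQL kernel. Adapting the Spark engine to draw on ClickHouse's strengths can significantly improve performance.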

Baidu Tech