How Uber Scaled Data Replication to Move Petabytes Every Day

Uber depends on a reliable data lake that is distributed across on-premise and cloud environments. This multi-region setup makes reliable, timely data access difficult: network bandwidth is limited, and data must remain seamlessly available, particularly for disaster recovery. Uber replicates data with its Hive Sync service, which is built on Apache Hadoop® Distcp (Distributed Copy). However, once Uber's Data Lake exceeded 350 PB, Distcp's limitations became apparent. This blog explores the optimizations made to Distcp to improve its performance and meet Uber's growing data replication and disaster recovery needs across its distributed infrastructure.

Distcp is an open-source framework for copying large datasets between different locations in a distributed manner. It uses Hadoop’s MapReduce framework to parallelize and distribute the copy tasks across multiple nodes, allowing for faster and more scalable data transfers, particularly in large-scale environments.

Figure 1: High-level Distcp architecture.

The Distcp architecture comprises several key components:

  • Distcp Tool: Identifies files, groups them into blocks (Copy Listing), defines distribution across mappers, and submits the configured Hadoop job to YARN.
  • Hadoop Client: Sets up the job environment, determines which mappers handle specific blocks (Input Splitting), and submits the job to YARN.
  • RM (Resource Manager): The YARN component that schedules tasks, rece...
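
To make the Copy Listing and Input Splitting steps described above more concrete, here is a minimal Python sketch. It is not Distcp's actual code: the names `FileEntry`, `build_copy_listing`, and `split_for_mappers` are hypothetical, and the splitting rule (contiguous groups of roughly equal total bytes) is an assumption chosen for illustration only.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """One entry in the copy listing (hypothetical structure)."""
    path: str
    size_bytes: int


def build_copy_listing(files: dict[str, int]) -> list[FileEntry]:
    """Copy Listing: enumerate source files and sizes in a stable order."""
    return [FileEntry(path, size) for path, size in sorted(files.items())]


def split_for_mappers(listing: list[FileEntry], num_mappers: int) -> list[list[FileEntry]]:
    """Input Splitting: cut the listing into at most `num_mappers`
    contiguous groups of roughly equal total bytes (assumed rule)."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    total_bytes = sum(entry.size_bytes for entry in listing)
    target = max(1, math.ceil(total_bytes / num_mappers))

    splits: list[list[FileEntry]] = []
    current: list[FileEntry] = []
    current_bytes = 0
    for entry in listing:
        # Start a new split once the current one would exceed the target,
        # unless we are already filling the last allowed split.
        would_overflow = current_bytes + entry.size_bytes > target
        if current and would_overflow and len(splits) < num_mappers - 1:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(entry)
        current_bytes += entry.size_bytes
    if current:
        splits.append(current)
    return splits


if __name__ == "__main__":
    listing = build_copy_listing({"/a": 100, "/b": 100, "/c": 100, "/d": 100})
    for i, split in enumerate(split_for_mappers(listing, num_mappers=2)):
        print(i, [entry.path for entry in split])
```

In this sketch each inner list stands for the work handed to one mapper. A single very large file still lands in one split, so the groups can be uneven when file sizes are skewed.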