Balancing HDFS DataNodes in the Uber DataLake

Apache Hadoop® Distributed File System (HDFS) is a distributed file system designed to store large files across multiple machines reliably and with fault tolerance. It is part of the Apache Hadoop framework and one of the main components of Uber’s data stack.

Uber has one of the largest HDFS deployments in the world, with exabytes of data across tens of clusters. It is important, but also challenging, to keep scaling our data infrastructure while balancing efficiency, service reliability, and high performance.

Figure 1: HDFS Infrastructure at Uber.

The HDFS balancer is a key component for keeping DataNodes healthy: it redistributes data evenly across the cluster. As node decommissioning in our HDFS clusters becomes more and more intensive, the balancer has to balance data more effectively to prevent DataNode skew. The decommissioning requirements come from projects such as zone decommissioning, automatic cluster turnover for security patches, and DataNode colocation.

However, the balancer that ships with open source HDFS did not meet this requirement out of the box. We have seen individual DataNodes become skewed (i.e., storing more data than other nodes in the same cluster), which has multiple side effects:

  • High I/O bandwidth usage on the host that holds too much data
  • Highly utilized nodes are more likely to be slow and carry a higher risk of node failure and data loss
  • The cluster has fewer active, healthy nodes to serve write traffic for customers
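
To make the notion of skew concrete, here is a minimal Python sketch that flags DataNodes whose utilization exceeds the overall cluster utilization by more than a threshold, in the spirit of the balancer's threshold setting. The node names, capacities, and the `find_skewed_nodes` helper are hypothetical and chosen only for illustration; this is not Uber's implementation or the balancer's actual code.

```python
def find_skewed_nodes(nodes, threshold_pct=10.0):
    """Return the names of nodes that are over-utilized relative to the cluster.

    nodes: dict mapping node name -> (used_bytes, capacity_bytes).
    A node counts as skewed when its utilization (in percent) exceeds the
    cluster-wide utilization by more than threshold_pct percentage points.
    """
    total_used = sum(used for used, _ in nodes.values())
    total_capacity = sum(capacity for _, capacity in nodes.values())
    cluster_pct = 100.0 * total_used / total_capacity

    skewed = []
    for name, (used, capacity) in nodes.items():
        node_pct = 100.0 * used / capacity
        if node_pct - cluster_pct > threshold_pct:
            skewed.append(name)
    return sorted(skewed)


# Hypothetical cluster: (used, capacity) in arbitrary but consistent units.
cluster = {
    "dn-1": (62, 100),
    "dn-2": (58, 100),
    "dn-3": (95, 100),
    "dn-4": (61, 100),
}

print(find_skewed_nodes(cluster))  # cluster utilization is 69%, so only dn-3 is flagged
```

Comparing each node against the cluster-wide ratio rather than a fixed cutoff means the check adapts as the cluster fills up; a node at 95% is only an outlier if its peers sit well below it.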

Below is an example of unbalanced da...
