Uber has one of the largest Hadoop® Distributed File System (HDFS) deployments in the world, with exabytes of data across tens of clusters. It is important, but also challenging, to keep scaling our data infrastructure with the balance between efficiency, service reliability, and high performance. As a cost efficiency improvement effort that will save us tens of millions dollars every year, we aim to adopt higher density HDD (16+TB) SKUs to replace existing SKUs with 4TB HDDs that are still used by the majority of our HDFS clusters.
One of the biggest challenges when fully adopting high-density disk SKU comes from the disk IO bandwidth. While the capacity of each HDD increases by 2x to 4x, the I/O bandwidth of each HDD does not increase accordingly. This may cause IO throttling when DataNodes serve read/write requests. This can be seen from the chart below, which shows the trend of slow read packet read count from one DataNode. Given the persistent and sizable number of slow read occurrences, it is important to find new approaches to prevent performance degradation.