Data Mesh: How Uber Laid the Foundations for Its Data Lake Cloud Migration
Uber’s batch data platform is used by over 10,000 active internal users, ranging from data scientists, city operations, and business analysts to engineers. It hosts around 1.5 exabytes of Apache Hadoop® Distributed File System (HDFS) storage across two on-prem regions, serving over 500,000 Presto queries and over 370,000 Apache Spark™ apps daily.
In this blog, we delve into the details of how Uber laid the foundations for the batch data cloud migration by incorporating key data mesh principles.
Cloud providers have various limits on storage and IAM policies that pose challenges while migrating batch data to the cloud. Our major considerations while planning the cloud migration were:
- Optimal Data Mapping: Map HDFS files and directories to storage buckets in a Goldilocks manner: not so much data in so few buckets that we hit per-project or per-bucket quotas, and not data spread across so many buckets that we incur the overhead of maintaining them.
- Access Control: Place access controls at the appropriate level of the storage hierarchy without running into hard limits from the cloud providers, while also not overly elevating privileges for existing users.
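The data-mapping consideration above can be framed as a packing problem. The following is a minimal sketch, not Uber's actual mapping logic: it greedily packs Hive databases into buckets using first-fit decreasing, under a hypothetical per-bucket size quota and a hypothetical cap on databases per bucket (both constants are assumptions for illustration).

```python
# Hypothetical limits for illustration; real cloud quotas differ by provider.
PER_BUCKET_QUOTA_TB = 100   # assumed per-bucket storage quota
MAX_DBS_PER_BUCKET = 50     # assumed cap to bound per-bucket blast radius

def map_dbs_to_buckets(db_sizes_tb):
    """Greedily pack Hive DBs into buckets (first-fit decreasing) so that
    no bucket exceeds the quota, while keeping the bucket count low.
    A single DB larger than the quota would need to be split upstream."""
    buckets = []  # each bucket: {"used": total TB, "dbs": [db names]}
    for db, size in sorted(db_sizes_tb.items(), key=lambda kv: -kv[1]):
        for b in buckets:
            if (b["used"] + size <= PER_BUCKET_QUOTA_TB
                    and len(b["dbs"]) < MAX_DBS_PER_BUCKET):
                b["used"] += size
                b["dbs"].append(db)
                break
        else:
            # No existing bucket fits; open a new one.
            buckets.append({"used": size, "dbs": [db]})
    return buckets
```

First-fit decreasing is a simple heuristic that tends to keep the bucket count close to optimal for this kind of Goldilocks constraint; a production mapper would also have to account for growth rates and ownership boundaries.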
Additionally, we also saw the cloud migration as an opportunity to make improvements to our data lake:
- Security group consolidation: Address the proliferation of overlapping security groups by consolidating users into fewer groups such that per-user access stays the same.
- Decentralized data ownership: Cleanly map Hive DBs and tables ...
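The security-group consolidation goal above, keeping every user's effective access unchanged while shrinking the group count, can be sketched as follows. This is an illustrative approach, not Uber's implementation: it computes each user's effective access (the union of their groups' resources) and then creates one consolidated group per distinct access set.

```python
from collections import defaultdict

def consolidate_groups(user_groups, group_resources):
    """Collapse overlapping security groups.

    user_groups:     user -> list of group names
    group_resources: group name -> set of resources it grants

    Returns a mapping from each distinct effective-access set to the
    users who should share one consolidated group for it, so that
    per-user access is exactly preserved."""
    # Effective access per user: union of all their groups' grants.
    effective = {
        user: frozenset().union(*(group_resources[g] for g in groups))
        for user, groups in user_groups.items()
    }
    # One consolidated group per unique access set.
    consolidated = defaultdict(set)
    for user, access in effective.items():
        consolidated[access].add(user)
    return dict(consolidated)
```

Users whose group memberships were different but granted identical access end up in the same consolidated group, which is exactly the overlap this bullet aims to eliminate.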