Middleware & Databases: Apache Hadoop
Resource Management with Apache YuniKorn™ for Apache Spark™ on AWS EKS at Pinterest
Monarch, Pinterest’s Batch Processing Platform, was initially designed to support Pinterest’s ever-growing number of Apache Spark and MapReduce workloads at scale. When Monarch was conceived in 2016, the dominant batch processing technology available for building the platform was Apache Hadoop YARN. Now, eight years later, we have decided to move off Apache Hadoop and onto our next-generation Kubernetes (K8s) based platform.
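As background on what such a move looks like in practice, below is a minimal, hedged sketch (assuming Spark 3.3+ and a YuniKorn deployment that honors its queue annotation) of routing a Spark-on-Kubernetes application to the YuniKorn scheduler via standard Spark configuration keys. The queue name and annotation values are illustrative assumptions, not Pinterest's actual settings.

```java
import org.apache.spark.SparkConf;

// Minimal sketch of steering a Spark-on-Kubernetes job to the YuniKorn scheduler.
// The queue name "root.batch" is a hypothetical example.
public class YuniKornSchedulingSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("yunikorn-scheduling-sketch")
            // Have Kubernetes schedule driver and executor pods with YuniKorn
            // instead of the default kube-scheduler (key available in Spark 3.3+).
            .set("spark.kubernetes.scheduler.name", "yunikorn")
            // Annotate the pods with a YuniKorn queue so the job falls under
            // that queue's resource quotas and fair-sharing policy.
            .set("spark.kubernetes.driver.annotation.yunikorn.apache.org/queue", "root.batch")
            .set("spark.kubernetes.executor.annotation.yunikorn.apache.org/queue", "root.batch");
        System.out.println(conf.toDebugString());
    }
}
```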
DataMesh: How Uber laid the foundations for the data lake cloud migration
Learn how Uber is streamlining the Cloud migration of its massive Data Lake by incorporating key Data Mesh principles.
Enabling Security for Hadoop Data Lake on Google Cloud Storage
Uber's data lake is migrating to the cloud! Learn how they're tackling security challenges and scaling the system to handle massive amounts of data while ensuring a seamless transition for users.
Modernizing Uber’s Batch Data Infrastructure with Google Cloud Platform
Uber plans to migrate its batch data analytics and machine learning training stack to Google Cloud Platform (GCP). They will use the HiveSync and Hudi libraries to keep the data lakes in the two regions in sync, replicating data from the on-premises data lake to the cloud data lake and its corresponding Hive Metastore. After the migration, they will provision new IaaS on GCP for the YARN and Presto clusters and route traffic to the cloud stack through the existing data access proxies. Along the way they may face challenges around performance, cost management, non-analytics/ML applications that use HDFS, and unknown issues, which they plan to address by improving open-source connectors, leveraging cloud elasticity, migrating other file storage use cases, and proactively resolving problems as they arise.
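The summary mentions using Hudi to keep the on-premises and cloud data lakes in sync. As a rough, hedged illustration of what one Hudi-based replication step can look like (not Uber's actual pipeline), the sketch below upserts an incremental batch into a Hudi table at a cloud path; the table name, key columns, and paths are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Hedged sketch: replicate an incremental batch into a cloud-side Hudi table.
// Table name, record key, precombine field, and paths are all hypothetical.
public class HudiReplicationSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("hudi-replication-sketch")
            // Hudi requires Spark's Kryo serializer.
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .getOrCreate();

        // Read the latest increment from the on-premises lake (hypothetical path).
        Dataset<Row> increment = spark.read()
            .format("parquet")
            .load("hdfs://onprem-lake/trips/increment/");

        // Upsert into the cloud-side Hudi table; Hudi deduplicates on the record
        // key and keeps the row with the highest precombine value.
        increment.write()
            .format("hudi")
            .option("hoodie.table.name", "trips")
            .option("hoodie.datasource.write.recordkey.field", "trip_id")
            .option("hoodie.datasource.write.precombine.field", "event_ts")
            .option("hoodie.datasource.write.operation", "upsert")
            .mode(SaveMode.Append)
            .save("gs://cloud-lake/trips/");

        spark.stop();
    }
}
```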
Building a Native Hadoop Cluster in an ARM Environment
Deploying from the open-source Hadoop release keeps you close to the thriving Hadoop ecosystem and makes it easier to upgrade or tune the whole cluster to your own needs. This article walks through building a big data Hadoop cluster from the open-source Hadoop release.
ZTO's Journey Moving Hadoop off CDH
ZTO Express, founded on May 8, 2002, is an integrated logistics company with express delivery at its core, spanning cross-border logistics, freight, commerce, cloud warehousing, aviation, cold chain, finance, intelligent services, Tuxi community life services, Zhongkuai digital marketing, and other business lines. In 2021, ZTO handled 22.3 billion parcels, up 31.1% year over year. Its network comprises more than 30,400 service outlets, 99 transfer hubs, over 5,700 direct network partners, 10,900 owned line-haul vehicles (more than 9,000 of them high-capacity drop-and-pull trucks), and roughly 3,700 line-haul routes, reaching over 99% of districts and counties, with township coverage above 93%. ZTO Tech's big data center supports the company's business: it now spans two IDCs, with a Hadoop cluster of over a thousand nodes, 18 PB+ of storage, and 20,000+ daily active production jobs, and it is still growing rapidly.
The following diagram shows the lifecycle of a parcel, summed up by five stages: pickup, dispatch, arrival, delivery, and sign-off (收发到派签). First, the customer contacts a courier online or offline, fills in the sender information, and hands parcel A over. After weighing, labeling, scanning, and packaging, the courier takes parcel A to the origin outlet; this stage is called pickup. The origin outlet then bags, packs, and loads parcel A and dispatches it to the first transfer hub, from which it is transported to the final transfer hub. Once parcel A arrives at the final transfer hub, it is forwarded to the destination outlet according to the parsed three-segment routing code, and the destination outlet unpacks and sorts it; this covers the dispatch and arrival stages. After sorting, a courier delivers parcel A to the recipient, who signs for it. Every business step in parcel A's lifecycle produces large volumes of data, which we can use to trace its trajectory, analyze its transit times, and analyze returns and rerouting. All of this, of course, presupposes a stable big data platform with efficient compute and massive storage.
vivo's Experience Upgrading a 10,000-Node HDFS Cluster to HDFS 3.x
A practical case study of a rolling upgrade from a CDH cluster to an HDP cluster.
Common OOM Problems in Hadoop Jobs and How to Fix Them
This article focuses on how to handle java.lang.OutOfMemoryError when running jobs on the MapReduce framework.
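For concreteness, a common first-line mitigation is to raise the YARN container size for map and reduce tasks while keeping the JVM heap (-Xmx) at roughly 80% of the container so non-heap overhead still fits. A hedged sketch with illustrative (not tuned) values:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch of the usual first step against java.lang.OutOfMemoryError in
// MapReduce tasks: raise the container memory and keep the JVM heap at
// roughly 80% of it. The values below are illustrative, not tuned.
public class OomTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Container sizes requested from YARN, in MB.
        conf.set("mapreduce.map.memory.mb", "4096");
        conf.set("mapreduce.reduce.memory.mb", "8192");
        // JVM heap limits for the task processes, kept below the container size
        // so that off-heap and native overhead does not trigger container kills.
        conf.set("mapreduce.map.java.opts", "-Xmx3276m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx6553m");
        Job job = Job.getInstance(conf, "oom-tuning-sketch");
        // ... set mapper/reducer classes and input/output paths as usual ...
    }
}
```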
Efficiently Managing the Supply and Demand on Uber’s Big Data Platform
With Uber’s business growth and the fast adoption of big data and AI, Big Data scaled to become our most costly infrastructure platform. To reduce operational expenses, we developed a holistic framework with 3 pillars: platform efficiency, supply, and demand (using supply to describe the hardware resources that are made available to run big data storage and compute workload, and demand to describe those workloads). In this post, we will share our work on managing supply and demand. For more details about the context of the larger initiative and improvements in platform efficiency, please refer to our earlier posts: Challenges and Opportunities to Dramatically Reduce the Cost of Uber’s Big Data, and Cost-Efficient Open Source Big Data Platform at Uber.
Cost-Efficient Open Source Big Data Platform at Uber
As Uber’s business has expanded, the underlying pool of data that powers it has grown exponentially, and has thus become ever more expensive to process. When Big Data rose to become one of our largest…
Uber’s Finance Computation Platform
For a company of our size and scale, robust, accurate, and compliant accounting and analytics are a necessity, ensuring granular visibility into our financials across multiple lines of business.
Most standard, off-the-shelf finance engineering solutions cannot support the scale and scope of the transactions on our ever-growing platform. The ride-sharing business alone has over 4 billion trips per year worldwide, which translates to more than 40 billion journal entries (financial microtransactions). Each of these entries has to be produced in accordance with Generally Accepted Accounting Principles (GAAP), and managed in an idempotent, consistent, accurate, and reproducible manner.
To meet these specific requirements, we built Uber's in-house Finance Computation Platform (FCP), a solution designed to accommodate our scale while providing strong guarantees on accuracy and explainability. The same solution also provides insights into business operations.
There were many challenges in building our financial computation platform, from our architectural choices to the types of controls for accuracy and explainability.
Containerizing Apache Hadoop Infrastructure at Uber
In 2019, we started a journey to re-architect the Hadoop deployment stack. Fast forward two years: over 60% of Hadoop now runs in Docker containers, bringing major operational benefits to the team. As a result of this initiative, the team handed off many of its responsibilities to other infrastructure teams and was able to focus more on core Hadoop development.
This article provides a summary of problems we faced, and how we solved them along the way.