Too Big to Query: How to Query HBase with Minimal Pain
In today’s world of media consumption and engagement, we on the Vimeo Analytics team had to find ways to scale and handle the massive growth in analytical data that we experienced during the COVID era.
Video analytics at Vimeo has been supported for the most part by an HBase cluster of more than 100 machines with Apache Phoenix on top. However, as our daily data growth exploded at the start of the pandemic, the cluster suffered growing pains: scaling horizontally or vertically was no longer enough to support additional use cases and the increasing demand. Therefore, we developed a way to query Phoenix/HBase data with Apache Spark from HBase snapshots, with minimal impact on the cluster.
From United Nations Covid-19 Response — by Sanket Deshmukh
What problem are we trying to solve?
As you might imagine, collecting, storing, processing, and providing video user analytics such as views, plays, watch time, engagement performance, social analytics, and other, much more complex metrics, all at scale, isn’t easy.
Our legacy HBase cluster, with Apache Phoenix on top, stores hundreds of terabytes of video metrics and statistics and handles over 10 billion writes and 100 billion queries/requests a day. These numbers keep going up daily.
As a result, our problem consists of the following attributes:
- Scale. There is a point where scaling up horizontally or vertically hits saturation, where adding more resources just doesn’t make sense economically or pragmatically anymore.
- Performance...