Processing billions of events in real time at Twitter
By Lu Zhang and Chukwudiuto Malife
Friday, 22 October 2021
At Twitter, we process approximately 400 billion events in real time and generate petabyte (PB) scale data every day. We consume data from a variety of event sources, which are produced on different platforms and storage systems, such as Hadoop, Vertica, Manhattan distributed databases, Kafka, Twitter Eventbus, GCS, BigQuery, and PubSub.
To process these types of data across these sources and platforms, the Twitter Data Platform team has built internal tools such as Scalding for batch processing, Heron for streaming, an integrated framework called TimeSeries AggregatoR (TSAR) for both batch and real-time processing, and the Data Access Layer for data discovery and consumption. However, with data growing rapidly, this scale still challenges the data infrastructure that engineers use to run pipelines. For example, we have an interaction and engagement pipeline that processes high-scale data in both batch and real time. As our data scale grows, we face increasing demands to reduce streaming latency, provide higher accuracy in data processing, and serve data in real time.
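As a rough sketch of what a batch aggregation on this stack could look like, the following Scalding-style job counts interaction events per (tweetId, eventType). The job name, input path, and field layout are hypothetical, illustrative assumptions rather than Twitter's actual pipeline code:

```scala
import com.twitter.scalding._

// Hypothetical Scalding batch job: count interaction events per (tweetId, eventType).
// The input is assumed to be TSV rows of (tweetId, eventType, timestampMillis).
class InteractionCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TypedTsv[(Long, String, Long)](args("input")))
    .groupBy { case (tweetId, eventType, _) => (tweetId, eventType) }
    .size // number of events seen for each (tweetId, eventType) key
    .write(TypedTsv[((Long, String), Long)](args("output")))
}
```

A job like this would typically be launched through Scalding's standard job runner with `--input` and `--output` arguments; the streaming side of the pipeline runs as Heron topologies instead.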
For the interaction and engagement pipeline, we collect and process data from various real-time streams and from server and client logs, to extract Tweet and user interaction data with various levels of aggregation, time granularities, and other metric dimensions. That aggregated interaction data is particularly important and is the source of truth for Twitter’s ads revenue services and data product services to retrieve information on impres...