Rethinking Stream Processing: Data Exploration
In this digital age, companies collect large volumes of data that enable the tracking of business metrics and performance. Over the years, data analytics tools for data storage and processing have evolved from the days of Excel sheets and macros to more advanced MapReduce-model tools like Spark, Hadoop, and Hive. This evolution has allowed companies, including Grab, to perform modern analytics on the data ingested into the Data Lake, empowering them to make better data-driven business decisions. This form of data will be referred to in this document as “Offline Data”.
With innovations in stream processing technology like Spark and Flink, there is now more interest in unlocking value from streaming data. This form of continuously generated, high-volume data will be referred to in this document as “Online Data”. At Grab, streaming data is usually materialised as Kafka topics (“Kafka Stream”) as the result of stream processing in its framework. This data remains largely unexplored until it is eventually sunk into the Data Lake as Offline Data, as part of the data journey (see Figure 1 below). This introduces some data latency before data analysts can use the data to inform decisions.
Figure 1. Simplified data journey for Offline Data vs. Online Data, from data generation to data analysis.
As Figure 1 shows, the Time to Value (“TTV”) of Online Data is shorter than that of Offline Data in a simplified data journey from data generation to data analysis, one in which the complexities of data cleaning and transformation have been removed. This is because the role...