深入探讨Psyberg:无状态与有状态数据处理

Let’s use the signup fact table as an example here. This table’s workflow runs hourly, with the main input source being an Iceberg table storing all raw signup events partitioned by landing date, hour, and batch id.

让我们以注册事实表为例。该表的工作流每小时运行一次,主要输入源是一个Iceberg表,该表存储了按登陆日期、小时和批次ID分区的所有原始注册事件。

Here’s a YAML snippet outlining the configuration for this during the Psyberg initialization step:

以下是在Psyberg初始化步骤中配置此项的YAML片段:

- job:
id: psyberg_session_init
type: Spark
spark:
app_args:
- --process_name=signup_fact_load
- --src_tables=raw_signups
- --psyberg_session_id=20230914061001
- --psyberg_hwm_table=high_water_mark_table
- --psyberg_session_table=psyberg_session_metadata
- --etl_pattern_id=1

- job:
id: psyberg_session_init
type: Spark
spark:
app_args:
- --process_name=signup_fact_load
- --src_tables=raw_signups
- --psyberg_session_id=20230914061001
- --psyberg_hwm_table=high_water_mark_table
- --psyberg_session_table=psyberg_session_metadata
- --etl_pattern_id=1

Behind the scenes, Psyberg identifies that this pipeline is configured for a stateless pattern since etl_pattern_id=1.

在幕后,Psyberg识别出此流水线配置为无状态模式,因为etl_pattern_id=1

Psyberg also uses the provided inputs to detect the Iceberg snapshots that persisted after the latest high watermark available in the watermark table. Using the summary column in snapshot metadata [see the Iceberg Metadata section in post 1 for more details], we parse out the partition information for each Iceberg snapshot of the source table.

Psyberg还使用提供的输入来检测在水印表中可用的最新高水位标记之后持久化的Iceberg快照。使用快照元数据中的summary列 [有关更多详细信息,请参见第1篇文章中的Iceberg元数据部分],我们解析出源表的每个Iceberg快照的分区信息。

Psyberg then retains these processing URIs (an array of JSON strings containing combinations of landing date, hour, and batch IDs) as determined by the snapshot changes. This information and other calculated metadata are stored in the psyberg_session_f table. This stored data is then available for the subsequent LOAD.FACT_TABLE job in the workflow to utilize and for analysis and debugging purposes.

然后,Psyberg根据快照变化确定这些处理URI(包含登陆日期、小时和批次I...

开通本站会员,查看完整译文。

首页 - Wiki
Copyright © 2011-2024 iteam. Current version is 2.125.0. UTC+08:00, 2024-05-03 15:13
浙ICP备14020137号-1 $访客地图$