Metadata-Driven: Exploring and Building the Next-Generation Intelligent Data Architecture
1. Metadata-Driven: Exploring and Building the Next-Generation Intelligent Data Architecture
邵赛赛 (Saisai Shao)
Co-founder and CTO, Datastrato
2. About me
● Co-founder & CTO of Datastrato.
● The original creator of Apache Gravitino.
● The committer and PMC member of Apache Spark.
● Apache Member.
3.
4. The historical evolution of data architecture
Goal: the 3Vs - Volume, Velocity, and Variety.
5. Problems and challenges of existing data architectures
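6. Current state 1 – Pursuing the 3Vs of data architecture yields diminishing returns
Storage density and network bandwidth keep growing quickly, so building large-scale, high-density storage architectures has become simpler.
With the end of Moore's law, squeezing more performance out of CPUs is getting harder and harder.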
● https://www.researchgate.net/figure/Advanced-Storage-Technology-Consortium-ASTC-roadmap-for-the-future-hard-disk-drive_fig3_310594004
● https://www.lightwaveonline.com/directory/components/optical-switches/article/14301154/the-world-runs-on-ethernet-the-future-of-higher-speeds
● https://netfuture.ch/2024/10/50-years-of-microprocessor-trend-data/
7. Current state 2 – The complexity of using and managing the data stack keeps rising
Data engineers/analysts/scientists
Data steward
DataOps engineer
https://a16z.com/emerging-architectures-for-modern-data-infrastructure/
8. Current state 3 – Moving toward intelligence is a basic pattern in how things evolve
Take the automotive industry as an example
Speed car
?
Autonomous car
9. A next-generation data architecture built around AI
Data warehousing
Data lake
BI
ETL/ELT
Data analytics
Data pipeline
Data science
Data governance
Data observability
AI powered knowledge base - Data Brain
10. How to reshape our data architecture in the AGI era
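11. Using metadata to build the brain of the data architecture
Metadata is "data about data": structured information that describes, explains, or gives context to other data.
● Context understanding and discovery: metadata acts as a "data dictionary" that describes the content, structure, and internal relationships of datasets, helping users understand and discover data assets.
● Coordination and governance: metadata bridges different systems and teams and is the foundation for effective data coordination and governance. By recording administrative and technical details such as data source, format, and access permissions, it keeps data consistent and under control as it moves between systems.
● Decision support and automation: metadata gives intelligent automation its key guidance. It tells tools how to handle data correctly, which supports efficient decision-making and automated workflows.
● Organization and management: metadata also supplies classification, indexing, and permission information that helps organize and manage large volumes of data efficiently, keeping data reliable and available.
Combining metadata with large language models yields the brain of the data architecture.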
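One way to picture how metadata can feed a large language model is to render a metadata record as plain-text context for a prompt. The sketch below is only an illustration under assumptions: the `TableMeta` and `ColumnMeta` classes, their fields, and the sample table are hypothetical and do not come from the talk or from any particular catalog.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMeta:
    name: str
    dtype: str
    comment: str = ""


@dataclass
class TableMeta:
    """Hypothetical metadata record; the fields mirror the roles listed above."""
    name: str
    description: str   # context understanding and discovery
    owner: str         # coordination and governance
    source: str        # where the data comes from
    tags: list[str] = field(default_factory=list)          # organization and management
    columns: list[ColumnMeta] = field(default_factory=list)


def to_llm_context(table: TableMeta) -> str:
    """Render metadata as plain text that could be placed in an LLM prompt."""
    lines = [
        f"Table: {table.name}",
        f"Description: {table.description}",
        f"Owner: {table.owner}",
        f"Source: {table.source}",
        f"Tags: {', '.join(table.tags) if table.tags else '(none)'}",
        "Columns:",
    ]
    for col in table.columns:
        suffix = f" - {col.comment}" if col.comment else ""
        lines.append(f"  {col.name} ({col.dtype}){suffix}")
    return "\n".join(lines)


# Made-up sample record, for illustration only.
orders = TableMeta(
    name="sales.orders",
    description="One row per customer order",
    owner="data-platform",
    source="orders service change stream",
    tags=["pii", "finance"],
    columns=[
        ColumnMeta("order_id", "bigint", "primary key"),
        ColumnMeta("email", "string", "customer contact"),
    ],
)

print(to_llm_context(orders))
```

The rendered text would be prepended to a user question before it is sent to a model; how the model is called is deliberately left out.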
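12. Using MCP connectors to build the arms of the data stack
Architecture advantages:
● The data brain, built on metadata and large language models, makes the task decisions.
● Exposing each tool stack through MCP lets the data brain drive it.
[Diagram: the data brain, over structured data, semi-structured data, and unstructured data, drives three agents and their tools]
● Governance Agent, governance tools: audit, classification, discovery, lineage, …
● Analytic Agent, analysis tools
● Maintenance Agent, maintenance tools: maintenance, compaction, optimize, TTL, …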
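The split between a deciding brain and tool stacks behind a uniform interface can be sketched in a few lines. This is a toy under stated assumptions: it does not use the MCP SDK or the MCP wire protocol, the tool names are invented, and `data_brain_decide` replaces the LLM with fixed rules so the example stays runnable.

```python
from typing import Callable, Dict

# Registry of callable tools, keyed by name. This only imitates the idea of
# exposing tools behind one interface; it is not the MCP SDK.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str):
    """Decorator that registers a function as a named tool."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("governance.classify")
def classify(table: str) -> str:
    return f"classified {table}"


@tool("maintenance.compaction")
def compaction(table: str) -> str:
    return f"compacted {table}"


@tool("maintenance.ttl")
def ttl(table: str, days: int = 30) -> str:
    return f"expired rows older than {days} days in {table}"


def data_brain_decide(table_meta: dict) -> list:
    """Stand-in for the LLM decision step: fixed rules over metadata."""
    name = table_meta["name"]
    calls = []
    if not table_meta.get("tags"):
        calls.append(("governance.classify", {"table": name}))
    if table_meta.get("small_files", 0) > 100:
        calls.append(("maintenance.compaction", {"table": name}))
    if "retention_days" in table_meta:
        calls.append(("maintenance.ttl", {"table": name, "days": table_meta["retention_days"]}))
    return calls


def run(table_meta: dict) -> list:
    """Execute every tool call the brain decided on, in order."""
    return [TOOLS[name](**args) for name, args in data_brain_decide(table_meta)]
```

Calling `run({"name": "logs.events", "tags": [], "small_files": 250, "retention_days": 7})` would invoke all three tools in turn. The point of the design is that new tools only need to register themselves; the decision step never imports them directly.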
13. How to build an intelligent data architecture with Apache Gravitino
14. What is Apache Gravitino™
Apache Gravitino - Catalog of Catalogs (Metadata Lake)
[Diagram: a Metadata Lake built with Gravitino, connected to the following catalogs and the systems they serve]
● Hive Metastore (Hadoop Data Lake)
● Built-in Catalog (Data Warehouse)
● Schema Registry (Streaming Processing)
● Model Registry (Machine Learning)
Key features:
● SSOT
● AI + Data Catalog
● Geo-distributed Arch
● Catalog + Governance
● Security in One place
15. Building the data brain with Gravitino
[Diagram, top to bottom]
● Data brain: LLM, VectorDB, MCP Tools
● Gravitino: Unified REST APIs, Catalog service, Metadata Storage
● Object hierarchy: Metalake > Catalog > Schema > Table / Fileset / Model / Topic
● Connections to the underlying systems: Tabular, Files, Models, Message Queue
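The hierarchy on this slide (Metalake, Catalog, Schema, then Table, Fileset, Model, or Topic) can be modeled as nested namespaces. The following is a minimal in-memory sketch, not the Gravitino client or its REST API: the class names follow the slide, while the methods, the dotted naming, and the sample objects are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator

# Leaf object kinds named on the slide.
LEAF_KINDS = {"table", "fileset", "model", "topic"}


@dataclass
class Schema:
    name: str
    objects: Dict[str, str] = field(default_factory=dict)  # object name -> kind

    def add(self, name: str, kind: str) -> None:
        if kind not in LEAF_KINDS:
            raise ValueError(f"unknown kind: {kind}")
        self.objects[name] = kind


@dataclass
class Catalog:
    name: str
    schemas: Dict[str, Schema] = field(default_factory=dict)

    def schema(self, name: str) -> Schema:
        """Return the named schema, creating it on first use."""
        return self.schemas.setdefault(name, Schema(name))


@dataclass
class Metalake:
    name: str
    catalogs: Dict[str, Catalog] = field(default_factory=dict)

    def catalog(self, name: str) -> Catalog:
        """Return the named catalog, creating it on first use."""
        return self.catalogs.setdefault(name, Catalog(name))

    def walk(self) -> Iterator[tuple]:
        """Yield (fully qualified name, kind) for every leaf object."""
        for cat in self.catalogs.values():
            for sch in cat.schemas.values():
                for obj, kind in sch.objects.items():
                    yield f"{cat.name}.{sch.name}.{obj}", kind


# Hypothetical contents: one catalog per kind of underlying system.
lake = Metalake("demo")
lake.catalog("hive").schema("sales").add("orders", "table")
lake.catalog("files").schema("raw").add("images", "fileset")
lake.catalog("models").schema("ml").add("churn", "model")
lake.catalog("kafka").schema("default").add("clicks", "topic")

for fqn, kind in lake.walk():
    print(kind, fqn)
```

A single `walk()` over the metalake reaches tabular data, files, models, and message topics through one naming scheme, which is what lets a data brain treat them uniformly.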
16. Building the data executor with Gravitino
[Diagram]
● APIs: Unified REST APIs, Iceberg REST APIs
● Marked with *: Policy system, Job system, Action framework
● Catalog service: a Metalake whose Catalogs, Schemas, and leaf objects (Table, Model, Fileset, Topic) each carry attached Policies
● Policy system and Job system: Job, Job, Job, …
● Action framework: TTL Action, Compaction Action, Clustering Action, …
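The flow this slide suggests (a policy attached to a metadata object, an action chosen when the policy is violated, a job submitted to run that action) can be sketched as follows. Everything here is hypothetical: the `Policy` and `Job` shapes, the statistic names, and the thresholds are assumptions, and the job system only records jobs instead of running them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Policy:
    """Hypothetical policy attached to a metadata object."""
    target: str      # e.g. "catalog.schema.table"
    kind: str        # "ttl" or "compaction"
    threshold: int   # days for ttl, file count for compaction


@dataclass
class Job:
    action: str
    target: str
    params: dict


def evaluate(policy: Policy, stats: dict) -> Optional[Job]:
    """Compare a policy against statistics; return a job if it is violated."""
    if policy.kind == "ttl" and stats.get("oldest_partition_days", 0) > policy.threshold:
        return Job("TTL", policy.target, {"keep_days": policy.threshold})
    if policy.kind == "compaction" and stats.get("file_count", 0) > policy.threshold:
        return Job("Compaction", policy.target, {"max_files": policy.threshold})
    return None


class JobSystem:
    """Job system stand-in: records submitted jobs instead of running them."""

    def __init__(self) -> None:
        self.queue: List[Job] = []

    def submit(self, job: Job) -> int:
        self.queue.append(job)
        return len(self.queue)  # 1-based job id


def reconcile(policies: List[Policy], stats_by_target: dict, jobs: JobSystem) -> List[int]:
    """Evaluate every policy and submit one job per violation."""
    ids = []
    for policy in policies:
        job = evaluate(policy, stats_by_target.get(policy.target, {}))
        if job is not None:
            ids.append(jobs.submit(job))
    return ids
```

Keeping `evaluate` free of side effects means the same policies can be checked in a dry run before any job is submitted.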
17. Tying the whole architecture together
[Diagram]
● APIs: Unified REST APIs, Iceberg REST APIs
● Catalog service: a Metalake whose Catalogs, Schemas, and leaf objects (Table, Model, Topic, Fileset) carry Policies and metrics
● Systems: Policy system, Job system (Job, Job, Job, …), Statistics system
● Action Framework: TTL Action, Clustering Action, Compaction Action, …
● Agents, each with an Agent card: Meta Agent, Policy Agent, Action Agent, Job Agent
● Protocol labels: MCP, A2A
* Meta agent - data discovery and context understanding
* Policy agent - check and build policies on metadata
* Action agent - generate a job for the specific action
* Job agent - submit a job to the job system
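The four agent roles listed above can be read as a chain in which each agent enriches a shared task before handing it on. The sketch below is one possible reading, not the talk's implementation: the agent cards are reduced to a name and a skill (the real A2A agent card has more fields), the agents are plain functions rather than A2A services, and the metadata store and compaction rule are made up.

```python
from typing import Callable, Dict, List

# Simplified stand-ins for agent cards: a name plus a one-line skill.
# This is not the A2A agent card schema.
AGENT_CARDS: List[Dict[str, str]] = [
    {"name": "meta", "skill": "data discovery and context understanding"},
    {"name": "policy", "skill": "check and build policies on metadata"},
    {"name": "action", "skill": "generate a job for the specific action"},
    {"name": "job", "skill": "submit a job to the job system"},
]

# Tiny in-memory metadata store and job queue, both hypothetical.
METADATA = {"lake.db.events": {"format": "iceberg", "file_count": 1200}}
JOB_QUEUE: List[dict] = []


def meta_agent(task: dict) -> dict:
    """Look up context for the requested table."""
    task["context"] = METADATA.get(task["table"], {})
    return task


def policy_agent(task: dict) -> dict:
    """Example rule: tables with more than 1000 files should be compacted."""
    too_many = task["context"].get("file_count", 0) > 1000
    task["violations"] = ["compaction"] if too_many else []
    return task


def action_agent(task: dict) -> dict:
    """Turn each violation into a job description."""
    task["jobs"] = [{"action": v, "table": task["table"]} for v in task["violations"]]
    return task


def job_agent(task: dict) -> dict:
    """Hand the generated jobs to the (in-memory) job queue."""
    JOB_QUEUE.extend(task["jobs"])
    task["submitted"] = len(task["jobs"])
    return task


HANDLERS: Dict[str, Callable[[dict], dict]] = {
    "meta": meta_agent,
    "policy": policy_agent,
    "action": action_agent,
    "job": job_agent,
}


def run_pipeline(table: str) -> dict:
    """Pass one task through the agents in the order their cards are listed."""
    task = {"table": table}
    for card in AGENT_CARDS:
        task = HANDLERS[card["name"]](task)
    return task
```

In a real deployment the order would presumably be chosen by matching a request against the skills on the cards; here it is fixed to keep the example short.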
18. Demo
19. Building an intelligent data architecture with Gravitino
Building data pipelines using agents (Data engineering)
[Diagram: the data engineer's workflow is drawn twice; the labels Meta agent and Query agent appear alongside the steps]
● Investigate the requirements
● Explore the related data
● Build the pipeline
● Generate the DAG
● Submit and run
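The last two steps, generating a DAG and then submitting and running it, come down to ordering tasks by their dependencies. Here is a minimal sketch using `graphlib` from the Python standard library (3.9+); the task names are hypothetical and `runner` stands in for whatever scheduler actually executes a task.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join": {"extract_orders", "extract_customers"},
    "aggregate": {"join"},
    "publish": {"aggregate"},
}


def run_order(graph: dict) -> list:
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


def submit_and_run(graph: dict, runner) -> list:
    """Run tasks one by one in dependency order and collect their results."""
    return [runner(task) for task in run_order(graph)]


if __name__ == "__main__":
    for task in run_order(dag):
        print("run", task)
```

An agent that generates the DAG only has to emit the dependency mapping; checking for cycles and ordering the tasks is handled by the sorter.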
20. Demo
Data Engineering
first-demo.mp4
21. Building an intelligent data architecture with Gravitino
Data classification and labeling (Data governance)
[Diagram: the data steward's workflow is drawn twice; the labels Meta agent, Governance agent, and Query agent appear alongside the steps]
● Investigate the policies/laws
● Identify the metadata
● Apply the policies and tags
● Verify the data
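The identify, apply, and verify steps of this workflow can be illustrated with simple rule-based tagging over column names. This is a sketch under assumptions: the tag names and regular expressions are invented, and a real governance agent would presumably rely on richer metadata and a model rather than three patterns.

```python
import re
from typing import Dict, List

# Hypothetical rules: tag name -> pattern matched against column names.
RULES: Dict[str, re.Pattern] = {
    "pii.email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "pii.phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "pii.id": re.compile(r"ssn|passport|national_id", re.IGNORECASE),
}


def classify_columns(columns: List[str]) -> Dict[str, List[str]]:
    """Identify the metadata: return the tags each column name matches."""
    return {
        col: [tag for tag, pattern in RULES.items() if pattern.search(col)]
        for col in columns
    }


def apply_tags(store: Dict[str, List[str]], tags: Dict[str, List[str]]) -> None:
    """Apply the tags: merge new tags into an in-memory tag store."""
    for col, new_tags in tags.items():
        existing = store.setdefault(col, [])
        existing.extend(t for t in new_tags if t not in existing)


def verify(store: Dict[str, List[str]], columns: List[str]) -> List[str]:
    """Verify: list columns that still carry no tag, for a steward to review."""
    return [col for col in columns if not store.get(col)]
```

Columns returned by `verify` are the ones the rules could not classify, which is where a data steward's review would still be needed.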
22. Demo
Data governance
second-demo.mp4
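23. Summary
24. Summary and outlook
● The move toward intelligent data architecture is an irreversible trend.
● Metadata plays an irreplaceable role in building an intelligent data architecture.
● Build a data brain with large language models and let it drive data work, which greatly reduces the manual effort that data work requires.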
25.
26. THANKS
Large Language Model Is Redefining The Software