Metadata-Driven: Exploring and Practicing the Construction of a Next-Generation Intelligent Data Architecture

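1. Metadata-Driven: Exploring and Practicing the Construction of a Next-Generation Intelligent Data Architecture. Saisai Shao (邵赛赛), Co-founder and CTO of Datastrato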
2. About me
● Co-founder & CTO of Datastrato.
● The original creator of Apache Gravitino.
● Committer and PMC member of Apache Spark.
● Apache Member.
3.
4. The historical evolution of data architecture. Goal: the 3Vs (Volume, Velocity, and Variety).
5. Problems and challenges of current data architectures
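6. Current state 1: pursuing the 3Vs in data architecture yields diminishing returns
● Rapid growth in storage density and network bandwidth has made it simpler to build large-scale, high-density storage architectures.
● With the end of Moore's Law, squeezing more performance out of CPUs is becoming increasingly difficult.
Sources:
https://www.researchgate.net/figure/Advanced-Storage-Technology-Consortium-ASTC-roadmap-for-the-future-hard-disk-drive_fig3_310594004
https://www.lightwaveonline.com/directory/components/optical-switches/article/14301154/the-world-runs-on-ethernet-the-future-of-higher-speeds
https://netfuture.ch/2024/10/50-years-of-microprocessor-trend-data/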
7. Current state 2: the complexity of using and managing the data stack keeps rising
Roles shown: data engineers/analysts/scientists, data steward, DataOps engineer.
Source: https://a16z.com/emerging-architectures-for-modern-data-infrastructure/
8. Current state 3: moving toward intelligence is a basic law of how things develop
Taking the automotive industry as an example: Speed car → Autonomous car
9. A next-generation data architecture built around AI
Workloads: Data warehousing, Data lake, BI, ETL/ELT, Data analytics, Data pipeline, Data science, Data governance, Data observability.
Core: AI-powered knowledge base (Data Brain).
10. How to reshape our data architecture in the AGI era
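11. Using metadata to build the brain of the data architecture
Metadata is "data about data": structured information that describes, explains, or provides context for other data.
● Context understanding and discovery: metadata acts as a "data dictionary" that describes the content, structure, and internal relationships of datasets, helping users understand and discover data resources.
● Coordination and governance: metadata bridges different systems and teams and is the foundation of effective data coordination and governance. By recording administrative information and technical details such as data source, format, and access permissions, metadata keeps data consistent and controllable as it flows between systems.
● Decision support and automation: metadata provides key guidance for intelligent automation. It tells tools how to process data correctly, which supports efficient decision-making and automated workflows.
● Organization and management: in addition, metadata provides classification, indexing, and permission information that helps organize and manage large volumes of data efficiently, ensuring data reliability and availability.
Combining metadata with large language models yields the brain of the data architecture.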
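12. Using MCP connectors to build the arms of the data stack
Diagram labels:
● Agents: Governance Agent, Analytic Agent, Maintenance Agent.
● Tools: analytics tools; governance tools (audit, classification, discovery, lineage, …); maintenance tools (maintenance, compaction, optimize, TTL, …).
● Data brain over structured data, semi-structured data, and unstructured data.
Architectural advantages:
● The data brain, built on metadata and large language models, makes the task decisions.
● Exposing the different tool stacks through MCP lets the data brain drive them.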
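The division of labor on this slide (a brain that decides, tools that execute) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Tool`, `ToolRegistry`, and `decide` are hypothetical names, the registry is a plain in-process dictionary rather than a real MCP client or server, and `decide` uses fixed rules where the talk envisions an LLM reasoning over metadata.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool descriptor: a name, a description, and a handler."""
    name: str
    description: str
    handler: Callable[[str], str]


class ToolRegistry:
    """In-process stand-in for a set of tools exposed to the data brain."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, target: str) -> str:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name].handler(target)


def decide(metadata: dict) -> str:
    """Stand-in for the data brain: fixed rules instead of an LLM call."""
    if metadata.get("small_files", 0) > 1000:
        return "compaction"
    if not metadata.get("tags"):
        return "classification"
    return "audit"


registry = ToolRegistry()
registry.register(Tool("compaction", "merge small files", lambda t: f"compacted {t}"))
registry.register(Tool("classification", "tag a dataset", lambda t: f"classified {t}"))
registry.register(Tool("audit", "review access", lambda t: f"audited {t}"))

# Illustrative metadata for one table; the field names are assumptions.
table_metadata = {"name": "sales.orders", "small_files": 4200, "tags": ["pii"]}
chosen = decide(table_metadata)
result = registry.call(chosen, table_metadata["name"])
print(chosen, "->", result)
```

The point of the split is that new tools can be registered without touching the decision logic, and the decision logic can be swapped for a model without touching the tools.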
13. How to build an intelligent data architecture with Apache Gravitino
14. What is Apache Gravitino™
Apache Gravitino: Catalog of Catalogs (Metadata Lake)
Diagram labels: Metadata Lake built with Gravitino; Hive Metastore, Hadoop, Data Lake, Built-in Catalog, Data Warehouse, Schema Registry, Streaming Processing, Model Registry, Machine Learning.
Key features: SSOT; AI + Data Catalog; Geo-distributed Arch; Catalog + Governance; Security in One place.
15. Building the data brain with Gravitino
Diagram labels:
● Data brain: LLM, VectorDB, MCP Tools.
● Gravitino: Unified REST APIs; Catalog service; Metalake → Catalog → Schema → Table / Fileset / Model / Topic; Metadata Storage.
● Connections to the underlying systems: Tabular, Files, Models, Message Queue.
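The hierarchy in this diagram (Metalake, then Catalog, then Schema, then a Table, Fileset, Model, or Topic) can be modeled with a few dataclasses. The sketch below is a toy in-memory model, not the Gravitino client API: the class names mirror the slide, but the `add` and `find` methods and the dot-separated full names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Leaf object kinds named on the slide.
LEAF_KINDS = {"table", "fileset", "model", "topic"}


@dataclass
class Schema:
    name: str
    objects: Dict[str, str] = field(default_factory=dict)  # object name -> kind


@dataclass
class Catalog:
    name: str
    schemas: Dict[str, Schema] = field(default_factory=dict)


@dataclass
class Metalake:
    name: str
    catalogs: Dict[str, Catalog] = field(default_factory=dict)

    def add(self, catalog: str, schema: str, obj: str, kind: str) -> str:
        """Register a leaf object and return its (assumed) dotted full name."""
        if kind not in LEAF_KINDS:
            raise ValueError(f"unsupported kind: {kind}")
        cat = self.catalogs.setdefault(catalog, Catalog(catalog))
        sch = cat.schemas.setdefault(schema, Schema(schema))
        sch.objects[obj] = kind
        return f"{self.name}.{catalog}.{schema}.{obj}"

    def find(self, kind: str) -> List[str]:
        """List every object of one kind across all catalogs, sorted by name."""
        return sorted(
            f"{self.name}.{c.name}.{s.name}.{o}"
            for c in self.catalogs.values()
            for s in c.schemas.values()
            for o, k in s.objects.items()
            if k == kind
        )


lake = Metalake("demo_lake")
lake.add("hive", "sales", "orders", "table")
lake.add("hive", "sales", "customers", "table")
lake.add("kafka", "events", "clicks", "topic")
tables = lake.find("table")
print(tables)
```

A single lookup that spans tabular data, files, models, and topics is what lets an LLM-backed brain answer "what data do we have" without querying each system separately.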
16. Building the data executor with Gravitino
Diagram labels:
● Unified REST APIs; Iceberg REST APIs.
● Marked with *: Policy system, Job system, Action framework.
● Catalog service: Metalake, Catalog, Schema, Table, Model, Fileset, and Topic, each with attached Policies.
● Policy system; Job system (Job, Job, …).
● Action framework: TTL Action, Compaction Action, Clustering Action, …
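One way to read this slide is that a policy attached to a metadata object is turned into a job by an action, and the job system runs it. The following sketch shows that flow under stated assumptions: `Policy`, `Job`, `JobSystem`, `ACTIONS`, and `enforce` are hypothetical names, the parameter keys are invented for the example, and jobs are only queued as descriptions, not executed. It does not reflect Gravitino's actual policy or job APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Policy:
    kind: str                      # e.g. "ttl" or "compaction"
    target: str                    # name of the metadata object it is attached to
    params: Dict[str, int] = field(default_factory=dict)


@dataclass
class Job:
    name: str
    description: str


def ttl_action(policy: Policy) -> Job:
    """Turn a TTL policy into a job description (illustrative only)."""
    days = policy.params["retention_days"]
    return Job(f"ttl-{policy.target}",
               f"expire data older than {days} days in {policy.target}")


def compaction_action(policy: Policy) -> Job:
    """Turn a compaction policy into a job description (illustrative only)."""
    size = policy.params["target_file_mb"]
    return Job(f"compaction-{policy.target}",
               f"rewrite {policy.target} into ~{size} MB files")


# Action framework: maps a policy kind to the action that builds its job.
ACTIONS: Dict[str, Callable[[Policy], Job]] = {
    "ttl": ttl_action,
    "compaction": compaction_action,
}


class JobSystem:
    """Toy job system that only queues jobs and hands back an index as id."""

    def __init__(self) -> None:
        self.queue: List[Job] = []

    def submit(self, job: Job) -> int:
        self.queue.append(job)
        return len(self.queue) - 1


def enforce(policies: List[Policy], jobs: JobSystem) -> List[int]:
    """Submit one job per policy that has a registered action; skip the rest."""
    ids = []
    for policy in policies:
        action = ACTIONS.get(policy.kind)
        if action is None:
            continue  # no action registered for this policy kind
        ids.append(jobs.submit(action(policy)))
    return ids


policies = [
    Policy("ttl", "lake.hive.sales.orders", {"retention_days": 30}),
    Policy("compaction", "lake.hive.sales.orders", {"target_file_mb": 128}),
    Policy("masking", "lake.hive.sales.customers"),
]
job_system = JobSystem()
job_ids = enforce(policies, job_system)
print(job_ids, [j.name for j in job_system.queue])
```

Keeping policies declarative and actions pluggable means a new maintenance task needs only a new entry in the action table.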
17. Connecting the whole architecture
Diagram labels:
● Unified REST APIs; Iceberg REST APIs.
● Catalog service: Metalake, Catalog, Schema, Table, Model, Topic, and Fileset, with attached Policies and Policy metrics.
● Policy system; Job system (Job, Job, …); Statistics system.
● Action Framework: TTL Action, Clustering Action, Compaction Action, …
● Agents, each with an Agent card and linked by MCP and A2A: Meta Agent, Policy Agent, Action Agent, Job Agent.
Agent roles:
* Meta agent: data discovery and context understanding
* Policy agent: check and build policies on metadata
* Action agent: generate a job for the specific action
* Job agent: submit a job to the job system
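The four agent roles listed on this slide form a chain, which the sketch below walks through with plain functions. Everything here is a simplified assumption: `AgentCard` is a hypothetical record loosely inspired by the idea of an agent advertising a skill (it is not the A2A wire format), the `CATALOG` dictionary stands in for real metadata, the 90-day default is invented, and each agent is a deterministic function instead of an LLM-backed service.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]

# Toy metadata store: table name -> current TTL setting (None means no policy).
CATALOG: Dict[str, Dict[str, Any]] = {
    "sales.orders": {"ttl_days": None},
    "sales.customers": {"ttl_days": 365},
    "hr.payroll": {"ttl_days": None},
}


@dataclass(frozen=True)
class AgentCard:
    """Hypothetical agent card: who the agent is and which skill it offers."""
    name: str
    skill: str
    run: Callable[[Context], Context]


def meta_agent(ctx: Context) -> Context:
    # Data discovery: find tables whose name contains the keyword.
    return {**ctx, "tables": [t for t in CATALOG if ctx["keyword"] in t]}


def policy_agent(ctx: Context) -> Context:
    # Check policies: propose a TTL (assumed 90 days) where none exists.
    missing = [t for t in ctx["tables"] if CATALOG[t]["ttl_days"] is None]
    return {**ctx, "policies": [{"table": t, "ttl_days": 90} for t in missing]}


def action_agent(ctx: Context) -> Context:
    # Generate one job spec per proposed policy.
    jobs = [f"ttl:{p['table']}:{p['ttl_days']}d" for p in ctx["policies"]]
    return {**ctx, "jobs": jobs}


def job_agent(ctx: Context) -> Context:
    # Submit: here this only records what would be sent to the job system.
    return {**ctx, "submitted": list(ctx["jobs"])}


CARDS = [
    AgentCard("Meta Agent", "discover", meta_agent),
    AgentCard("Policy Agent", "policy", policy_agent),
    AgentCard("Action Agent", "action", action_agent),
    AgentCard("Job Agent", "submit", job_agent),
]


def run_chain(cards: List[AgentCard], skills: List[str], ctx: Context) -> Context:
    """Look up each required skill by card and pass the context along."""
    directory = {card.skill: card for card in cards}
    for skill in skills:
        ctx = directory[skill].run(ctx)
    return ctx


final = run_chain(CARDS, ["discover", "policy", "action", "submit"], {"keyword": "sales"})
print(final["submitted"])
```

Looking agents up by advertised skill, not by hard-coded reference, is the part that mirrors the agent cards in the diagram.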
18. Demo
19. Building an intelligent data architecture with Gravitino: building data pipelines using agents (Data engineering)
Data engineer workflow (drawn twice in the diagram): Investigate the requirements → Explore the related data → Build the pipeline → Generate the DAG → Submit and run.
Agents shown: Meta agent, Query agent.
20. Demo: Data engineering (first-demo.mp4)
21. Building an intelligent data architecture with Gravitino: data classification and labeling (Data governance)
Data steward workflow (drawn twice in the diagram): Investigate the policies/laws → Identify the metadata → Apply the policies and tags → Verify the data.
Agents shown: Meta agent, Governance agent, Query agent.
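The "identify the metadata, apply tags, verify" steps of this workflow can be illustrated with a small rule-based classifier. This is a deliberately simple stand-in: the tag names and regular expressions in `RULES` are invented for the example, and the talk describes agents doing this work, not fixed patterns.

```python
import re
from typing import Dict, List

# Hypothetical classification rules: tag -> pattern matched against column names.
RULES = {
    "pii": re.compile(r"(email|phone|ssn|name)", re.IGNORECASE),
    "financial": re.compile(r"(salary|amount|price)", re.IGNORECASE),
}


def classify_columns(columns: List[str]) -> Dict[str, List[str]]:
    """Return the sorted list of matching tags for each column."""
    return {
        col: sorted(tag for tag, pattern in RULES.items() if pattern.search(col))
        for col in columns
    }


def needs_review(tags: Dict[str, List[str]]) -> List[str]:
    """Verification step: columns with no tag are left for the data steward."""
    return [col for col, col_tags in tags.items() if not col_tags]


tags = classify_columns(["user_email", "order_amount", "created_at"])
untagged = needs_review(tags)
print(tags, untagged)
```

Name-based rules miss columns with opaque names, which is one reason to pair them with sampling or a model that reads the surrounding metadata.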
22. Demo: Data governance (second-demo.mp4)
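23. Summary
24. Summary and outlook
● The move toward intelligent data architecture is an irreversible trend.
● Metadata plays an irreplaceable role in building an intelligent data architecture.
● Building a data brain with large language models and letting it drive data work greatly reduces the manual effort that data work demands.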
25.
26. THANKS. Large Language Models Are Redefining Software
