LLM-powered data classification for data entities at scale
At Grab, we deal with petabyte-scale data and manage countless data entities, ranging from database tables to Kafka message schemas. Understanding the data inside these entities is crucial: it streamlines data access management, which safeguards the data of our users, drivers and merchant-partners, and it improves data discovery so that data analysts and scientists can easily find what they need.
The Caspian team (Data Engineering team) collaborated closely with the Data Governance team to automate the generation of governance-related metadata. We started with Personally Identifiable Information (PII) detection and built an orchestration service on top of a third-party classification service. The advent of Large Language Models (LLMs) opened up new possibilities for metadata generation and sensitive data identification at Grab. This prompted the project, which aimed to integrate LLM classification into our existing service. In this blog, we share how a tedious and painstaking process became a highly efficient system, and how it has empowered teams across the organisation.
For ease of reference, here’s a list of terms we’ve used and their definitions:
- Data Entity: An entity representing a schema that contains rows/streams of data, for example, database tables, stream messages, data lake tables.
- Prediction: The model's output for a given data entity, before any manual verification.
- Data Classification: The process of classifying a given data entity, which in the context of this blog, involves generating tags that r...