Metasense V2: Enhancing, improving and productionising LLM-powered data governance
In the initial article, LLM Powered Data Classification, we described how we integrated Large Language Models (LLMs) to automate the generation of governance-related metadata. The LLM integration enabled us to resolve challenges in Gemini, such as restrictions on the customisation of machine learning classifiers and limited resources for training a customised model. Gemini is a metadata generation service built internally to automate the tag generation process using a third-party data classification service. We also focused on LLM-powered column-level tag classification. The classified tags, combined with Grab’s data privacy rules, allowed us to determine the sensitivity tiers of data entities. The affordability of the model also enabled us to scale it to cover more data entities in the company. The initial model scanned more than 20,000 data entities, at an average of 300-400 entities per day. Despite its strong performance, we were aware that there was room for improvement in the areas of data classification and prompt evaluation.
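To make the step from column-level tags to a sensitivity tier concrete, here is a minimal sketch. It assumes each tag maps to a tier and that an entity takes the highest tier found among its columns. The tag names, tier names and the rule table are hypothetical placeholders, not Grab's actual privacy rules or the Metasense implementation.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical sensitivity tiers; a higher value means more sensitive."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical privacy rules: which tier each column-level tag implies.
TAG_TO_TIER = {
    "NONE": Tier.LOW,
    "GEO_LOCATION": Tier.MEDIUM,
    "CONTACT_INFO": Tier.HIGH,
    "NATIONAL_ID": Tier.HIGH,
}


def classify_entity(column_tags: dict[str, str]) -> Tier:
    """Return the highest tier implied by any column's tag.

    Unknown tags fall back to HIGH so that an unrecognised tag is never
    under-classified (an assumption made for this sketch).
    """
    tiers = [TAG_TO_TIER.get(tag, Tier.HIGH) for tag in column_tags.values()]
    # An entity with no columns defaults to the lowest tier.
    return max(tiers, default=Tier.LOW)


# Usage: column name -> tag produced by the classifier.
table = {"booking_id": "NONE", "pickup_point": "GEO_LOCATION", "phone": "CONTACT_INFO"}
tier = classify_entity(table)
```

Taking the maximum over columns means a single sensitive column is enough to raise the tier of the whole entity, which is a conservative choice for this kind of rule.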
Improving the model post-rollout
Since its launch in early 2024, our model has gradually grown to cover the entire data lake. To date, the vast majority of our data lake tables have undergone analysis and classification by our model. This has significantly reduced the workload for Grabbers. Instead of manually classifying all new or existing tables, Grabbers can now rely on our model to assign the appropriate classification tier accurately.
Despite table classification being automated, the data pipeline still requires owners to manu...