DataK9: Auto-categorizing an exabyte of data at the field level through AI/ML
Data categorization, the process of classifying data based on its characteristics and essence, is a foundational pillar of any privacy or security program. Fine-grained categorization is pivotal to implementing privacy and security controls, such as access policies and encryption, and to managing the lifecycle of data assets, including retention and deletion. This blog post describes Uber’s approach to data categorization at scale using various AI/ML techniques.
- Sheer Scale and Cost: Many companies manage extensive datasets distributed across various storage systems, and new datasets are generated regularly, which compounds the scale. Central to our challenge is tagging a large number of columns at the field level, each of which may require multiple tags. Manual tagging demands significant time and resources, and tagging newly created datasets requires ongoing investment.
- Engagement of Data Owners: Identifying and engaging data owners is difficult in practice. With multiple tags to evaluate for each data element, discerning the nuances among tag definitions becomes intricate and time-consuming, which leads to miscategorization.
Given the monumental engineering cost and the impracticality of manual categorization through decentralized efforts, our experience has led us to prioritize auto-categorization. In response, we introduced a solution named DataK9, an automatic categorization platform for Uber data. Its primary objective is to minimize, and ideally eliminate, user involvement, thereby addressing the challenges of scale, cost, and data owner engagement.