Wisdom of Unstructured Data: Building Airbnb’s Listing Knowledge from Big Text Data
How Airbnb leverages ML/NLP to extract useful information about listings from unstructured text data to power personalized experiences for guests.
By: Hongwei Li and Peng Wang
Introduction
At Airbnb, it’s important for us to gather structured data about listings and to understand that data well, so we can help Hosts provide great experiences for guests. For example, guests who work remotely need to know if a listing has a suitable workspace and reliable internet, while guests with children might need items like highchairs and cribs. However, not all listings clearly display these attributes, which creates a mismatch between what Hosts’ listings offer and what guests are looking for.
This is just one of many examples of how we can use the unstructured data generated on our platform to extract useful structured data. That unstructured data includes text from various text-based guest interactions with the platform, which has undergone anonymization steps. Instead of relying on Hosts to manually input every potential listing attribute, which would be tedious given the vast number of attributes guests care and inquire about, we developed a machine learning system called Listing Attribute Extraction Platform (LAEP) to extract structured data at scale. Note that the project was originally named LATEX (Listing ATtribute EXtraction), and it is cited in our previous tech blog. We have since renamed the project to LAEP.
LAEP automatically extracts structured information, such as listing attributes, directly from the unstructur...