PinPoint: A Neural Inductive Attribute Extractor for Web Pages

Jinfeng Zhuang | Software Engineer, ATG Applied Science

Despite the explosive growth of the internet over the past couple of decades, much of its digitized knowledge has been curated for human understanding and remains unfriendly to machine comprehension. Even promising efforts towards creating a semantic web, such as Resource Description Framework in Attributes (RDFa), Web Ontology Language (OWL), JSON-LD, and the Open Graph Protocol, are still in their infancy and fall short for commercial applications due to data sparsity and high variance in data quality across websites. Hence Web Information Extraction (WIE), colloquially known as scraping, is the dominant knowledge acquisition strategy for many organizations in advertising, commerce, search, travel, and other domains. Pinterest uses this approach to bring high-level information (like price and product description) from saved websites to the Pin level. This gives Pinners more information, along with a link back to the original website for further details, and ultimately helps them take action.

Formally, Web Attribute Extraction (a sub-branch of WIE) is defined as the task of extracting values for domain-specific attributes such as Title:Marissa Henley Dress, Price:128.00, ItemID:№58632704, etc. for an entity (a product in this case) from a webpage, as illustrated in Figure 1. This is a challenging task for simple rule-based extraction systems (such as XPath or regular expressions) because:

  • Each website has a unique layout, making it impossible to configure a universal rule
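
To make the layout problem concrete, here is a minimal sketch of a rule-based extractor. The two page snippets, the `RULES_FOR_A` mapping, and the `extract` helper are all hypothetical and invented for illustration; they are not taken from Pinterest's system. The sketch uses the limited XPath subset in Python's standard `xml.etree.ElementTree`, so it assumes well-formed markup, which real web pages often are not.

```python
import xml.etree.ElementTree as ET

# Two hypothetical product pages describing the same item with different layouts.
PAGE_A = """<html><body>
  <div class="product">
    <h1 class="title">Marissa Henley Dress</h1>
    <span class="price">128.00</span>
  </div>
</body></html>"""

PAGE_B = """<html><body>
  <section id="item">
    <h2>Marissa Henley Dress</h2>
    <p>Price: <b>128.00</b></p>
  </section>
</body></html>"""

# XPath rules hand-written against PAGE_A's layout only (illustrative assumption).
RULES_FOR_A = {
    "Title": ".//h1[@class='title']",
    "Price": ".//span[@class='price']",
}


def extract(page, rules):
    """Apply one XPath rule per attribute; return None when a rule finds nothing."""
    root = ET.fromstring(page)
    result = {}
    for attribute, path in rules.items():
        node = root.find(path)
        has_text = node is not None and node.text
        result[attribute] = node.text.strip() if has_text else None
    return result


print(extract(PAGE_A, RULES_FOR_A))  # rules match the layout they were written for
print(extract(PAGE_B, RULES_FOR_A))  # same rules find nothing on the other layout
```

The rules recover both attributes from the page they were written for and return `None` for every attribute on the second page, even though it carries the same information. Supporting the second site would require a separate set of rules.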