[KDD 2020] FreeDOM: A Transferable Neural Architecture for Structured Information Extraction on Web Documents


Dec 16, 2020
Details
Extracting structured data from HTML documents is a long-studied problem with a broad range of applications like augmenting knowledge bases, supporting faceted search, and providing domain-specific experiences for key verticals like shopping and movies. Previous approaches have either required a small number of examples for each target site or relied on carefully handcrafted heuristics built over visual renderings of websites. In this paper, we present a novel two-stage neural approach, named FreeDOM, which overcomes both these limitations. The first stage learns a representation for each DOM node in the page by combining both the text and markup information. The second stage captures longer range distance and semantic relatedness using a relational neural network. By combining these stages, FreeDOM is able to generalize to unseen sites after training on a small number of seed sites from that vertical without requiring expensive hand-crafted features over visual renderings of the page. Through experiments on a public dataset with 8 different verticals, we show that FreeDOM beats the previous state of the art by nearly 3.7 F1 points on average without requiring features over rendered pages or expensive hand-crafted features.
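
The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the learned neural encoders are replaced here by simple hand-written feature sets so the data flow is visible, and every name (`DomNode`, `node_features`, `xpath_distance`, `pair_features`) is hypothetical. Stage 1 combines a node's text and markup into one representation; stage 2 looks at pairs of nodes and how far apart they sit in the DOM tree.

```python
from dataclasses import dataclass


@dataclass
class DomNode:
    xpath: str  # e.g. "/html/body/div[1]/span[2]" (assumed input format)
    text: str


def xpath_steps(node):
    """Split an xpath into its steps, e.g. ['html', 'body', 'div[1]']."""
    return node.xpath.strip("/").split("/")


def node_features(node):
    """Stage-1 stand-in: merge text cues and markup cues into one feature set."""
    tags = [step.split("[")[0] for step in xpath_steps(node)]
    feats = {f"tok={tok.lower()}" for tok in node.text.split()}
    feats |= {f"tag={tag}" for tag in tags}
    feats.add(f"leaf={tags[-1]}")
    return feats


def xpath_distance(a, b):
    """Count the xpath steps that lie outside the two nodes' common prefix."""
    steps_a, steps_b = xpath_steps(a), xpath_steps(b)
    common = 0
    for x, y in zip(steps_a, steps_b):
        if x != y:
            break
        common += 1
    return (len(steps_a) - common) + (len(steps_b) - common)


def pair_features(a, b):
    """Stage-2 stand-in: describe how two nodes relate to each other."""
    return {
        "shared": len(node_features(a) & node_features(b)),
        "tree_distance": xpath_distance(a, b),
    }


label = DomNode("/html/body/div[1]/span[1]", "Price:")
value = DomNode("/html/body/div[1]/span[2]", "$19.99")
relation = pair_features(label, value)
```

In a real system of the kind the abstract describes, both feature functions would be learned networks and the pair representation would feed a classifier; the sketch only shows why pairing a value node with a nearby label node adds information that a single node's features lack.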
