Learning Rules to Pre-process Web Data for Automatic Integration
Second International Conference on Rules and Rule Markup Languages for the Semantic Web (RuleML'06)
Athens, Georgia, USA, November 10-11, 2006
ISBN: 0-7695-2652-7
Web pages such as product catalogues and search engine result pages often follow a global layout template that helps users retrieve information. In this paper we present a technique that makes such content machine-processable by extracting it and transforming it into tabular form. We achieve this with ViPER, our fully automatic wrapper system, which localizes and extracts structured data records from such web pages using a strategy based on the visual perception of a web page. The first contribution of this paper is a detailed account of ViPER's post-processing heuristics, which are materialized as a set of rules. Once these rules are defined, the regular content of a web page can be abstracted into a relational view. Second, we show that new, unseen content rendered with the same layout only has to be extracted by ViPER; the remaining transformation can be performed by applying the learned rules.
Citation:
Kai Simon, Thomas Hornung, Georg Lausen, "Learning Rules to Pre-process Web Data for Automatic Integration," ruleml, pp.107-116, Second International Conference on Rules and Rule Markup Languages for the Semantic Web (RuleML'06), 2006