Rules and Rule Markup Languages for the Semantic Web, International Conference on (2006)
Athens, Georgia, USA
Nov. 10, 2006 to Nov. 11, 2006
ISBN: 0-7695-2652-7
pp: 107-116
Kai Simon, Universität Freiburg, Germany
Thomas Hornung, Universität Freiburg, Germany
Georg Lausen, Universität Freiburg, Germany
Web pages such as product catalogues and search engine result pages often follow a global layout template, which makes it easier for a user to retrieve information. In this paper we present a technique that makes such content machine-processable by extracting it and transforming it into tabular form. We achieve this with ViPER, our fully automatic wrapper system, which localizes and extracts structured data records from such web pages using a strategy based on the visual perception of a web page.

The first contribution of this paper is a detailed account of the post-processing heuristics of ViPER, which are materialized as a set of rules. Once these rules are defined, the regular content of a web page can be abstracted into a relational view. Second, we show that new, unseen content rendered with the same layout only has to be extracted by ViPER; the remaining transformation can then be performed by applying the learned rules.

