2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI'06)
Interactive Tuples Extraction from Semi-Structured Data
Hong Kong, China
December 18-22, 2006
ISBN: 0-7695-2747-7
Remi Gilleron, INRIA Futurs and Lille University, France
Patrick Marty, INRIA Futurs and Lille University, France
Marc Tommasi, INRIA Futurs and Lille University, France
Fabien Torre, INRIA Futurs and Lille University, France
This paper studies, from a machine learning viewpoint, the problem of extracting tuples of a target n-ary relation from tree-structured data such as XML or XHTML documents. Our system extracts tuples without any post-processing from all data structures, including nested, rotated, and cross tables. The wrapper induction algorithm we propose rests on two main ideas. First, it is incremental: partial tuples are extracted by increasing length. Second, it relies on a representation-enrichment procedure: partial tuples of length i are encoded using knowledge of the extracted tuples of length i - 1. The algorithm is then embedded in a user-friendly interactive wrapper induction system for Web documents. We evaluate the system on several information extraction tasks over corporate Web sites. It achieves state-of-the-art results on simple data structures and succeeds on complex data structures where previous approaches fail. Experiments also show that the interactive framework significantly reduces the number of user interactions needed to build a wrapper.
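The incremental idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the node encoding, the function `extract_tuples`, and the hand-written classifiers `is_name` and `is_email` are all hypothetical stand-ins (in the paper the classifiers are learned). The sketch only shows how a partial tuple of length i - 1 can inform the decision for component i.

```python
from typing import Callable, List, Sequence, Tuple

Node = Tuple[str, str]       # (path in the tree, text content); hypothetical encoding
Partial = Tuple[Node, ...]   # components extracted so far

# A classifier for component i sees a candidate node together with the
# partial tuple of length i - 1 (a toy form of representation enrichment).
Classifier = Callable[[Node, Partial], bool]


def extract_tuples(nodes: Sequence[Node],
                   classifiers: Sequence[Classifier]) -> List[Partial]:
    """Grow partial tuples one component at a time."""
    partials: List[Partial] = [()]
    for classify in classifiers:
        extended: List[Partial] = []
        for partial in partials:
            for node in nodes:
                if classify(node, partial):
                    extended.append(partial + (node,))
        partials = extended
    return partials


def row(path: str) -> str:
    """Parent path of a cell, e.g. '/table/tr[1]'."""
    return path.rsplit("/", 1)[0]


# Hand-written stand-ins for learned classifiers (assumption).
def is_name(node: Node, partial: Partial) -> bool:
    return node[0].endswith("td[1]")


def is_email(node: Node, partial: Partial) -> bool:
    # Uses the already extracted first component: same row as the name.
    return (bool(partial)
            and node[0].endswith("td[2]")
            and row(node[0]) == row(partial[0][0]))


nodes = [
    ("/table/tr[1]/td[1]", "Alice"),
    ("/table/tr[1]/td[2]", "alice@example.org"),
    ("/table/tr[2]/td[1]", "Bob"),
    ("/table/tr[2]/td[2]", "bob@example.org"),
]

tuples = extract_tuples(nodes, [is_name, is_email])
texts = [tuple(text for _, text in t) for t in tuples]
print(texts)
```

On this toy table, the second classifier only accepts an e-mail cell that shares a row with the name already in the partial tuple, so components from different rows are never combined.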
Citation:
Remi Gilleron, Patrick Marty, Marc Tommasi, Fabien Torre, "Interactive Tuples Extraction from Semi-Structured Data," 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI'06), pp. 997-1004, 2006