2009 International Conference on Advanced Information Networking and Applications Workshops
CUTER: An Efficient Useful Text Extraction Mechanism
Bradford, United Kingdom
May 26-May 29
ISBN: 978-0-7695-3639-2
In this paper we present CUTER, a system that processes HTML pages to extract their useful text. The mechanism focuses on HTML pages that contain news articles from major portals and blogs. We define useful text as the body of the article that contains the news report. To extract the body of the article, we decompose the HTML page into its DOM model and apply a set of algorithms that clean and correct the HTML code, locate and characterize each node of the DOM model, and finally store the text from the nodes characterized as useful text nodes. CUTER is a subsystem of peRSSonal, a web tool that obtains news articles from all over the world, processes them, and presents them back to end users in a personalized manner. The role of CUTER is to feed peRSSonal with the body of the article. In this paper we present the basic algorithms and experimental results on the efficiency of the CUTER text extractor.
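The abstract does not spell out the CUTER algorithms, so the following is only a minimal sketch of the general idea it describes: walk the page's markup, characterize each block-level node, and keep the text of the nodes judged useful. Everything specific here is an assumption made for illustration and not the authors' method: the names `BlockCollector` and `extract_useful_text`, the tag sets, and the two heuristics (a minimum text length and a maximum share of link text). The sketch also assumes reasonably well-formed markup, so it omits the cleaning and correction step the abstract mentions.

```python
from html.parser import HTMLParser

# Tags whose content is never treated as article text (assumption).
SKIP_TAGS = {"head", "script", "style", "nav", "header", "footer", "aside", "form"}
# Block-level tags treated as candidate nodes (assumption).
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}


class BlockCollector(HTMLParser):
    """Collects (text, link_chars) for each block-level run of text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []        # finished blocks: (text, link_chars)
        self._pieces = []       # text fragments of the current block
        self._link_chars = 0    # characters of the current block inside <a>
        self._skip_depth = 0    # > 0 while inside a skipped tag
        self._link_depth = 0    # > 0 while inside <a>

    def flush_block(self):
        # Normalize whitespace and close the current block, if it has text.
        text = " ".join(" ".join(self._pieces).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._pieces, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush_block()
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._pieces.append(data)
        if self._link_depth:
            self._link_chars += len(data.strip())


def extract_useful_text(html, min_chars=80, max_link_ratio=0.3):
    """Return the text of blocks that pass both illustrative heuristics."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush_block()  # keep any text that follows the last block tag
    useful = [
        text
        for text, link_chars in parser.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_ratio
    ]
    return "\n\n".join(useful)
```

Calling `extract_useful_text(page_source)` on a news page would return the paragraphs that are long and mostly free of link text, which tends to drop navigation bars and "related stories" lists. The thresholds `min_chars` and `max_link_ratio` are arbitrary starting points and would need tuning against real pages.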
Index Terms:
Text extraction, HTML analysis, DOM analysis, useful text
Citation:
George Adam, Christos Bouras, Vassilis Poulopoulos, "CUTER: An Efficient Useful Text Extraction Mechanism," waina, pp.703-708, 2009 International Conference on Advanced Information Networking and Applications Workshops, 2009