2008 19th International Conference on Database and Expert Systems Application
Text Extraction from the Web via Text-to-Tag Ratio
September 01-September 05
ISBN: 978-0-7695-3299-8
We describe a method to extract content text from diverse Web pages by using the HTML document's Text-To-Tag Ratio rather than specific HTML cues that may not be consistent across Web pages. We describe how to compute the Text-To-Tag Ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach we show surprisingly high levels of recall at all levels of precision, along with large space savings.
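The line-by-line computation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the regex-based tag matching, the rule that a line with no tags divides by 1, and the standard-deviation cutoff used here as a stand-in for the clustering step are all assumptions made for illustration.

```python
import re
from statistics import pstdev

# Assumption: anything between '<' and '>' counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html: str) -> list[float]:
    """Return one text-to-tag ratio per line of the HTML source."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        text_length = len(TAG_RE.sub("", line).strip())
        # Assumption: a line without tags divides by 1 to avoid division by zero.
        ratios.append(text_length / max(tag_count, 1))
    return ratios


def content_lines(html: str) -> list[str]:
    """Keep lines whose ratio exceeds a simple threshold.

    Assumption: the population standard deviation of the ratios serves
    as the cutoff; this stands in for the clustering step.
    """
    lines = html.splitlines()
    ratios = text_to_tag_ratios(html)
    if not ratios:
        return []
    threshold = pstdev(ratios)
    return [line for line, ratio in zip(lines, ratios) if ratio > threshold]
```

On a page where navigation lines carry many tags and little text, those lines receive low ratios, while paragraph lines with long runs of text and few tags receive high ratios and survive the cutoff.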
Index Terms:
Information Extraction, Web, Histogram
Citation:
Tim Weninger, William H. Hsu, "Text Extraction from the Web via Text-to-Tag Ratio," 2008 19th International Conference on Database and Expert Systems Application (DEXA), pp. 23-28, 2008