2008 19th International Conference on Database and Expert Systems Application
Text Extraction from the Web via Text-to-Tag Ratio
September 01-September 05
ISBN: 978-0-7695-3299-8
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DEXA.2008.12
We describe a method to extract content text from diverse Web pages by using the HTML document's Text-To-Tag Ratio rather than specific HTML cues that may not be constant across various Web pages. We describe how to compute the Text-To-Tag Ratio on a line-by-line basis and then cluster the results into content and non-content areas. With this approach we show surprisingly high levels of recall at all levels of precision, along with large space savings.
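
The line-by-line computation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the ratio for a line is the count of non-tag characters divided by the count of tags, with the denominator set to 1 when a line has no tags. The clustering step is replaced by a deliberately simple stand-in (a threshold at the mean ratio), and the names `text_to_tag_ratio` and `split_content` are hypothetical.

```python
import re

# Assumption: anything between '<' and '>' counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Return non-tag characters divided by tag count for one line.

    Assumption: a line with no tags uses a denominator of 1, so its
    ratio equals its text length.
    """
    tag_count = len(TAG_RE.findall(line))
    text_length = len(TAG_RE.sub("", line).strip())
    return text_length / max(tag_count, 1)


def split_content(html: str):
    """Label each line as content (True) or non-content (False).

    Stand-in for the clustering step: a line counts as content when its
    ratio exceeds the mean ratio of the page. This threshold is an
    illustrative assumption, not the method evaluated in the paper.
    """
    lines = html.splitlines()
    ratios = [text_to_tag_ratio(line) for line in lines]
    threshold = sum(ratios) / len(ratios) if ratios else 0.0
    return [(line, ratio, ratio > threshold) for line, ratio in zip(lines, ratios)]


if __name__ == "__main__":
    page = (
        "<div><a href='/'>Home</a></div>\n"
        "<p>This is a long paragraph of article text for the reader.</p>\n"
        "<br>"
    )
    for line, ratio, is_content in split_content(page):
        print(f"{ratio:6.2f}  {is_content!s:5}  {line}")
```

In this sketch, navigation-style lines with many tags and little text receive low ratios, while paragraph lines with long text and few tags receive high ratios, which is the signal the abstract proposes clustering on.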
Index Terms:
Information Extraction, Web, Histogram
Citation:
Tim Weninger, William H. Hsu, "Text Extraction from the Web via Text-to-Tag Ratio," dexa, pp.23-28, 2008 19th International Conference on Database and Expert Systems Application, 2008