2008 19th International Conference on Database and Expert Systems Application
Content Code Blurring: A New Approach to Content Extraction
September 01-September 05
ISBN: 978-0-7695-3299-8
Most HTML documents on the World Wide Web contain far more than the article or text which forms their main content. Navigation menus, functional and design elements or commercial banners are typical examples of additional contents. Content Extraction is the process of identifying the main content and/or removing the additional contents. We introduce content code blurring, a novel Content Extraction algorithm. As the main text content is typically a long, homogeneously formatted region in a web document, the aim is to identify exactly these regions in an iterative process. Comparing its performance with existing Content Extraction solutions, we show that for most documents content code blurring delivers the best results.
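The idea of locating long, homogeneously formatted regions can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the representation, the blurring filter, or the stopping rule. The sketch assumes a per-character vector (1.0 for text, 0.0 for markup), a simple moving average as the blur, a fixed number of iterations, and a fixed threshold. The function names and the parameters `radius`, `iterations`, and `threshold` are hypothetical.

```python
def content_code_vector(html):
    """Mark each character as text content (1.0) or markup code (0.0).

    Assumption: anything between '<' and '>' (inclusive) counts as code.
    """
    vector = []
    in_tag = False
    for ch in html:
        if ch == "<":
            in_tag = True
        vector.append(0.0 if in_tag else 1.0)
        if ch == ">":
            in_tag = False
    return vector


def blur(vector, radius):
    """One blurring pass: replace each value by the mean of its neighbourhood.

    Assumption: a plain moving average stands in for whatever filter
    the paper actually uses. The window is clipped at the borders.
    """
    blurred = []
    for i in range(len(vector)):
        lo = max(0, i - radius)
        hi = min(len(vector), i + radius + 1)
        window = vector[lo:hi]
        blurred.append(sum(window) / len(window))
    return blurred


def extract_main_content(html, radius=10, iterations=3, threshold=0.75):
    """Keep text characters whose blurred value stays at or above the threshold.

    All parameter defaults are illustrative guesses, not values from the paper.
    """
    vector = content_code_vector(html)
    blurred = vector
    for _ in range(iterations):  # the iterative blurring step
        blurred = blur(blurred, radius)
    kept = [
        ch
        for ch, original, value in zip(html, vector, blurred)
        if original == 1.0 and value >= threshold
    ]
    return "".join(kept)


if __name__ == "__main__":
    page = (
        '<div><a href="/home">Home</a><a href="/news">News</a></div>'
        "<p>" + "Long article text. " * 20 + "</p>"
    )
    print(extract_main_content(page))
```

In this sketch, short text fragments surrounded by markup (such as link labels in a navigation menu) are pulled below the threshold by their neighbours, while the interior of a long text run keeps a value close to 1.0. Characters at the very edges of the long run may be trimmed, which a real implementation would need to handle.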
Index Terms:
Content Extraction, content code blurring, web information retrieval, main content detection
Citation:
Thomas Gottron, "Content Code Blurring: A New Approach to Content Extraction," DEXA, pp. 29-33, 2008 19th International Conference on Database and Expert Systems Application, 2008