Automatic Identification of Informative Sections of Web Pages
September 2005 (vol. 17 no. 9)
pp. 1233-1246
BibTeX:
@article{10.1109/TKDE.2005.138,
  author = {Sandip Debnath and Prasenjit Mitra and Nirmal Pal and C. Lee Giles},
  title = {Automatic Identification of Informative Sections of Web Pages},
  journal = {IEEE Transactions on Knowledge and Data Engineering},
  volume = {17},
  number = {9},
  issn = {1041-4347},
  year = {2005},
  pages = {1233-1246},
  doi = {http://doi.ieeecomputersociety.org/10.1109/TKDE.2005.138},
  publisher = {IEEE Computer Society},
  address = {Los Alamitos, CA, USA},
}
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2005.138
Web pages, especially dynamically generated ones, contain several items that cannot be classified as "primary content," e.g., navigation sidebars, advertisements, and copyright notices. Most clients and end users search for the primary content and largely do not seek the noninformative content. A tool that helps an end user or application search and process information from Web pages automatically must separate the "primary content sections" from the other content sections. We call these sections "Web page blocks" or just "blocks." First, a tool must segment the Web pages into Web page blocks and, second, it must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into its constituent Web page blocks. We then propose four new algorithms: ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, 2) looking for blocks with desired features, and 3) using classifiers trained with block features, respectively. Operating on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.
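The first idea in the abstract, that primary content blocks are those which do not recur across many pages, can be illustrated with a minimal Python sketch. This is not the authors' ContentExtractor: it assumes pages are already segmented into blocks of text, and it treats two blocks as the same only when their normalized text matches exactly, which is a simplification. The names `normalize`, `block_document_frequency`, `primary_blocks`, and the `max_fraction` threshold are hypothetical, chosen for illustration only.

```python
from collections import Counter
from typing import List


def normalize(block: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(block.lower().split())


def block_document_frequency(pages: List[List[str]]) -> Counter:
    """Count, for each distinct block, the number of pages it appears on."""
    counts: Counter = Counter()
    for blocks in pages:
        # A set ensures a block repeated within one page is counted once.
        counts.update({normalize(b) for b in blocks})
    return counts


def primary_blocks(pages: List[List[str]], max_fraction: float = 0.5) -> List[List[str]]:
    """Keep blocks that occur on at most max_fraction of the pages.

    The threshold is an assumption of this sketch, not a value from the paper.
    """
    df = block_document_frequency(pages)
    limit = max_fraction * len(pages)
    return [[b for b in blocks if df[normalize(b)] <= limit] for blocks in pages]


# Toy input: each page is a list of already-segmented block texts.
pages = [
    ["Home | News | Sports", "Storm closes coastal roads", "Copyright 2005 Example Corp"],
    ["Home | News | Sports", "Local team wins final", "Copyright 2005 Example Corp"],
    ["Home | News | Sports", "Council approves new budget", "Copyright 2005 Example Corp"],
]

result = primary_blocks(pages)
for kept in result:
    print(kept)
```

On this toy input the navigation bar and copyright notice appear on every page and are dropped, while each page's single unique block is kept. A practical system would need a similarity measure rather than exact matching, since repeated blocks often differ slightly from page to page.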
Index Terms:
Data mining, feature extraction or construction, text mining, Web mining, Web page block, informative block, inverse block document frequency.
Citation:
Sandip Debnath, Prasenjit Mitra, Nirmal Pal, C. Lee Giles, "Automatic Identification of Informative Sections of Web Pages," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1233-1246, Sept. 2005, doi:10.1109/TKDE.2005.138

