Issue No. 05 - May (2005 vol. 17)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2005.84
Jan-Ming Ho, IEEE
Ming-Syan Chen, IEEE
To increase the commercial value and accessibility of pages, most content sites tend to publish their pages with intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining intrapage informative structure in news Web sites in order to find and eliminate redundant information. An intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained, informative blocks. The intrapage informative structures of pages in a news Web site contain only anchors linking to news pages or bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web Intrapage Informative Structure Mining based on the Document Object Model), which applies Information Theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is then built by expanding this set using the proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validate WISDOM's practical applicability.
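The entropy idea in the abstract can be illustrated with a minimal sketch. It is not the WISDOM algorithm: it assumes each page has already been split into flat blocks of terms (standing in for DOM subtrees), it scores a block by the average normalized entropy of its terms across pages, and it leaves out the top-down search and merging steps. The names `term_entropy` and `informative_blocks` and the `threshold` value are hypothetical choices made for illustration.

```python
import math


def term_entropy(counts_per_page):
    """Normalized entropy (0..1) of one term's distribution over pages.

    A term spread evenly over all pages (e.g. a navigation label) scores
    close to 1; a term confined to a single page scores 0.
    """
    total = sum(counts_per_page)
    n = len(counts_per_page)
    if total == 0 or n < 2:
        return 0.0
    h = 0.0
    for c in counts_per_page:
        if c > 0:
            p = c / total
            h -= p * math.log(p, n)  # log base n normalizes to [0, 1]
    return h


def informative_blocks(pages, threshold=0.5):
    """Keep blocks whose mean term entropy is below `threshold`.

    `pages` is a list of pages; each page is a list of blocks; each block
    is a list of terms. The threshold is an assumed, untuned value.
    """
    n = len(pages)
    counts = {}
    for i, page in enumerate(pages):
        for block in page:
            for term in block:
                counts.setdefault(term, [0] * n)[i] += 1

    kept_per_page = []
    for page in pages:
        kept = []
        for block in page:
            if not block:
                continue
            score = sum(term_entropy(counts[t]) for t in block) / len(block)
            if score < threshold:
                kept.append(block)
        kept_per_page.append(kept)
    return kept_per_page


# Toy site: every page repeats the same navigation block.
nav = ["home", "sports", "world"]
pages = [
    [nav, ["election", "results", "announced"]],
    [nav, ["storm", "hits", "coast"]],
    [nav, ["team", "wins", "final"]],
]
for kept in informative_blocks(pages):
    print(kept)
```

In this toy input the navigation terms occur once on every page, so their normalized entropy is approximately 1 and the block is dropped, while the article terms each occur on a single page and score 0.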
Intrapage informative structure, DOM, entropy, information extraction.
Jan-Ming Ho, Ming-Syan Chen, Hung-Yu Kao, "WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model", IEEE Transactions on Knowledge & Data Engineering, vol. 17, no. 5, pp. 614-627, May 2005, doi:10.1109/TKDE.2005.84