Mining Web Informative Structures and Contents Based on Entropy Analysis
January 2004 (vol. 16 no. 1)
pp. 41-55
Shian-Hua Lin, IEEE Computer Society

Abstract: In this paper, we study the problem of mining the informative structure of a news Web site that consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (also called TOC, i.e., table-of-contents, pages) and a set of article pages linked by these TOC pages. Based on the Hyperlink-Induced Topic Search (HITS) algorithm, we propose an entropy-based analysis mechanism (LAMIS) that analyzes the entropy of anchor texts and links to eliminate redundancy in the hyperlinked structure, so that the complex structure of a Web site can be distilled. However, to increase the value and accessibility of their pages, most content sites publish them with intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. To further eliminate such redundancy, we propose another mechanism, called InfoDiscoverer, which applies the distilled structure to identify sets of article pages. InfoDiscoverer also uses entropy to analyze the information measures of article sets and to extract informative content blocks from these sets. Our results are useful for search engines, information agents, and crawlers that index, extract, and navigate significant information from a Web site. Experiments on several real news Web sites show that the precision and recall of our approaches are far superior to those obtained by conventional methods in mining the informative structures of news Web sites. On average, the augmented LAMIS yields a marked performance improvement, increasing the precision by a factor ranging from 122 to 257 percent when the desired recall falls between 0.5 and 1. In comparison with manual heuristics, the precision and the recall of InfoDiscoverer are greater than 0.956.
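The entropy idea in the abstract can be illustrated with a small sketch. This is not the paper's LAMIS or InfoDiscoverer formulation, which the abstract does not spell out; it is only the standard Shannon entropy, normalized by the number of pages, applied to how often one anchor text appears on each page of a site. The function name `normalized_entropy` and the sample counts are assumptions made for illustration.

```python
import math

def normalized_entropy(counts):
    """Shannon entropy of a count distribution, scaled to [0, 1].

    counts[i] is how often a feature (e.g. an anchor text) occurs on page i.
    The log base is the number of pages, so a feature spread evenly over
    all pages scores 1 and a feature confined to one page scores 0.
    """
    total = sum(counts)
    n = len(counts)
    if total == 0 or n < 2:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p, n)
    return h

# Hypothetical occurrence counts across four pages of one site.
nav_link = [3, 3, 3, 3]   # e.g. a "Home" link repeated on every page
headline = [5, 0, 0, 0]   # e.g. an article headline found on one page

print(normalized_entropy(nav_link))
print(normalized_entropy(headline))
```

Under this simplified reading, a score near 1 suggests a feature that is repeated site-wide and is therefore likely redundant (navigation, advertisements), while a score near 0 suggests a feature specific to a few pages and therefore more likely informative.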

Index Terms:
Informative structure, link analysis, hubs and authorities, anchor text, entropy, information extraction.
Hung-Yu Kao, Shian-Hua Lin, Jan-Ming Ho, Ming-Syan Chen, "Mining Web Informative Structures and Contents Based on Entropy Analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 1, pp. 41-55, Jan. 2004, doi:10.1109/TKDE.2004.1264821