WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model
May 2005 (vol. 17 no. 5)
pp. 614-627
To increase the commercial value and accessibility of pages, most content sites tend to publish their pages with intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining the intrapage informative structure in news Web sites in order to find and eliminate redundant information. The intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained, informative blocks. The intrapage informative structures of pages in a news Web site contain only anchors linking to news pages or bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web Intrapage Informative Structure Mining based on the Document Object Model), which applies Information Theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is then built by expanding this set using the proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validate WISDOM's practical applicability.
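The abstract does not give WISDOM's actual measures or merging methods, so the following is only a minimal sketch of the general idea of combining term entropy with a top-down search over a DOM-like tree. Everything in it is an assumption made for illustration: the `Node` class, the functions `terms`, `term_entropy`, and `candidate_blocks`, the whitespace tokenization, and the threshold of 0.5 are all hypothetical and are not taken from the paper. The sketch assumes that a term spread evenly across the pages of a site (high entropy) is likely redundant, while a term concentrated in one page (low entropy) is likely informative.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag, its own text, its children."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def terms(node):
    """Collect the lowercase whitespace-separated terms of a whole subtree."""
    out = node.text.lower().split()
    for child in node.children:
        out.extend(terms(child))
    return out


def term_entropy(pages):
    """Normalized Shannon entropy of each term's frequency distribution over pages.

    A value near 1 means the term is spread evenly over all pages (template-like);
    a value of 0 means it occurs in a single page only.
    """
    counts = [Counter(terms(page)) for page in pages]
    n = len(pages)
    vocab = set().union(*counts)
    entropy = {}
    for term in vocab:
        freqs = [c[term] for c in counts]
        total = sum(freqs)
        h = -sum((f / total) * math.log(f / total) for f in freqs if f)
        entropy[term] = h / math.log(n) if n > 1 else 0.0
    return entropy


def candidate_blocks(node, entropy, threshold=0.5):
    """Top-down search: keep a subtree whose mean term entropy is at or below
    the (assumed) threshold; otherwise descend into its children."""
    ts = terms(node)
    if not ts:
        return []
    mean = sum(entropy.get(t, 0.0) for t in ts) / len(ts)
    if mean <= threshold:
        return [node]
    found = []
    for child in node.children:
        found.extend(candidate_blocks(child, entropy, threshold))
    return found


# Two toy pages sharing a navigation panel but carrying different articles.
nav_text = "home sports world weather"
page1 = Node("body", children=[Node("nav", nav_text),
                               Node("article", "election results announced")])
page2 = Node("body", children=[Node("nav", nav_text),
                               Node("article", "storm hits coast")])

entropy = term_entropy([page1, page2])
blocks = candidate_blocks(page1, entropy)
print([b.tag for b in blocks])
```

In this toy input the shared navigation terms receive the maximum normalized entropy and the article terms receive zero, so the search descends past the page root and keeps only the article subtree. A real system would also need the merging step that the abstract mentions, which is not sketched here.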

Index Terms:
Intrapage informative structure, DOM, entropy, information extraction.
Hung-Yu Kao, Jan-Ming Ho, Ming-Syan Chen, "WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 5, pp. 614-627, May 2005, doi:10.1109/TKDE.2005.84