Learning Object Models from Semistructured Web Documents
March 2006 (vol. 18 no. 3)
pp. 334-349
This paper presents an automated approach to learning object models from object data extracted from data-intensive semistructured web documents such as product descriptions. Modeling data-intensive content on the Web involves three phases. First, we identify the object region, which covers the descriptions of object data, by excluding irrelevant content from the web documents. Second, we partition the contents of the different object data appearing in the object region and construct object data as hierarchical XML output. Third, we induce the abstract object model from the analogous object data. This model matches the corresponding object data from a Web site more precisely and comprehensively than existing handcrafted ontologies. The main contribution of this study is a fully automated approach to extracting object data and object models from semistructured web documents using kernel-based matching and View Syntax interpretation. Our system, OnModer, can automatically construct object data and induce object models from complicated web documents, such as the technical descriptions of personal computers and digital cameras downloaded from manufacturers' and vendors' sites. A comparison with available handcrafted ontologies and tests on an open corpus demonstrate that our framework is effective in extracting meaningful and comprehensive models.
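
To make the third phase more concrete, the following is a minimal Python sketch of inducing an abstract model from analogous object data. It is an illustration only and not the OnModer implementation: it uses neither kernel-based matching nor View Syntax interpretation, and simply keeps the attribute paths shared by a sufficient fraction of the objects. The nested dictionaries stand in for the hierarchical XML output of the second phase, and the names attribute_paths, induce_model, min_support, and the sample camera data are all assumptions made for this example.

```python
from collections import Counter


def attribute_paths(obj, prefix=()):
    """Yield the path of every attribute in a nested dict, parents before children."""
    for key, value in obj.items():
        path = prefix + (key,)
        yield path
        if isinstance(value, dict):
            yield from attribute_paths(value, path)


def induce_model(objects, min_support=0.5):
    """Keep the attribute paths present in at least min_support of the objects."""
    counts = Counter()
    for obj in objects:
        # A set ensures each object votes at most once per path.
        counts.update(set(attribute_paths(obj)))
    threshold = min_support * len(objects)
    return sorted(path for path, n in counts.items() if n >= threshold)


# Hypothetical object data, standing in for the hierarchical XML of phase two.
cameras = [
    {"sensor": {"type": "CCD", "megapixels": "5.0"}, "lens": {"zoom": "3x"}},
    {"sensor": {"type": "CMOS", "megapixels": "8.0"}, "weight": "400 g"},
    {"sensor": {"megapixels": "6.1"}, "lens": {"zoom": "10x"}},
]

model = induce_model(cameras)
for path in model:
    print("/".join(path))
```

With these sample objects, the "weight" attribute appears in only one of three descriptions and is dropped, while the sensor and lens attributes are retained. A simple support threshold is the crudest possible choice; it is used here only to show the idea of generalizing over analogous object data.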


Index Terms:
Web mining, machine learning, intelligent web services and Semantic Web, web text analysis, knowledge acquisition, ontology design, computational geometry and object modeling, DOM.
Shiren Ye, Tat-Seng Chua, "Learning Object Models from Semistructured Web Documents," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 3, pp. 334-349, March 2006, doi:10.1109/TKDE.2006.47