A Framework for Learning Comprehensible Theories in XML Document Classification
January 2012 (vol. 24 no. 1)
pp. 1-14
Jemma Wu, Macquarie University, Australia
XML has become the universal data format for a wide variety of information systems. The large number of XML documents on the web and in other information storage systems makes classification an important task. As a typical form of semistructured data, XML documents have both structure and content. Traditional text learning techniques are poorly suited to XML document classification because they do not take structure into account. This paper presents a novel, complete framework for XML document classification. We first present a knowledge representation method for XML documents based on a typed higher-order logic formalism. With this method, an XML document is represented as a higher-order logic term that captures both its content and its structure. We then present a decision-tree learning algorithm driven by the precision/recall breakeven point (PRDT) for the XML classification problem, which can produce comprehensible theories. Finally, we give a semi-supervised learning algorithm based on the PRDT algorithm and the co-training framework. Experimental results demonstrate that our framework achieves good performance in both supervised and semi-supervised learning, with the added benefit of producing comprehensible learned theories.
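For readers unfamiliar with the precision/recall breakeven point named in the abstract, the following is a minimal sketch of how that measure is commonly computed for a ranked list of documents. It is not the paper's PRDT algorithm; the function name `breakeven_point`, the 0/1 label encoding, and the sample scores are assumptions made purely for illustration. The sketch relies on the fact that when the number of documents predicted positive equals the number of truly positive documents, precision and recall share the same denominator and are therefore equal.

```python
def breakeven_point(scores, labels):
    """Precision/recall breakeven point for one class (illustrative sketch).

    scores: classifier confidence per document (higher = more likely positive)
    labels: 1 for a truly positive document, 0 otherwise (assumed encoding)
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    # Rank documents from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    # Predict the top n_pos documents as positive. Precision is then
    # TP / n_pos and recall is TP / n_pos, so the two measures coincide.
    true_pos = sum(label for _, label in ranked[:n_pos])
    return true_pos / n_pos


# Hypothetical scores for five documents, three of which are positive.
example_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
example_labels = [1, 0, 1, 1, 0]
example_bep = breakeven_point(example_scores, example_labels)
```

In this hypothetical example the top three documents contain two true positives, so precision and recall are both two thirds at the breakeven cutoff. Ties in the scores are broken by input order here, which is a simplification.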

Index Terms:
XML document, machine learning, knowledge representation, semi-supervised learning.
Jemma Wu, "A Framework for Learning Comprehensible Theories in XML Document Classification," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, pp. 1-14, Jan. 2012, doi:10.1109/TKDE.2011.158