A New Text Categorization Technique Using Distributional Clustering and Learning Logic
September 2006 (vol. 18 no. 9)
pp. 1156-1165
Text categorization remains one of the most researched NLP problems because of the ever-increasing number of electronic documents and digital libraries. In this paper, we present a new text categorization method that combines the distributional clustering of words with a learning logic technique, called Lsquare, for constructing text classifiers. The high dimensionality of document text hinders categorization, and feature clustering has been shown to be an effective alternative to feature selection for reducing that dimensionality. We therefore use a distributional clustering method, the information bottleneck (IB), to generate an efficient representation of documents and apply Lsquare to train text classifiers. The method was extensively tested and evaluated. Under identical experimental settings with a small number of training documents, the proposed method achieves higher or comparable classification accuracy and $F_1$ results compared with SVM on three benchmark data sets: WebKB, 20Newsgroup, and Reuters-21578. The results indicate that the method is a good choice for applications with a limited amount of labeled training data. We also demonstrate the effect of changing the training set size on the classification performance of the learners.
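
To make the idea of replacing word features with word-cluster features concrete, here is a minimal Python sketch. It is not the authors' implementation: it does not perform IB clustering or Lsquare training. It assumes a word-to-cluster assignment is already available (the hand-made `word_to_cluster` table, the function names, and the toy document are all hypothetical) and only shows how a document collapses from one dimension per word to one dimension per cluster, plus the standard $F_1$ measure computed from true-positive, false-positive, and false-negative counts.

```python
from collections import Counter


def cluster_features(tokens, word_to_cluster, n_clusters):
    """Map a tokenized document to a vector of per-cluster word counts.

    Words missing from word_to_cluster are ignored. The mapping itself is
    assumed to come from some clustering step that is not shown here.
    """
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    return [counts.get(c, 0) for c in range(n_clusters)]


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical, hand-made clusters: 0 = sports words, 1 = finance words.
word_to_cluster = {
    "goal": 0, "match": 0, "team": 0,
    "stock": 1, "market": 1, "shares": 1,
}

doc = "the team won the match as the market closed".split()
vec = cluster_features(doc, word_to_cluster, n_clusters=2)
print(vec)  # two sports words and one finance word: [2, 1]
```

The resulting vector has as many entries as there are clusters, regardless of vocabulary size, which is the dimensionality reduction the abstract refers to. Any classifier could then be trained on such vectors.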


Index Terms:
Text categorization, feature selection, machine learning.
Citation:
Hisham Al-Mubaid, Syed A. Umair, "A New Text Categorization Technique Using Distributional Clustering and Learning Logic," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 9, pp. 1156-1165, Sept. 2006, doi:10.1109/TKDE.2006.135