Issue No.07 - July (2009 vol.21)
pp: 1027-1041
Marta Capdevila, University of Vigo, Vigo
Oscar W. Márquez Flórez, University of Vigo, Vigo
ABSTRACT
The basic concern of a communication system is to transfer information from its source to a destination some distance away. Textual documents also transmit information. In particular, from the point of view of a text categorization system, the information encoded by a document is the topic or category it belongs to. Following this initial intuition, a theoretical framework is developed in which Automatic Text Categorization (ATC) is studied from a communication system perspective. Under this approach, the problem of reducing the dimensionality of the indexing feature space is tackled by a two-level supervised scheme: a filter for noisy terms followed by a compression of redundant terms. Gaussian probabilistic categorizers are revisited and adapted to the sparsity that characterizes ATC. Experimental results on the 20 Newsgroups and Reuters-21578 collections validate the theoretical approaches. The noise filter and redundancy compressor allow an aggressive reduction of the term vocabulary (reduction factor greater than 0.99) with a minimal loss (lower than 3 percent) and, in some cases, a gain (greater than 4 percent) in final classification accuracy. The adapted Gaussian Naive Bayes classifier reaches classification results similar to those obtained by state-of-the-art Multinomial Naive Bayes (MNB) and Support Vector Machines (SVMs).
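As a rough illustration of the first level of such a scheme, supervised filtering of noisy terms might be sketched as below. This is not the authors' algorithm: the abstract does not state the scoring criterion, so the sketch assumes mutual information between term presence and category. The function names, the toy corpus, and the definition of the reduction factor as the fraction of vocabulary removed are likewise assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(docs, labels, term):
    """Mutual information (in bits) between presence of `term` and the category.

    Assumed scoring criterion; `docs` is a list of sets of terms.
    """
    n = len(docs)
    # Joint counts of (term present?, category) over the labelled documents.
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    count_t, count_c = Counter(), Counter()
    for (present, label), k in joint.items():
        count_t[present] += k
        count_c[label] += k
    mi = 0.0
    for (present, label), k in joint.items():
        p_joint = k / n
        p_indep = (count_t[present] / n) * (count_c[label] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


def filter_noisy_terms(docs, labels, keep):
    """Keep the `keep` highest-scoring terms; report the fraction removed."""
    vocab = sorted({term for doc in docs for term in doc})
    ranked = sorted(
        vocab,
        key=lambda term: mutual_information(docs, labels, term),
        reverse=True,
    )
    kept = ranked[:keep]
    reduction = 1 - len(kept) / len(vocab)  # assumed definition
    return kept, reduction


# Hypothetical toy corpus: two categories, four documents.
docs = [
    {"goal", "match", "the"},
    {"match", "team", "the"},
    {"stock", "market", "the"},
    {"market", "bank", "the"},
]
labels = ["sport", "sport", "finance", "finance"]

kept, reduction = filter_noisy_terms(docs, labels, keep=2)
print(kept, round(reduction, 3))
```

In this toy corpus the term "the" occurs in every document and so carries no information about the category, which is the sense in which a term counts as noise in this sketch. The second level, compression of redundant terms, is not shown.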
INDEX TERMS
Data communications, text processing, data compaction and compression, clustering, classifier design and evaluation, feature evaluation and selection.
CITATION
Marta Capdevila, Oscar W. Márquez Flórez, "A Communication Perspective on Automatic Text Categorization", IEEE Transactions on Knowledge & Data Engineering, vol.21, no. 7, pp. 1027-1041, July 2009, doi:10.1109/TKDE.2009.22