Issue No. 07 - July (2009 vol. 21)
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2009.22
Marta Capdevila, University of Vigo, Vigo
Oscar W. Márquez Flórez, University of Vigo, Vigo
The basic concern of a communication system is to transfer information from a source to a destination some distance away. Textual documents also transmit information: from the point of view of a text categorization system, the information a document encodes is the topic or category it belongs to. Following this intuition, a theoretical framework is developed in which Automatic Text Categorization (ATC) is studied from a communication system perspective. Under this approach, the problem of reducing the dimensionality of the indexing feature space is tackled with a two-level supervised scheme: noisy terms are filtered first, and redundant terms are then compressed. Gaussian probabilistic categorizers are revisited and adapted to the sparsity found in ATC. Experimental results on the 20 Newsgroups and Reuters-21578 collections validate the theoretical approach. The noise filter and redundancy compressor allow an aggressive reduction of the term vocabulary (reduction factor greater than 0.99) with a minimal loss (lower than 3 percent) and, in some cases, a gain (greater than 4 percent) in final classification accuracy. The adapted Gaussian Naive Bayes classifier reaches classification results similar to those obtained by state-of-the-art Multinomial Naive Bayes (MNB) and Support Vector Machines (SVMs).
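As a rough illustration of the two-level scheme described in the abstract, the sketch below filters low-information terms and then merges the surviving terms into a handful of features. It is not the authors' algorithm: the abstract does not specify the scoring or compression criteria, so the mutual-information score, the 0.5-bit threshold, the grouping of terms by their dominant category, the definition of the reduction factor as 1 - reduced/original, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def term_category_mi(docs, labels):
    """Mutual information (bits) between each term's presence and the category.

    Assumption: MI is used here as a stand-in noise score; the paper may
    use a different criterion.
    """
    n = len(docs)
    cat_count = Counter(labels)
    presence = defaultdict(Counter)  # term -> category -> docs containing term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            presence[term][label] += 1

    scores = {}
    for term, per_cat in presence.items():
        n_t = sum(per_cat.values())  # number of docs containing the term
        mi = 0.0
        for cat, n_c in cat_count.items():
            for present, n_tc in ((True, per_cat[cat]), (False, n_c - per_cat[cat])):
                if n_tc == 0:
                    continue  # zero-probability cells contribute nothing
                n_term = n_t if present else n - n_t
                mi += (n_tc / n) * math.log2(n_tc * n / (n_term * n_c))
        scores[term] = mi
    return scores


def compress_terms(docs, labels, kept_terms):
    """Map each kept term to the category in which it occurs most often.

    Assumption: a crude stand-in for redundancy compression, merging terms
    that point to the same category into a single feature.
    """
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in tokens:
            if term in kept_terms:
                counts[term][label] += 1
    return {term: per_cat.most_common(1)[0][0] for term, per_cat in counts.items()}


# Hypothetical toy corpus, not taken from 20 Newsgroups or Reuters-21578.
docs = [
    ["the", "match", "goal", "team"],
    ["the", "team", "score", "goal"],
    ["the", "stock", "market", "price"],
    ["the", "price", "market", "trade"],
]
labels = ["sport", "sport", "finance", "finance"]

vocabulary = {term for tokens in docs for term in tokens}
scores = term_category_mi(docs, labels)

# Level 1: noise filter (the 0.5-bit threshold is an arbitrary assumption).
kept = {term for term, mi in scores.items() if mi > 0.5}

# Level 2: redundancy compression into one merged feature per group.
clusters = compress_terms(docs, labels, kept)
n_features = len(set(clusters.values()))

# Assumed definition: fraction of the original vocabulary that was removed.
reduction_factor = 1 - n_features / len(vocabulary)
print(sorted(kept), n_features, round(reduction_factor, 3))
```

On this toy corpus the uninformative term "the" scores zero and is filtered out, while the surviving terms collapse into one feature per category. A reduction factor above 0.99, as reported in the abstract, would correspond under this assumed definition to keeping fewer than 1 in 100 of the original vocabulary terms.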
Data communications, text processing, data compaction and compression, clustering, classifier design and evaluation, feature evaluation and selection.
M. Capdevila and O. W. Márquez Flórez, "A Communication Perspective on Automatic Text Categorization," in IEEE Transactions on Knowledge & Data Engineering, vol. 21, no. 7, pp. 1027-1041, 2009.