Issue No. 09 - September (2008 vol. 20)

ISSN: 1041-4347

pp: 1264-1272

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.76

V.P. Kallimani, University of Nottingham, Malaysia Campus, Semenyih

Lam H. Lee, University of Nottingham, Semenyih

Dino Isa, The University of Nottingham, Malaysia Campus, Semenyih

ABSTRACT

This work implements an enhanced hybrid classification method that combines the naïve Bayes classifier with the Support Vector Machine (SVM). In this project, the Bayes formula was used to vectorize (as opposed to classify) a document according to a probability distribution over the categories to which the document may belong. The Bayes formula assigns the document a probability for each topic in a predetermined set, such as the topics found in the "20 newsgroups" dataset. Using this probability distribution as the vector that represents the document, the SVM can then be used to classify the documents on a multi-dimensional level. Classifying with only the highest probability produced by the naïve Bayes classifier causes an inadvertent dimensionality reduction; the SVM overcomes this by employing all the probability values associated with every category for each document. This method can be used for any dataset and shows a significant reduction in training time compared to the LSquare method, and a significant improvement in classification accuracy compared to pure naïve Bayes systems and TF-IDF/SVM hybrids.
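The vectorization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a multinomial naïve Bayes model with Laplace smoothing, a hypothetical six-document toy corpus, and hypothetical function names (`train_naive_bayes`, `bayes_vectorize`). It shows only how the Bayes formula turns a document into a vector of per-category probabilities, and how keeping the whole vector preserves information that taking only the highest probability would discard.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Estimate log priors and Laplace-smoothed log word likelihoods per category."""
    categories = sorted(set(labels))
    vocab = {w for d in docs for w in d.split()}
    word_counts = {c: Counter() for c in categories}
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    doc_counts = Counter(labels)
    log_prior = {c: math.log(doc_counts[c] / len(docs)) for c in categories}
    log_like = {}
    for c in categories:
        total = sum(word_counts[c].values()) + len(vocab)  # add-one smoothing
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return categories, log_prior, log_like

def bayes_vectorize(doc, categories, log_prior, log_like):
    """Return P(category | doc) for every category, ordered like `categories`."""
    scores = []
    for c in categories:
        s = log_prior[c]
        for w in doc.split():
            if w in log_like[c]:  # out-of-vocabulary words are ignored
                s += log_like[c][w]
        scores.append(s)
    # Normalise (the denominator of the Bayes formula), working in log space
    # to avoid underflow on long documents.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical toy corpus standing in for a labelled dataset.
docs = [
    "engine wheel car road", "car engine brake",
    "orbit rocket launch", "rocket orbit moon launch",
    "goal team match", "team score goal match",
]
labels = ["autos", "autos", "space", "space", "sport", "sport"]

categories, log_prior, log_like = train_naive_bayes(docs, labels)
vec = bayes_vectorize("rocket engine launch", categories, log_prior, log_like)

# Pure naive Bayes would keep only the top category ...
top = categories[vec.index(max(vec))]
# ... whereas the hybrid keeps the full probability vector as the SVM input.
print(categories, [round(p, 3) for p in vec], top)
```

In a full pipeline, one such vector per document would form the feature matrix passed to an SVM trainer (for example, scikit-learn's `sklearn.svm.SVC`, named here only as one possible choice). The feature dimension then equals the number of categories rather than the vocabulary size.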

INDEX TERMS

Document and Text Processing < Computing Methodologies, Document analysis, Classifier design and evaluation < Design Methodology < Pattern Recognition < Computing Methodologies, Text processing < Applications < Pattern Recognition < Computing Methodologies

CITATION

R. RajKumar, V.P. Kallimani, Lam H. Lee, Dino Isa, "Text Document Preprocessing with the Bayes Formula for Classification Using the Support Vector Machine",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 20, no. 9, pp. 1264-1272, September 2008, doi:10.1109/TKDE.2008.76