Text Document Preprocessing with the Bayes Formula for Classification Using the Support Vector Machine
September 2008 (vol. 20 no. 9)
pp. 1264-1272
Dino Isa, The University of Nottingham, Malaysia Campus, Semenyih
Lam H. Lee, University of Nottingham, Semenyih
V.P. Kallimani, University of Nottingham, Malaysia Campus, Semenyih
This work implements an enhanced hybrid classification method through the utilization of the naïve Bayes classifier and the Support Vector Machine (SVM). In this project, the Bayes formula was used to vectorize (as opposed to classify) a document according to a probability distribution reflecting the probable categories that the document may belong to. The Bayes formula gives a range of probabilities to which the document can be assigned according to a predetermined set of topics, such as those found in the "20 newsgroups" dataset. Using this probability distribution as the vector that represents the document, the SVM can then be used to classify the documents on a multi-dimensional level. The effects of an inadvertent dimensionality reduction caused by classifying using only the highest probability from the naïve Bayes classifier can be overcome using the SVM by employing all the probability values associated with every category for each document. This method can be used for any dataset and shows a significant reduction in training time as compared to the LSquare method and a significant improvement in classification accuracy when compared to pure naïve Bayes systems and also the TF-IDF/SVM hybrids.
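
The vectorization step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the multinomial word model, the Laplace (add-one) smoothing, the toy two-category corpus, and the function names `train_naive_bayes` and `bayes_vector` are all assumptions made for illustration. The sketch keeps the full posterior distribution over categories as the document vector instead of reducing it to the single most probable category.

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Collect per-category word counts and document counts (hypothetical helper)."""
    categories = sorted(set(labels))
    word_counts = {c: Counter() for c in categories}
    doc_counts = Counter(labels)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for c in categories for w in word_counts[c]}
    return categories, word_counts, doc_counts, vocab

def bayes_vector(tokens, model):
    """Return P(category | document) for every category, in sorted category order."""
    categories, word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = []
    for c in categories:
        total_words = sum(word_counts[c].values())
        # Log prior: fraction of training documents in this category.
        score = math.log(doc_counts[c] / total_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # Assumed Laplace smoothing so unseen (word, category) pairs stay nonzero.
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        log_scores.append(score)
    # Normalize in log space (log-sum-exp) so the entries sum to 1.
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    norm = sum(exps)
    return [e / norm for e in exps]

# Toy corpus (invented for illustration only).
docs = [
    ["goal", "match", "team"],
    ["team", "score", "goal"],
    ["cpu", "memory", "disk"],
    ["disk", "cpu", "kernel"],
]
labels = ["sport", "sport", "tech", "tech"]

model = train_naive_bayes(docs, labels)
# One probability per category: this whole vector, not just its argmax,
# is what would be passed to the SVM as the document representation.
vec = bayes_vector(["goal", "cpu", "team"], model)
print(dict(zip(model[0], vec)))
```

Each document is thus reduced to a vector whose length equals the number of categories. Training an SVM on these vectors is the second stage of the hybrid and is not shown here.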

Index Terms:
Document and Text Processing < Computing Methodologies, Document analysis, Classifier design and evaluation < Design Methodology < Pattern Recognition < Computing Methodologies, Text processing < Applications < Pattern Recognition < Computing Methodologies
Dino Isa, Lam H. Lee, V.P. Kallimani, R. RajKumar, "Text Document Preprocessing with the Bayes Formula for Classification Using the Support Vector Machine," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, pp. 1264-1272, Sept. 2008, doi:10.1109/TKDE.2008.76