Some Effective Techniques for Naive Bayes Text Classification
November 2006 (vol. 18, no. 11)
pp. 1457-1466
While naive Bayes is quite effective in various data mining tasks, it performs disappointingly in automatic text classification. By observing how naive Bayes handles natural language text, we found a serious problem in the parameter estimation process that causes poor results in the text classification domain. In this paper, we propose two empirical heuristics: per-document text normalization and a feature weighting method. Although these methods are somewhat ad hoc, the resulting naive Bayes text classifier performs very well on standard benchmark collections, competing with state-of-the-art text classifiers based on highly complex learning methods such as SVMs.
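The abstract only names the two heuristics, so the following is a minimal illustrative sketch, not the paper's actual formulas: a multinomial naive Bayes classifier with Laplace smoothing in which raw term frequencies are rescaled to a common document length before parameter estimation, and per-term weights scale each term's contribution to the class score. The class WeightedNB, the constant NORM_LENGTH, and the weighting scheme are assumptions introduced here for illustration.

# Hypothetical sketch: multinomial naive Bayes with per-document length
# normalization and per-feature weighting.  Constants and the weighting
# scheme are illustrative assumptions, not the formulas from the paper.
import math
from collections import defaultdict

NORM_LENGTH = 100.0  # assumed target length for per-document normalization


def normalize_counts(counts, target=NORM_LENGTH):
    """Rescale raw term frequencies so every document has the same total length."""
    total = sum(counts.values())
    if total == 0:
        return dict(counts)
    return {t: c * target / total for t, c in counts.items()}


class WeightedNB:
    def __init__(self, feature_weights=None, alpha=1.0):
        self.w = feature_weights or {}   # per-term weights (assumed given)
        self.alpha = alpha               # Laplace smoothing parameter
        self.term_totals = defaultdict(lambda: defaultdict(float))
        self.class_totals = defaultdict(float)
        self.class_docs = defaultdict(int)
        self.vocab = set()
        self.n_docs = 0

    def fit(self, docs, labels):
        # docs: list of {term: raw count} dictionaries; labels: class names.
        for counts, y in zip(docs, labels):
            counts = normalize_counts(counts)   # per-document normalization
            self.class_docs[y] += 1
            self.n_docs += 1
            for t, c in counts.items():
                self.term_totals[y][t] += c
                self.class_totals[y] += c
                self.vocab.add(t)

    def score(self, counts, y):
        counts = normalize_counts(counts)
        vocab_size = len(self.vocab)
        s = math.log(self.class_docs[y] / self.n_docs)  # log class prior
        for t, c in counts.items():
            p = (self.term_totals[y][t] + self.alpha) / (
                self.class_totals[y] + self.alpha * vocab_size)
            s += self.w.get(t, 1.0) * c * math.log(p)   # feature-weighted term
        return s

    def predict(self, counts):
        return max(self.class_docs, key=lambda y: self.score(counts, y))


# Toy usage with assumed data:
# clf = WeightedNB(feature_weights={"ball": 2.0})
# clf.fit([{"ball": 3, "team": 1}, {"vote": 2}], ["sports", "politics"])
# print(clf.predict({"ball": 1}))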

Index Terms:
Text classification, naive Bayes classifier, Poisson model, feature weighting.
Citation:
Sang-Bum Kim, Kyoung-Soo Han, Hae-Chang Rim, Sung Hyon Myaeng, "Some Effective Techniques for Naive Bayes Text Classification," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 11, pp. 1457-1466, Nov. 2006, doi:10.1109/TKDE.2006.180