Introducing a Family of Linear Measures for Feature Selection in Text Categorization
September 2005 (vol. 17 no. 9)
pp. 1223-1232
Text Categorization, which consists of automatically assigning documents to a set of categories, usually involves the management of a huge number of features. Most of them are irrelevant and others introduce noise which could mislead the classifiers. Thus, feature reduction is often performed in order to increase the efficiency and effectiveness of the classification. In this paper, we propose to select relevant features by means of a family of linear filtering measures which are simpler than the usual measures applied for this purpose. We carry out experiments over two different corpora and find that the proposed measures perform better than the existing ones.

Index Terms:
Text categorization, feature selection, filtering measures, machine learning.
Citation:
Elías F. Combarro, Elena Montañés, Irene Díaz, José Ranilla, Ricardo Mones, "Introducing a Family of Linear Measures for Feature Selection in Text Categorization," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1223-1232, Sept. 2005, doi:10.1109/TKDE.2005.149