Text Clustering with Feature Selection by Using Statistical Data
May 2008 (vol. 20 no. 5)
pp. 641-652
Feature selection is an important method for improving the efficiency and accuracy of text categorization algorithms by removing redundant and irrelevant terms from the corpus. In this paper, we propose a new supervised feature selection method, named CHIR, which is based on the Chi-square statistic and new statistical data that can measure the positive term-category dependency. We also propose a new text clustering algorithm TCFS, which stands for Text Clustering with Feature Selection. TCFS can incorporate CHIR to identify relevant features (i.e., terms) iteratively, and the clustering becomes a learning process. We compared TCFS and the k-means clustering algorithm in combination with different feature selection methods for various real data sets. Our experimental results show that TCFS with CHIR has better clustering accuracy in terms of the F-measure and the purity.
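
As an illustration of the quantities the abstract mentions, the following minimal sketch computes the standard Chi-square statistic for a term and a category from a 2x2 document-count table, checks whether the dependency is positive (the term occurs in the category more often than independence would predict), and computes clustering purity. It is not the CHIR or TCFS method itself, whose definitions are not given here. The function names and the count convention (a, b, c, d) are assumptions made for this example.

```python
from collections import Counter

# Assumed count convention for one term t and one category c:
#   a = documents in c that contain t
#   b = documents outside c that contain t
#   c = documents in c that do not contain t
#   d = documents outside c that do not contain t

def chi_square(a, b, c, d):
    """Standard Chi-square statistic for a 2x2 term-category table."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A row or column is empty, so no dependency can be measured.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom

def is_positive_dependency(a, b, c, d):
    """True if t appears in c more often than expected under independence."""
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    return a > expected

def purity(cluster_labels, true_labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for cluster, label in zip(cluster_labels, true_labels):
        clusters.setdefault(cluster, Counter())[label] += 1
    majority = sum(max(counts.values()) for counts in clusters.values())
    return majority / len(true_labels)

if __name__ == "__main__":
    print(chi_square(10, 0, 0, 10))
    print(is_positive_dependency(10, 0, 0, 10))
    print(purity([0, 0, 1, 1], ["a", "a", "a", "b"]))
```

The Chi-square value alone does not distinguish a term that signals membership in a category from one that signals absence, which is why the sketch reports the direction of the dependency separately.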


Index Terms:
Text clustering, Text mining, Chi-square statistics, Feature selection, Performance analysis
Citation:
Yanjun Li, Congnan Luo, Soon M. Chung, "Text Clustering with Feature Selection by Using Statistical Data," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 641-652, May 2008, doi:10.1109/TKDE.2007.190740