Issue No. 5, May 2008 (vol. 20), pp. 641-652
ABSTRACT
Feature selection improves the efficiency and accuracy of text categorization algorithms by removing redundant and irrelevant terms from the corpus. In this paper, we propose a new supervised feature selection method, named CHIR, which is based on the chi-square statistic and new statistical data that measure the positive term-category dependency. We also propose a new text clustering algorithm, TCFS (Text Clustering with Feature Selection), which incorporates CHIR to identify relevant features (i.e., terms) iteratively, so that clustering becomes a learning process. We compared TCFS with the k-means clustering algorithm combined with different feature selection methods on various real data sets. The experimental results show that TCFS with CHIR achieves better clustering accuracy in terms of the F-measure and the purity.
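For background, the chi-square statistic that CHIR builds on can be computed from a 2x2 term/category contingency table. This is only an illustrative sketch of the standard statistic: the exact CHIR formula and its positive-dependency refinement are defined in the paper itself, and the function name `chi_square_term_category` is our own.

```python
def chi_square_term_category(n11, n10, n01, n00):
    """Standard chi-square statistic for a term/category 2x2 contingency table.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    # A zero denominator means the term or category is degenerate
    # (e.g., the term appears in no document), so no dependency is measurable.
    return numerator / denominator if denominator else 0.0
```

In an iterative scheme like the one the abstract describes, each term would be scored against the cluster labels from the previous iteration, top-scoring terms would be kept as the feature set, and the documents re-clustered until convergence.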
INDEX TERMS
Text clustering, Text mining, Chi-square statistics, Feature selection, Performance analysis
CITATION
Congnan Luo, Yanjun Li, "Text Clustering with Feature Selection by Using Statistical Data," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 641-652, May 2008, doi:10.1109/TKDE.2007.190740
REFERENCES
[1] C.C. Aggarwal and P.S. Yu, “Finding Generalized Projected Clusters in High Dimensional Spaces,” Proc. ACM SIGMOD '00, pp. 70-81, 2000.
[2] L. Bottou and Y. Bengio, “Convergence Properties of the K-means Algorithms,” Advances in Neural Information Processing Systems, vol. 7, pp. 585-592, 1994.
[3] C. Buckley and A.F. Lewit, “Optimization of Inverted Vector Searches,” Proc. Ann. ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '85), pp. 97-110, 1985.
[4] Classic Data Set, ftp://ftp.cs.cornell.edu/pub/smart/, 2008.
[5] M. Dash and H. Liu, “Feature Selection for Classification,” Intelligent Data Analysis, vol. 1, no. 3, pp. 131-156, 1997.
[6] M. Dash and H. Liu, “Feature Selection for Clustering,” Proc. Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD '00), pp. 110-121, 2000.
[7] A.P. Dempster, N.M. Laird, and D.B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” J. Royal Statistical Soc., vol. 39, no. 1, pp. 1-38, 1977.
[8] G. Forman, “Feature Selection: We've Barely Scratched the Surface,” IEEE Intelligent Systems, Nov. 2005.
[9] L. Galavotti, F. Sebastiani, and M. Simi, “Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization,” Proc. Fourth European Conf. Research and Advanced Technology for Digital Libraries (ECDL '00), pp. 59-68, 2000.
[10] M.R. Garey, D.S. Johnson, and H.S. Witsenhausen, “Complexity of the Generalized Lloyd-Max Problem,” IEEE Trans. Information Theory, vol. 28, no. 2, pp. 255-256, 1982.
[11] J.A. Hartigan, Clustering Algorithms. John Wiley & Sons, 1975.
[12] T. Liu, S. Liu, Z. Chen, and W. Ma, “An Evaluation on Feature Selection for Text Clustering,” Proc. Int'l Conf. Machine Learning (ICML '03), 2003.
[13] C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[14] H.T. Ng, W.B. Goh, and K.L. Low, “Feature Selection, Perceptron Learning, and a Usability Case Study for Text Categorization,” Proc. Ann. ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '97), pp. 67-73, 1997.
[15] G.H. John, R. Kohavi, and K. Pfleger, “Irrelevant Features and the Subset Selection Problem,” Proc. Int'l Conf. Machine Learning (ICML '94), pp. 121-129, 1994.
[16] J.R. Quinlan, “Induction of Decision Trees,” Machine Learning, vol. 1, pp. 81-106, 1986.
[17] Reuters-21578 Distribution 1.0, available at http://www.daviddlewis.com/resources/testcollections/reuters21578/, 2008.
[18] C.J. van Rijsbergen, Information Retrieval, second ed. Butterworth, 1979.
[19] F. Sebastiani, “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[20] M. Steinbach, G. Karypis, and V. Kumar, “A Comparison of Document Clustering Techniques,” Proc. KDD Workshop Text Mining, 2000.
[21] W.J. Wilbur and K. Sirotkin, “The Automatic Identification of Stop Words,” J. Information Science, vol. 18, no. 1, pp. 45-55, 1992.
[22] Y. Yang, “Noise Reduction in a Statistical Approach to Text Categorization,” Proc. Ann. ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '95), pp. 256-263, 1995.
[23] Y. Yang and J.O. Pedersen, “A Comparative Study on Feature Selection in Text Categorization,” Proc. Int'l Conf. Machine Learning (ICML '97), pp. 412-420, 1997.
[24] Y. Zhao and G. Karypis, “Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering,” Machine Learning, vol. 55, no. 3, pp. 311-331, 2004.
