
Issue No. 5, May 2008 (vol. 20)

pp: 641-652

ABSTRACT

Feature selection is an important method for improving the efficiency and accuracy of text categorization algorithms, as it removes redundant and irrelevant terms from the corpus. In this paper, we propose a new supervised feature selection method, named CHIR, which is based on the chi-square statistic and a new statistical measure of positive term-category dependency. We also propose a new text clustering algorithm, TCFS, which stands for Text Clustering with Feature Selection. TCFS incorporates CHIR to identify relevant features (i.e., terms) iteratively, so that clustering becomes a learning process. We compared TCFS with the k-means clustering algorithm, each combined with different feature selection methods, on various real data sets. Our experimental results show that TCFS with CHIR achieves better clustering accuracy in terms of the F-measure and the purity.
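As background for the abstract, the chi-square statistic for a term-category pair can be computed from a 2x2 contingency table of document counts. The sketch below illustrates that generic statistic only; it is not the paper's CHIR method, and the function name and example counts are hypothetical. (CHIR additionally distinguishes *positive* dependency, i.e., whether the term occurs in the category more often than expected.)

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term-category 2x2 contingency table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    numerator = n * (a * d - c * b) ** 2
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    # Guard against empty rows/columns in the table.
    return numerator / denominator if denominator else 0.0

# A term concentrated in one category scores high ...
print(chi_square(40, 10, 10, 40))   # -> 36.0
# ... while a term distributed independently of the category scores zero.
print(chi_square(25, 25, 25, 25))   # -> 0.0
```

Note that the chi-square value alone cannot tell positive from negative dependency: a term systematically *absent* from a category scores just as high as one concentrated in it. A simple direction check is whether the observed count `a` exceeds its expected value `(a + b) * (a + c) / n`; this is the kind of positive-dependency information the abstract says CHIR builds on.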

INDEX TERMS

Text clustering, Text mining, Chi-square statistics, Feature selection, Performance analysis

CITATION

Yanjun Li, Congnan Luo, Soon M. Chung, "Text Clustering with Feature Selection by Using Statistical Data",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 20, no. 5, pp. 641-652, May 2008, doi:10.1109/TKDE.2007.190740.
