
Issue No. 04 - April 2011 (vol. 23)

pp: 627-637

Renchu Guan, Jilin University, Changchun

Xiaohu Shi, Jilin University, Changchun

Maurizio Marchese, University of Trento, Trento

Chen Yang, Jilin University, Changchun

Yanchun Liang, Jilin University, Changchun

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.144

ABSTRACT

Based on an effective clustering algorithm—Affinity Propagation (AP)—we present in this paper a novel semisupervised text clustering algorithm, called Seeds Affinity Propagation (SAP). Our approach makes two main contributions: 1) a new similarity metric that captures the structural information of texts, and 2) a novel seed construction method that improves the semisupervised clustering process. To study the performance of the new algorithm, we applied it to the benchmark data set Reuters-21578 and compared it to two state-of-the-art clustering algorithms, namely, the k-means algorithm and the original AP algorithm. Furthermore, we analyzed the individual impact of each of the two proposed contributions. Results show that the proposed similarity metric is more effective in text clustering (F-measures ca. 21 percent higher than with the AP algorithm) and that the proposed semisupervised strategy achieves both better clustering results and faster convergence (using only 76 percent of the iterations required by the original AP). The complete SAP algorithm obtains a higher F-measure (ca. 40 percent improvement over k-means and AP) and lower entropy (ca. 28 percent decrease over k-means and AP), significantly improves clustering execution time (roughly 20 times faster than k-means), and provides enhanced robustness compared with all the other methods.
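The SAP algorithm described in the abstract builds on Frey and Dueck's affinity propagation, which clusters by exchanging "responsibility" and "availability" messages between data points until a set of exemplars emerges. A minimal NumPy sketch of the standard AP message-passing updates is given below; this is the baseline algorithm only, not the paper's SAP variant with its structural similarity metric and seeds, and the toy 2-D data is purely hypothetical.

```python
import numpy as np

def affinity_propagation(S, damping=0.9, iters=200):
    """Plain affinity propagation (Frey & Dueck, Science 2007).
    S is an n x n similarity matrix whose diagonal holds the
    'preference' of each point to become an exemplar."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    for _ in range(iters):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf          # mask out the row maximum
        second_max = AS.max(axis=1)
        Rnew = S - first_max[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = damping * R + (1 - damping) * Rnew   # damped update
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        Anew = Rp.sum(axis=0)[None, :] - Rp
        dA = Anew.diagonal().copy()
        Anew = np.minimum(Anew, 0)
        np.fill_diagonal(Anew, dA)
        A = damping * A + (1 - damping) * Anew
    # Each point's exemplar is the k maximizing a(i,k) + r(i,k).
    return np.argmax(A + R, axis=1)

# Hypothetical toy data: two well-separated groups of 2-D points.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # negative squared distances
np.fill_diagonal(S, np.median(S))                    # shared preference
labels = affinity_propagation(S)
```

Unlike k-means, AP needs no preset cluster count: the number of exemplars falls out of the similarities and the diagonal preferences, which is part of why the paper adopts it as the base algorithm.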

INDEX TERMS

Affinity propagation, text clustering, cofeature set, unilateral feature set, significant cofeature set.
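The abstract reports clustering quality in terms of F-measure and entropy. A sketch of one common pair of definitions for these external clustering metrics is shown below (the paper's exact weighting may differ): the overall F-measure takes, for each true class, the best F score over all clusters, weighted by class size, while entropy averages the class impurity of each cluster, weighted by cluster size.

```python
import math
from collections import Counter

def clustering_f_measure(labels_true, labels_pred):
    """Size-weighted best-match F-measure (higher is better)."""
    n = len(labels_true)
    total = 0.0
    for c in set(labels_true):
        c_idx = {i for i, t in enumerate(labels_true) if t == c}
        best = 0.0
        for k in set(labels_pred):
            k_idx = {i for i, p in enumerate(labels_pred) if p == k}
            overlap = len(c_idx & k_idx)
            if overlap == 0:
                continue
            precision = overlap / len(k_idx)
            recall = overlap / len(c_idx)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(c_idx) / n * best
    return total

def clustering_entropy(labels_true, labels_pred):
    """Size-weighted average class entropy per cluster (lower is better)."""
    n = len(labels_true)
    total = 0.0
    for k in set(labels_pred):
        members = [labels_true[i] for i, p in enumerate(labels_pred) if p == k]
        counts = Counter(members)
        e = -sum((c / len(members)) * math.log(c / len(members), 2)
                 for c in counts.values())
        total += len(members) / n * e
    return total
```

A perfect clustering scores F-measure 1.0 and entropy 0.0; mixing classes within clusters lowers the former and raises the latter, which matches the direction of the improvements the abstract reports for SAP.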

CITATION

Renchu Guan, Xiaohu Shi, Maurizio Marchese, Chen Yang, Yanchun Liang, "Text Clustering with Seeds Affinity Propagation",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 23, no. 4, pp. 627-637, April 2011, doi:10.1109/TKDE.2010.144