Issue No.06 - June (2012 vol.24)

pp: 1002-1013

Yuan Yan Tang, Chongqing University, Chongqing

Bin Fang, Chongqing University, Chongqing

Yong Xiang, Deakin University, Geelong

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.49

ABSTRACT

This paper presents a new spectral clustering method called correlation preserving indexing (CPI), which is performed in the correlation similarity measure space. In this framework, the documents are projected into a low-dimensional semantic space in which the correlations between the documents in the local patches are maximized while the correlations between the documents outside these patches are minimized simultaneously. Since the intrinsic geometrical structure of the document space is often embedded in the similarities between the documents, correlation as a similarity measure is more suitable for detecting the intrinsic geometrical structure of the document space than Euclidean distance. Consequently, the proposed CPI method can effectively discover the intrinsic structures embedded in high-dimensional document space. The effectiveness of the new method is demonstrated by extensive experiments conducted on various data sets and by comparison with existing document clustering methods.
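
To illustrate why correlation may suit documents better than Euclidean distance, the following is a minimal sketch and not the CPI algorithm itself. It assumes that "correlation" means the normalized inner product of two term-frequency vectors, which the abstract does not spell out. The function names and toy vectors are hypothetical and chosen only for illustration.

```python
import math


def correlation(u, v):
    """Normalized inner product of two term-frequency vectors.

    Assumed definition for illustration; the abstract does not define it.
    Returns 0.0 when either vector is all zeros.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Plain Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical term counts over a four-word vocabulary.
short_doc = [2, 1, 0, 0]
long_doc = [20, 10, 0, 0]   # same term proportions, ten times longer
other_doc = [0, 0, 1, 2]    # shares no terms with short_doc

print("corr(short, long): ", correlation(short_doc, long_doc))
print("corr(short, other):", correlation(short_doc, other_doc))
print("dist(short, long): ", euclidean(short_doc, long_doc))
print("dist(short, other):", euclidean(short_doc, other_doc))
```

In this toy case the correlation treats the short and long documents as maximally similar because it ignores document length, whereas the Euclidean distance places the short document closer to the one that shares no terms with it.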

INDEX TERMS

Document clustering, correlation measure, correlation latent semantic indexing, dimensionality reduction.

CITATION

Yuan Yan Tang, Bin Fang, Yong Xiang, "Document Clustering in Correlation Similarity Measure Space",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 6, pp. 1002-1013, June 2012, doi:10.1109/TKDE.2011.49