Document Clustering Using Locality Preserving Indexing
December 2005 (vol. 17 no. 12)
pp. 1624-1637
We propose a novel document clustering method which aims to cluster documents into different semantic classes. The document space is generally of high dimensionality, and clustering in such a high-dimensional space is often infeasible due to the curse of dimensionality. Using Locality Preserving Indexing (LPI), the documents can be projected into a lower-dimensional semantic space in which documents related to the same semantics are close to each other. Unlike previous document clustering methods based on Latent Semantic Indexing (LSI) or Nonnegative Matrix Factorization (NMF), our method tries to discover both the geometric and the discriminating structures of the document space. Theoretical analysis shows that LPI is an unsupervised approximation of the supervised Linear Discriminant Analysis (LDA) method, which provides the intuitive motivation for our method. Extensive experimental evaluations are performed on the Reuters-21578 and TDT2 data sets.
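
The abstract does not state the projection itself. As a rough orientation only, the following sketches the standard Locality Preserving Projection objective from the literature, on which LPI builds. The notation is assumed here and is not taken from the abstract: x_i are document vectors (columns of X), W is a nearest-neighbor similarity matrix, D is its diagonal degree matrix, L = D - W is the graph Laplacian, and a is a projection direction. The paper's exact formulation and preprocessing may differ.

```latex
% Assumed notation (not from the abstract):
%   X = [x_1, ..., x_n]  term-document matrix, one column per document
%   W_{ij}               similarity of x_i and x_j if they are neighbors, else 0
%   D_{ii} = \sum_j W_{ij},  L = D - W
\begin{aligned}
a^{*} &= \arg\min_{a} \sum_{i,j} \left( a^{\top} x_i - a^{\top} x_j \right)^{2} W_{ij}
       = \arg\min_{a} \; 2\, a^{\top} X L X^{\top} a \\
      &\quad \text{subject to} \quad a^{\top} X D X^{\top} a = 1 .
\end{aligned}
% The minimizers are the eigenvectors with the smallest eigenvalues of the
% generalized eigenproblem
%   X L X^{\top} a = \lambda \, X D X^{\top} a .
```

Documents that are neighbors in the original space (large W_{ij}) are penalized for landing far apart after projection, which is the sense in which semantically related documents stay close in the reduced space. A clustering algorithm such as k-means would then be run on the projected coordinates a^T x_i.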

Index Terms:
Document clustering, locality preserving indexing, dimensionality reduction, semantics.
Citation:
Deng Cai, Xiaofei He, Jiawei Han, "Document Clustering Using Locality Preserving Indexing," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 12, pp. 1624-1637, Dec. 2005, doi:10.1109/TKDE.2005.198