CSVD: Clustering and Singular Value Decomposition for Approximate Similarity Search in High-Dimensional Spaces
May/June 2003 (vol. 15 no. 3)
pp. 671-685
Vittorio Castelli
Alexander Thomasian
Chung-Sheng Li, IEEE Computer Society

Abstract—Nearest-neighbor search of high-dimensionality spaces is critical for many applications, such as content-based retrieval from multimedia databases, similarity search of patterns in data mining, and nearest-neighbor classification. Unfortunately, even with the aid of the commonly used indexing schemes, the performance of nearest-neighbor (NN) queries deteriorates rapidly with the number of dimensions. We propose a method, called Clustering with Singular Value Decomposition (CSVD), which supports efficient approximate processing of NN queries, while maintaining good precision-recall characteristics. CSVD groups homogeneous points into clusters and separately reduces the dimensionality of each cluster using SVD. Cluster selection for NN queries relies on a branch-and-bound algorithm and within-cluster searches can be performed with traditional or in-memory indexing methods. Experiments with texture vectors extracted from satellite images show that CSVD achieves significantly higher dimensionality reduction than plain SVD for the same Normalized Mean Squared Error (NMSE), which translates into a higher efficiency in processing approximate NN queries.
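
The core idea in the abstract, clustering first and then reducing each cluster separately with SVD, can be illustrated with a minimal sketch. This is not the authors' implementation: the plain Lloyd k-means, the function names (`kmeans`, `reduce_and_reconstruct`, `csvd_reconstruct`, `nmse`), the synthetic data, and the NMSE definition used here (squared reconstruction error divided by total squared deviation from the global mean) are all assumptions made for illustration. The branch-and-bound cluster selection and the within-cluster index are left out.

```python
import numpy as np

def kmeans(X, k, iters=25, seed=0):
    """Plain Lloyd k-means (an assumed stand-in for the paper's clustering step).
    Returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return labels

def reduce_and_reconstruct(X, p):
    """Project X onto its top-p principal directions (via SVD) and map back."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = vt[:p]  # rows are the retained directions
    return (X - mean) @ basis.T @ basis + mean

def nmse(X, X_hat):
    """Assumed NMSE: reconstruction error over total variance about the global mean."""
    return ((X - X_hat) ** 2).sum() / ((X - X.mean(axis=0)) ** 2).sum()

def csvd_reconstruct(X, k, p, seed=0):
    """Cluster X, then reduce each cluster to p dimensions on its own."""
    labels = kmeans(X, k, seed=seed)
    X_hat = np.empty_like(X, dtype=float)
    for j in range(k):
        mask = labels == j
        if mask.any():
            X_hat[mask] = reduce_and_reconstruct(X[mask], p)
    return X_hat

# Synthetic data (hypothetical): two groups, each close to a different line in 8-D.
rng = np.random.default_rng(1)
a = rng.normal(size=(200, 1)) @ rng.normal(size=(1, 8)) + 0.05 * rng.normal(size=(200, 8))
b = rng.normal(size=(200, 1)) @ rng.normal(size=(1, 8)) + 0.05 * rng.normal(size=(200, 8)) + 5.0
X = np.vstack([a, b])

print("plain SVD NMSE (p=1):", nmse(X, reduce_and_reconstruct(X, 1)))
print("CSVD NMSE (k=2, p=1):", nmse(X, csvd_reconstruct(X, k=2, p=1)))
```

For any fixed partition and the same number of retained dimensions, the per-cluster reconstruction error cannot exceed the global one, because each cluster gets the affine subspace that is optimal for its own points. This is the mechanism behind the abstract's claim that CSVD reaches a given NMSE with fewer dimensions than plain SVD.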

Index Terms:
Multidimensional indexing, singular value decomposition, clustering, multimedia indexing, curse of dimensionality, principal component analysis.
Citation:
Vittorio Castelli, Alexander Thomasian, Chung-Sheng Li, "CSVD: Clustering and Singular Value Decomposition for Approximate Similarity Search in High-Dimensional Spaces," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 671-685, May-June 2003, doi:10.1109/TKDE.2003.1198398