Properties of Embedding Methods for Similarity Searching in Metric Spaces
May 2003 (vol. 25 no. 5)
pp. 530-549

Abstract: Complex data types, such as images, documents, and DNA sequences, are becoming increasingly important in modern database applications. A typical query in many of these applications seeks objects that are similar to some target object, where (dis)similarity is defined by some distance function. Often, the cost of evaluating the distance between two objects is very high. Thus, the number of distance evaluations should be kept to a minimum, while (ideally) maintaining the quality of the result. One way to approach this goal is to embed the data objects in a vector space so that the distances between the embedded objects approximate the actual distances. Queries can then be performed (for the most part) on the embedded objects. In this paper, we are especially interested in examining whether the embedding methods ensure that no relevant objects are left out (i.e., there are no false dismissals and, hence, the correct result is reported). Particular attention is paid to the SparseMap, FastMap, and MetricMap embedding methods. SparseMap is a variant of Lipschitz embeddings, while FastMap and MetricMap are inspired by dimension reduction methods for Euclidean spaces (using KLT or the related PCA and SVD). We show that, in general, none of these embedding methods guarantees that queries on the embedded objects have no false dismissals, and we also identify the limited cases in which the guarantee does hold. Moreover, we describe a variant of SparseMap that allows queries with no false dismissals. In addition, we show that with FastMap and MetricMap, the distances between the embedded objects can be much greater than the actual distances. This makes it impossible (or at least impractical) to modify FastMap and MetricMap to guarantee no false dismissals.
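
The no-false-dismissals property discussed in the abstract can be illustrated with a small sketch. It is not SparseMap, FastMap, or MetricMap. It assumes the simplest Lipschitz-style embedding, in which each coordinate is the distance to a single reference object (a "pivot"), and it compares embedded vectors with the Chebyshev (maximum) distance. By the triangle inequality, |d(o, p) - d(q, p)| <= d(o, q) for every pivot p, so the embedded distance never exceeds the actual distance. Filtering on the embedded distance therefore cannot discard a true answer, as long as d is a metric. The function names, the 2D points, and the use of Euclidean distance as a stand-in for an expensive distance function are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Dist = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    # Stand-in for an expensive metric distance function (assumption).
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def embed(obj: Point, pivots: Sequence[Point], dist: Dist) -> List[float]:
    # One coordinate per pivot: the distance from obj to that pivot.
    return [dist(obj, p) for p in pivots]


def chebyshev(u: Sequence[float], v: Sequence[float]) -> float:
    # Maximum coordinate difference; a lower bound on the actual
    # distance for the pivot embedding above when dist is a metric.
    return max(abs(x - y) for x, y in zip(u, v))


def range_query(
    query: Point,
    radius: float,
    objects: Sequence[Point],
    pivots: Sequence[Point],
    dist: Dist,
) -> Tuple[List[Point], int]:
    # In practice the embedded vectors would be computed once, offline.
    embedded = [embed(o, pivots, dist) for o in objects]
    q_vec = embed(query, pivots, dist)

    # Filter step: cheap comparison in the embedding space.
    candidates = [
        o for o, vec in zip(objects, embedded)
        if chebyshev(q_vec, vec) <= radius
    ]
    # Refine step: evaluate the actual distance only on the candidates.
    results = [o for o in candidates if dist(query, o) <= radius]
    return results, len(candidates)


objects = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (10.0, 0.0), (2.0, 0.0)]
pivots = [(0.0, 0.0), (10.0, 0.0)]
results, n_candidates = range_query((1.0, 0.0), 1.5, objects, pivots, euclidean)
print(results, n_candidates)
```

The refine step removes any false hits that survive the filter; the number of candidates indicates how many actual distance evaluations the query needed beyond the distances from the query to the pivots. If the embedded distance could exceed the actual distance, the filter step might drop a true answer, which is the failure mode the abstract describes.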


Index Terms:
Embedding methods, metric spaces, similarity search, multimedia databases, contractiveness, distortion, quality, Lipschitz embeddings, singular value decomposition (SVD), SparseMap, FastMap, MetricMap.
Citation:
Gísli R. Hjaltason, Hanan Samet, "Properties of Embedding Methods for Similarity Searching in Metric Spaces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 530-549, May 2003, doi:10.1109/TPAMI.2003.1195989