On the 'Dimensionality Curse' and the 'Self-Similarity Blessing'
January/February 2001 (vol. 13 no. 1)
pp. 96-111

Abstract: Spatial queries in high-dimensional spaces have been studied extensively in recent years. Among them, nearest-neighbor queries are important in many settings, including spatial databases (“Find the $k$ closest cities”) and multimedia databases (“Find the $k$ most similar images”). Previous analyses have concluded that nearest-neighbor search is hopeless in high dimensions due to the notorious “curse of dimensionality.” Here, we show that this conclusion may be overly pessimistic. We show that what determines the search performance (at least for R-tree-like structures) is the intrinsic dimensionality of the data set and not the dimensionality of the address space (referred to as the embedding dimensionality). The typical (and often implicit) assumption in many previous studies is that the data is uniformly distributed, with independence between attributes. However, real data sets overwhelmingly violate these assumptions; rather, they are typically skewed and exhibit intrinsic (“fractal”) dimensionalities that are much lower than their embedding dimension, e.g., due to subtle dependencies between attributes. In this paper, we show how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that predict the I/O performance to within one standard deviation on multiple real and synthetic data sets. The practical contributions of this work are our accurate formulas, which can be used for query optimization in spatial and multimedia databases. The major theoretical contribution is the “deflation” of the dimensionality curse: Our formulas and our experiments show that previous worst-case analyses of nearest-neighbor search in high dimensions are overly pessimistic, to the point of being unrealistic. The performance depends critically on the intrinsic (“fractal”) dimensionality as opposed to the embedding dimension that the uniformity and independence assumptions incorrectly imply.
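
To make the distinction between intrinsic and embedding dimensionality concrete, the following is a minimal sketch of one common way to estimate a correlation fractal dimension by box counting. It is not the paper's cost formulas or the authors' code. The function name `correlation_dimension`, the choice of grid scales, and the two synthetic data sets are assumptions made purely for illustration. The idea is to overlay grids of shrinking cell side r, sum the squared fraction of points falling in each cell, and take the slope of the log of that sum against log r.

```python
import math
import random
from collections import Counter


def correlation_dimension(points, num_scales=6):
    """Box-counting estimate of the correlation fractal dimension.

    Fits the slope of log(sum of squared cell occupancies) against
    log(cell side) over grids with 2, 4, ..., 2**num_scales cells per axis.
    """
    dims = len(points[0])
    # Rescale every attribute to [0, 1] so one grid fits all axes.
    lows = [min(p[d] for p in points) for d in range(dims)]
    spans = [(max(p[d] for p in points) - lows[d]) or 1.0 for d in range(dims)]
    total = len(points)

    log_r, log_s = [], []
    for level in range(1, num_scales + 1):
        cells = 2 ** level
        # Count how many points fall into each occupied grid cell.
        occupancy = Counter(
            tuple(min(int((p[d] - lows[d]) / spans[d] * cells), cells - 1)
                  for d in range(dims))
            for p in points
        )
        s = sum((count / total) ** 2 for count in occupancy.values())
        log_r.append(math.log(1.0 / cells))
        log_s.append(math.log(s))

    # Ordinary least-squares slope of log_s versus log_r.
    mean_r = sum(log_r) / len(log_r)
    mean_s = sum(log_s) / len(log_s)
    num = sum((r - mean_r) * (s - mean_s) for r, s in zip(log_r, log_s))
    den = sum((r - mean_r) ** 2 for r in log_r)
    return num / den


# Hypothetical data: three fully dependent attributes (a line in 3-D)
# and two independent uniform attributes (a filled square in 2-D).
rng = random.Random(0)
line_in_3d = [(t, 2 * t + 1, -3 * t) for t in (rng.random() for _ in range(5000))]
square_in_2d = [(rng.random(), rng.random()) for _ in range(20000)]

print("line embedded in 3-D:", round(correlation_dimension(line_in_3d), 2))
print("uniform square in 2-D:",
      round(correlation_dimension(square_in_2d, num_scales=4), 2))
```

In this sketch the line has three attributes but an estimated dimension near 1, because the attributes are fully dependent, while the uniform square comes out near its embedding dimension of 2. The estimate is only meaningful over scales where most occupied cells hold several points, which is why the square uses fewer grid levels.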

Index Terms:
Nearest-neighbor search, multimedia indexing, fractals.
Citation:
Flip Korn, Bernd-Uwe Pagel, Christos Faloutsos, "On the 'Dimensionality Curse' and the 'Self-Similarity Blessing'," IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 1, pp. 96-111, Jan.-Feb. 2001, doi:10.1109/69.908983