CSDL Home IEEE Transactions on Knowledge & Data Engineering 2001 vol.13 Issue No.01 - January/February
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/69.908983
<p><b>Abstract</b>: Spatial queries in high-dimensional spaces have recently been studied extensively. Among them, nearest-neighbor queries are important in many settings, including spatial databases (<it>Find the <tmath>$k$</tmath> closest cities</it>) and multimedia databases (<it>Find the <tmath>$k$</tmath> most similar images</it>). Previous analyses have concluded that nearest-neighbor search is hopeless in high dimensions due to the notorious “curse of dimensionality.” Here, we show that this may be overpessimistic. We show that what determines the search performance (at least for R-tree-like structures) is the <it>intrinsic</it> dimensionality of the data set and <it>not</it> the dimensionality of the address space (referred to as the <it>embedding</it> dimensionality). The typical (and often implicit) assumption in many previous studies is that the data is uniformly distributed, with independence between attributes. However, real data sets overwhelmingly violate these assumptions; rather, they are typically skewed and exhibit intrinsic (“fractal”) dimensionalities that are much lower than their embedding dimension, e.g., due to subtle dependencies between attributes. In this paper, we show how the Hausdorff and Correlation fractal dimensions of a data set can yield extremely accurate formulas that can predict the I/O performance to within one standard deviation on multiple real and synthetic data sets. The practical contributions of this work are our accurate formulas, which can be used for query optimization in spatial and multimedia databases. The major theoretical contribution is the “deflation” of the dimensionality curse: Our formulas and our experiments show that previous worst-case analyses of nearest-neighbor search in high dimensions are overpessimistic to the point of being unrealistic. The performance depends critically on the intrinsic (“fractal”) dimensionality as opposed to the embedding dimension that the uniformity and independence assumptions incorrectly imply.</p>
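To make the distinction between intrinsic and embedding dimensionality concrete, the following is a minimal sketch, not the paper's formulas or its estimation procedure. It approximates a correlation-style fractal dimension by counting point pairs within a radius and fitting the slope of the log-log plot. The function names, the two synthetic data sets, the sample size, and the radii are all assumptions chosen for illustration.

```python
import math
import random


def pair_fraction(points, radius):
    """Fraction of distinct point pairs at Euclidean distance <= radius."""
    n = len(points)
    close = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(points[i], points[j]) <= radius
    )
    return close / (n * (n - 1) / 2)


def correlation_dimension(points, radii):
    """Slope of the least-squares line through (log r, log pair_fraction(r))."""
    samples = [(r, pair_fraction(points, r)) for r in radii]
    # Skip radii with no pairs, since log(0) is undefined.
    logs = [(math.log(r), math.log(f)) for r, f in samples if f > 0]
    mean_x = sum(x for x, _ in logs) / len(logs)
    mean_y = sum(y for _, y in logs) / len(logs)
    num = sum((x - mean_x) * (y - mean_y) for x, y in logs)
    den = sum((x - mean_x) ** 2 for x, _ in logs)
    return num / den


rng = random.Random(0)  # fixed seed so repeated runs give the same estimates
n = 400                 # assumed sample size, small enough for an O(n^2) scan
radii = [0.10, 0.15, 0.20, 0.25, 0.30]  # assumed radii inside the scaling range

# Hypothetical data set A: three perfectly dependent attributes (a diagonal line).
line_points = [(t, t, t) for t in (rng.random() for _ in range(n))]
# Hypothetical data set B: three independent, uniformly distributed attributes.
cube_points = [(rng.random(), rng.random(), rng.random()) for _ in range(n)]

line_dim = correlation_dimension(line_points, radii)
cube_dim = correlation_dimension(cube_points, radii)
print(f"line embedded in 3-D: estimated dimension {line_dim:.2f}")
print(f"uniform in unit cube: estimated dimension {cube_dim:.2f}")
```

Both data sets have embedding dimension 3, yet the estimate for the dependent-attribute set should land close to 1, while the uniform set should land well above 2 (somewhat short of 3, since boundary effects in the unit cube reduce the pair counts at larger radii). This is the kind of gap between intrinsic and embedding dimensionality that the abstract argues governs search performance.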
Index Terms: Nearest-neighbor search, multimedia indexing, fractals.
Bernd-Uwe Pagel, Flip Korn, "On the 'Dimensionality Curse' and the 'Self-Similarity Blessing'", IEEE Transactions on Knowledge & Data Engineering, vol.13, no. 1, pp. 96-111, January/February 2001, doi:10.1109/69.908983