High Dimensional Similarity Joins: Algorithms and Performance Evaluation
January/February 2000 (vol. 12 no. 1)
pp. 3-18

Abstract—Current data repositories include a variety of data types, including audio, images, and time series. State-of-the-art techniques for indexing such data and doing query processing rely on a transformation of data elements into points in a multidimensional feature space. Indexing and query processing then take place in the feature space. In this paper, we study algorithms for finding relationships among points in multidimensional feature spaces, specifically algorithms for multidimensional joins. Like joins of conventional relations, correlations between multidimensional feature spaces can offer valuable information about the data sets involved. We present several algorithmic paradigms for solving the multidimensional join problem and we discuss their features and limitations. We propose a generalization of the Size Separation Spatial Join algorithm, named Multidimensional Spatial Join (MSJ), to solve the multidimensional join problem. We evaluate MSJ along with several other specific algorithms, comparing their performance for various dimensionalities on both real and synthetic multidimensional data sets. Our experimental results indicate that MSJ, which is based on space filling curves, consistently yields good performance across a wide range of dimensionalities.
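To make the problem statement concrete, the following is a minimal sketch of two ideas the abstract mentions; it is not the MSJ algorithm itself. It assumes a join predicate of the form "Euclidean distance at most `eps`", which the abstract does not specify. The function names `brute_force_similarity_join` and `z_order_key` are hypothetical. The first function is the quadratic baseline that any join algorithm must agree with. The second shows one common space filling curve (a Z-order, or Morton, key) built by interleaving the bits of integer grid coordinates.

```python
from itertools import product
from math import dist


def brute_force_similarity_join(a, b, eps):
    """Return all index pairs (i, j) with Euclidean distance(a[i], b[j]) <= eps.

    Assumption: the join predicate is an eps-distance threshold. This is the
    O(|a| * |b|) baseline, not an algorithm from the paper.
    """
    return [
        (i, j)
        for (i, p), (j, q) in product(enumerate(a), enumerate(b))
        if dist(p, q) <= eps
    ]


def z_order_key(cell, bits):
    """Interleave the low `bits` bits of each integer coordinate in `cell`.

    Points that are close in the grid tend to receive nearby keys, which is
    the property that space-filling-curve methods exploit when sorting.
    """
    d = len(cell)
    key = 0
    for bit in range(bits):
        for axis, coord in enumerate(cell):
            key |= ((coord >> bit) & 1) << (bit * d + axis)
    return key


if __name__ == "__main__":
    a = [(0.0, 0.0), (1.0, 1.0)]
    b = [(0.0, 0.5), (5.0, 5.0)]
    print(brute_force_similarity_join(a, b, eps=1.0))
    print(z_order_key((2, 1), bits=2))
```

The brute-force version is useful mainly as a correctness check for faster methods. Its cost grows with the product of the two input sizes, which is what motivates sort-based and partition-based approaches.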

Index Terms:
Spatial join, sort merge joins, multiple-key indexes, data structures.
Nick Koudas, Kenneth C. Sevcik, "High Dimensional Similarity Joins: Algorithms and Performance Evaluation," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 1, pp. 3-18, Jan.-Feb. 2000, doi:10.1109/69.842246