High-Dimensional Similarity Joins
January/February 2002 (vol. 14 no. 1)
pp. 156-171

Abstract—Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present an algorithm built on a new index structure, called the $\epsilon$ tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces both the number of neighboring leaf nodes considered for the join test and the traversal cost of finding the appropriate branches in the internal nodes. Because the storage cost of internal nodes is independent of the number of dimensions, the index structure scales to high-dimensional data. We analyze the cost of the join for the $\epsilon$ tree and for the R-tree family, and show that the $\epsilon$ tree performs better for high-dimensional joins. Empirical evaluation on synthetic and real-life data sets shows that a similarity join using the $\epsilon$ tree is between two times and an order of magnitude faster than one using the $R^+$ tree, with the performance gap increasing with the number of dimensions. We also discuss how some of the ideas behind the $\epsilon$ tree can be applied to the R-tree family. The resulting biased R-trees perform better than the corresponding traditional R-trees for high-dimensional similarity joins, but do not match the performance of the $\epsilon$ tree.
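
To make the term concrete, the following is a minimal sketch of a similarity self-join: it reports every pair of points whose distance is at most a threshold. It is not the $\epsilon$ tree described in the abstract. The Euclidean metric, the function names `brute_force_join` and `striped_join`, and the one-dimensional striping filter are all assumptions chosen for illustration; the striping only shows, in a generic way, why restricting the join test to neighboring partitions can reduce the number of comparisons.

```python
import math
from collections import defaultdict
from itertools import combinations


def brute_force_join(points, eps):
    """Return all index pairs (i, j), i < j, with Euclidean distance <= eps."""
    return {
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if math.dist(p, q) <= eps
    }


def striped_join(points, eps, dim=0):
    """Same result as brute_force_join, but a point is only compared with
    points in its own eps-wide stripe or the adjacent one along `dim`.
    (Hypothetical illustration, not the index structure from the paper.)"""
    stripes = defaultdict(list)
    for i, p in enumerate(points):
        stripes[math.floor(p[dim] / eps)].append(i)

    result = set()
    for s, members in stripes.items():
        # Candidate pairs inside one stripe.
        for i, j in combinations(members, 2):
            if math.dist(points[i], points[j]) <= eps:
                result.add((min(i, j), max(i, j)))
        # Candidate pairs between this stripe and the next one. Stripes two
        # or more apart differ by more than eps along `dim`, so they are skipped.
        for i in members:
            for j in stripes.get(s + 1, []):
                if math.dist(points[i], points[j]) <= eps:
                    result.add((min(i, j), max(i, j)))
    return result
```

Calling `striped_join(points, eps)` on a list of equal-length coordinate tuples returns the same set of index pairs as `brute_force_join(points, eps)`; the difference lies only in how many distance computations are performed.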

Index Terms:
Data mining, similar time sequences, similarity join
Citation:
K. Shim, R. Srikant, R. Agrawal, "High-Dimensional Similarity Joins," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 1, pp. 156-171, Jan.-Feb. 2002, doi:10.1109/69.979979