Indexing High-Dimensional Data for Efficient In-Memory Similarity Search
March 2005 (vol. 17 no. 3)
pp. 339-353
Jianwen Su, IEEE
Kian-Lee Tan, IEEE Computer Society
In main memory systems, the L2 cache typically employs cache line sizes of 32-128 bytes. These values are relatively small compared to high-dimensional data, e.g., > 32D. The consequence is that existing techniques (on low-dimensional data) that minimize cache misses are no longer effective. In this paper, we present a novel index structure, called the Δ-tree, to speed up high-dimensional queries in a main memory environment. The Δ-tree is a multilevel structure where each level represents the data space at different dimensionalities: the number of dimensions increases toward the leaf level. The remaining dimensions are obtained using Principal Component Analysis. Each level of the tree serves to prune the search space more efficiently as the lower dimensions can reduce the distance computation and better exploit the small cache line size. Additionally, the top-down clustering scheme can capture the features of the data set and, hence, reduce the search space. We also propose an extension, called the Δ⁺-tree, that globally clusters the data space and then partitions clusters into small regions. The Δ⁺-tree can further reduce the computational cost and cache misses. We conducted extensive experiments to evaluate the proposed structures against existing techniques on different kinds of data sets. Our results show that the Δ⁺-tree is superior in most cases.
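
The pruning idea in the abstract rests on a simple property: after an orthonormal transform such as PCA, the Euclidean distance computed over only the first k coordinates never exceeds the full distance, so it can be used as a cheap lower bound. The sketch below illustrates that property only. It is not the Δ-tree itself: it uses a flat scan instead of a clustered tree, assumes the points have already been rotated into PCA order, and the names `nearest_neighbor` and `levels` are hypothetical.

```python
import math


def nearest_neighbor(query, points, levels):
    """Nearest-neighbor scan that checks each point at increasing dimensionality.

    Assumption: `query` and `points` are already expressed in an orthonormal
    basis (for example, PCA components ordered by decreasing variance).
    `levels` is an increasing list of prefix lengths ending with the full
    dimensionality, e.g. [4, 8, 16] for 16-D data.

    Returns (best_index, best_distance, full_evals), where full_evals counts
    the points that were examined in all dimensions.
    """
    full_dim = len(query)
    if not levels or levels[-1] != full_dim:
        raise ValueError("levels must end with the full dimensionality")

    best_idx, best_sq, full_evals = -1, math.inf, 0
    for idx, p in enumerate(points):
        sq, start = 0.0, 0
        for k in levels:
            # Extend the partial squared distance from `start` to `k` dimensions.
            sq += sum((query[i] - p[i]) ** 2 for i in range(start, k))
            start = k
            if sq >= best_sq:
                break  # the lower bound already rules this point out
        else:
            # Every level passed, so `sq` is the full squared distance.
            best_idx, best_sq = idx, sq
        if start == full_dim:
            full_evals += 1
    return best_idx, math.sqrt(best_sq), full_evals
```

A call such as `nearest_neighbor(q, data, [4, 8, 16])` returns the same neighbor as a full scan, but many points are typically rejected after only the first few coordinates have been read. That is the effect the abstract attributes to the upper, low-dimensional levels of the tree: fewer arithmetic operations and fewer bytes touched per rejected point.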

Index Terms:
High-dimensional index, main memory, similarity query.
Citation:
Bin Cui, Beng Chin Ooi, Jianwen Su, Kian-Lee Tan, "Indexing High-Dimensional Data for Efficient In-Memory Similarity Search," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 3, pp. 339-353, March 2005, doi:10.1109/TKDE.2005.46