Fast Indexing and Visualization of Metric Data Sets using Slim-Trees
March/April 2002 (vol. 14 no. 2)
pp. 244-260

Many recent database applications must deal with similarity queries. For such applications, it is important to measure the similarity between two objects using the distance between them. Focusing on this problem, this paper proposes the Slim-tree, a new dynamic tree for organizing metric data sets in pages of fixed size. The Slim-tree uses the triangle inequality to prune distance calculations needed to answer similarity queries over objects in metric spaces. The proposed insertion algorithm uses new policies to select the nodes where incoming objects are stored. When a node overflows, the Slim-tree uses a Minimal Spanning Tree to help with the split. The new insertion algorithm leads to a tree with high storage utilization and improved query performance. The Slim-tree is the first metric access method to tackle the problem of overlap between nodes in metric spaces and to propose a technique to minimize it. The proposed “fat-factor” is a way to quantify whether a given tree can be improved and also to compare two trees. We show how to use the fat-factor to achieve accurate estimates of the search performance and also how to improve the performance of a metric tree through the proposed “Slim-down” algorithm. This paper also presents a new tool for visualizing the Slim-tree. Visualization is a powerful tool for interactive data mining and for the visual tracking of the behavior of a tree under updates. Finally, we present a formula to estimate the number of disk accesses in range queries. Results from experiments with real and synthetic data sets show that the new algorithms of the Slim-tree lead to performance improvements. These results show that the Slim-tree outperforms the M-tree by up to 200 percent for range queries. For insertion and split, the Minimal-Spanning-Tree-based algorithm achieves up to 40 times faster insertions. We observed improvements of up to 40 percent in range queries after applying the Slim-down algorithm.
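
The triangle-inequality pruning mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a single node holding a representative object and, for each member, a precomputed distance to that representative. The names range_query, rep, and members are hypothetical, and Euclidean distance stands in for an arbitrary metric. For a query q with radius r, any member o with |d(q, rep) - d(o, rep)| > r must satisfy d(q, o) > r, so its distance to q never needs to be computed.

```python
import math

def range_query(rep, members, query, radius, dist=math.dist):
    """Range query over one node, pruning with the triangle inequality.

    rep     -- the node's representative object (hypothetical layout)
    members -- list of (obj, d_obj_rep) pairs, where d_obj_rep = dist(obj, rep)
               was stored when the object was inserted
    Returns (hits, computed): the qualifying objects and the number of
    distance computations actually performed.
    """
    d_q_rep = dist(query, rep)  # one distance computation, shared by all members
    computed = 1
    hits = []
    for obj, d_obj_rep in members:
        # Triangle inequality: dist(query, obj) >= |d_q_rep - d_obj_rep|.
        # If that lower bound already exceeds the radius, obj cannot qualify.
        if abs(d_q_rep - d_obj_rep) > radius:
            continue
        computed += 1
        if dist(query, obj) <= radius:
            hits.append(obj)
    return hits, computed

# Illustrative data only: four 2-D points around a representative at the origin.
rep = (0.0, 0.0)
points = [(1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (9.0, 0.0)]
members = [(p, math.dist(p, rep)) for p in points]
hits, computed = range_query(rep, members, query=(1.0, 1.0), radius=1.5)
```

In this toy example the two distant points are discarded using only stored distances, so three distance computations are performed, compared with four for a scan that compares the query against every point.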

Index Terms:
metric databases, metric access methods, index structures, multimedia databases, selectivity estimation, similarity search
Citation:
C. Traina, Jr., A. Traina, C. Faloutsos, B. Seeger, "Fast Indexing and Visualization of Metric Data Sets using Slim-Trees," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 2, pp. 244-260, March-April 2002, doi:10.1109/69.991715