D-Cache: Universal Distance Cache for Metric Access Methods
May 2012 (vol. 24 no. 5)
pp. 868-881
Tomáš Skopal, Charles University, Prague
Jakub Lokoč, Charles University, Prague
Benjamin Bustos, University of Chile, Santiago
The caching of accessed disk pages has been used successfully for decades in database technology, resulting in effective amortization of the I/O operations needed within a stream of query or update requests. However, in modern complex databases, such as multimedia databases, the I/O cost becomes a minor performance factor. In particular, metric access methods (MAMs), used for similarity search in complex unstructured data, have been designed to minimize the number of distance computations rather than the I/O cost (when indexing or querying). Inspired by I/O caching in traditional databases, in this paper we introduce the idea of distance caching for use with MAMs, a novel approach to streamlining similarity search. As a result, we present the D-cache, a main-memory data structure that can be easily integrated into any MAM in order to save distance computations spent by queries and updates. In particular, we have modified two state-of-the-art MAMs to make use of the D-cache: the M-tree and Pivot tables. Moreover, we present the D-file, an index-free MAM based on simple sequential search augmented by the D-cache. The experimental evaluation shows that the performance gain achieved by the D-cache is significant for all the MAMs, especially for the D-file.
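
The abstract does not describe the internals of the D-cache, so the following Python sketch only illustrates the general idea of distance caching combined with sequential search. It assumes that distances computed for earlier queries are kept in memory and that earlier query objects act as pivots, so the triangle inequality gives a lower bound |d(q, p) - d(p, x)| <= d(q, x) that can rule out an object without computing d(q, x). The names `DistanceCache`, `lower_bound`, and `range_query`, the string identifiers, and the unbounded dictionary are all hypothetical choices made for this example, not the authors' data structure or API.

```python
class DistanceCache:
    """Hypothetical in-memory store of computed distances (illustration only)."""

    def __init__(self, metric):
        self.metric = metric      # the (assumed expensive) metric distance function
        self.store = {}           # frozenset({id_a, id_b}) -> distance
        self.computations = 0     # number of real evaluations of the metric

    def cached(self, id_a, id_b):
        """Return the stored distance for the pair, or None if it is unknown."""
        return self.store.get(frozenset((id_a, id_b)))

    def distance(self, id_a, a, id_b, b):
        """Return d(a, b), evaluating the metric only on a cache miss."""
        key = frozenset((id_a, id_b))
        if key not in self.store:
            self.computations += 1
            self.store[key] = self.metric(a, b)
        return self.store[key]

    def lower_bound(self, q_id, x_id, pivot_ids):
        """Largest triangle-inequality lower bound on d(q, x) from cached pairs."""
        best = 0.0
        for p_id in pivot_ids:
            dq = self.cached(q_id, p_id)
            dx = self.cached(x_id, p_id)
            if dq is not None and dx is not None:
                best = max(best, abs(dq - dx))
        return best


def range_query(cache, db, q_id, q, radius, past_queries):
    """Sequential range search that skips objects excluded by the lower bound."""
    # Distances from the new query to earlier queries (the assumed pivots).
    for p_id, p in past_queries.items():
        cache.distance(q_id, q, p_id, p)
    result = []
    for x_id, x in db.items():
        if cache.lower_bound(q_id, x_id, past_queries) > radius:
            continue  # d(q, x) > radius is guaranteed, no computation needed
        if cache.distance(q_id, q, x_id, x) <= radius:
            result.append(x_id)
    past_queries[q_id] = q  # this query can serve as a pivot later
    return result


# Toy data: points on a line with the absolute difference as the metric.
db = {f"o{i}": float(i) for i in range(10)}
cache = DistanceCache(lambda a, b: abs(a - b))
past = {}
first = range_query(cache, db, "q1", 2.0, 1.5, past)   # 10 distance computations
second = range_query(cache, db, "q2", 2.5, 1.0, past)  # 4 distance computations
print(first, second, cache.computations)
```

In this toy run the second query computes one distance to the earlier query and three distances to database objects, instead of the ten a plain sequential scan would need. A practical structure would also need a bounded size and a replacement policy, which this sketch leaves out.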

Index Terms:
Metric indexing, similarity search, distance caching, metric access methods, D-cache, MAM, index-free search.
Tomáš Skopal, Jakub Lokoč, Benjamin Bustos, "D-Cache: Universal Distance Cache for Metric Access Methods," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 5, pp. 868-881, May 2012, doi:10.1109/TKDE.2011.19