Issue No. 05 - May (2012 vol. 24)

pp. 868-881

Tomáš Skopal, Charles University, Prague

Jakub Lokoč, Charles University, Prague

Benjamin Bustos, University of Chile, Santiago

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.19

ABSTRACT

The caching of accessed disk pages has been successfully used for decades in database technology, resulting in effective amortization of I/O operations needed within a stream of query or update requests. However, in modern complex databases, such as multimedia databases, the I/O cost becomes a minor performance factor. In particular, metric access methods (MAMs), used for similarity search in complex unstructured data, have been designed to minimize the number of distance computations rather than the I/O cost (when indexing or querying). Inspired by I/O caching in traditional databases, in this paper we introduce the idea of distance caching for use with MAMs, a novel approach to streamlining similarity search. As a result, we present the D-cache, a main-memory data structure that can be easily incorporated into any MAM in order to save distance computations spent by queries and updates. In particular, we have modified two state-of-the-art MAMs to make use of the D-cache: the M-tree and Pivot tables. Moreover, we present the D-file, an index-free MAM based on simple sequential search augmented by the D-cache. The experimental evaluation shows that the performance gain achieved by the D-cache is significant for all the MAMs, especially for the D-file.
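
To make the idea of distance caching concrete, the following is a minimal sketch and not the D-cache described in the paper. It assumes an unbounded in-memory dictionary, treats earlier query objects as pivots, and uses the triangle inequality (|d(q, p) - d(o, p)| is a lower bound on d(q, o) for any metric d) to skip objects during a sequential scan. The names `ToyDistanceCache`, `range_query`, and the string identifiers are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple


class ToyDistanceCache:
    """Illustrative cache of computed distances (assumption: unbounded size)."""

    def __init__(self, dist: Callable[[float, float], float]) -> None:
        self._dist = dist
        self._cache: Dict[Tuple[str, str], float] = {}
        self.computations = 0  # counts real evaluations of the metric

    def get(self, a_id: str, b_id: str) -> Optional[float]:
        # Distances are symmetric, so the key is order-independent.
        return self._cache.get((min(a_id, b_id), max(a_id, b_id)))

    def compute(self, a_id: str, a: float, b_id: str, b: float) -> float:
        key = (min(a_id, b_id), max(a_id, b_id))
        if key not in self._cache:
            self._cache[key] = self._dist(a, b)
            self.computations += 1
        return self._cache[key]

    def lower_bound(self, q_id: str, o_id: str, pivot_ids: Sequence[str]) -> float:
        # Triangle inequality: d(q, o) >= |d(q, p) - d(o, p)| for every pivot p.
        bound = 0.0
        for p_id in pivot_ids:
            d_qp = self.get(q_id, p_id)
            d_op = self.get(o_id, p_id)
            if d_qp is not None and d_op is not None:
                bound = max(bound, abs(d_qp - d_op))
        return bound


def range_query(cache: ToyDistanceCache, q_id: str, q: float, radius: float,
                database: Sequence[float], past_queries: Dict[str, float]) -> List[int]:
    """Sequential scan that skips objects whose cached lower bound exceeds radius."""
    # Assumption: earlier queries act as pivots, so d(q, p) is computed up front.
    for p_id, p in past_queries.items():
        cache.compute(q_id, q, p_id, p)
    pivot_ids = list(past_queries)
    hits = []
    for i, obj in enumerate(database):
        o_id = f"o{i}"
        if cache.lower_bound(q_id, o_id, pivot_ids) > radius:
            continue  # filtered without evaluating the metric
        if cache.compute(q_id, q, o_id, obj) <= radius:
            hits.append(i)
    return hits


# Hypothetical data: points on a line under the absolute-difference metric.
database = [1, 3, 10, 11]
cache = ToyDistanceCache(lambda a, b: abs(a - b))
first = range_query(cache, "q1", 0, 1, database, {})
second = range_query(cache, "q2", 10, 1, database, {"q1": 0})
```

In this toy run the first query evaluates the metric for every object, while the second query reuses the cached distances to the first query object and evaluates the metric only for the objects it cannot rule out. A real cache would also need a size limit and a replacement policy, which this sketch leaves out.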

INDEX TERMS

Metric indexing, similarity search, distance caching, metric access methods, D-cache, MAM, index-free search.

CITATION

Tomáš Skopal, Jakub Lokoč, Benjamin Bustos, "D-Cache: Universal Distance Cache for Metric Access Methods", *IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 5, pp. 868-881, May 2012, doi:10.1109/TKDE.2011.19