Issue No.05 - May (2009 vol.31)
pp: 869-883
Herwig Lejsek, Reykjavik University, Reykjavik
Friðrik Heiðar Ásmundsson, Reykjavik University, Reykjavik
Björn Þór Jónsson, Reykjavik University, Reykjavik
Laurent Amsaleg, CNRS-IRISA, Rennes
Over the last two decades, much research effort has been spent on nearest neighbor search in high-dimensional data sets. Most of the approaches published thus far have, however, only been tested on rather small collections. When large collections have been considered, high-performance environments have been used, in particular systems with a large main memory. Accessing data on disk has largely been avoided because disk operations are considered to be too slow. It has been shown, however, that using large amounts of memory is generally not an economic choice. Therefore, we propose the NV-tree, which is a very efficient disk-based data structure that can give good approximate answers to nearest neighbor queries with a single disk operation, even for very large collections of high-dimensional data. Using a single NV-tree, the returned results have high recall but contain a number of false positives. By combining two or three NV-trees, most of those false positives can be avoided while retaining the high recall. Finally, we compare the NV-tree to Locality Sensitive Hashing, a popular method for $\epsilon$-distance search. We show that they return results of similar quality, but the NV-tree uses many fewer disk reads.
High-dimensional indexing, multimedia indexing, very large databases, approximate searches.
Herwig Lejsek, Friðrik Heiðar Ásmundsson, Björn Þór Jónsson, Laurent Amsaleg, "NV-Tree: An Efficient Disk-Based Index for Approximate Search in Very Large High-Dimensional Collections", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 31, no. 5, pp. 869-883, May 2009, doi:10.1109/TPAMI.2008.130
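The abstract states that a single index returns high-recall results with some false positives, and that combining two or three indexes removes most of them. The following is a minimal sketch of that general idea only, not the NV-tree itself: it assumes each index is a toy structure that ranks points by their position along one random line, and that results are combined by a simple vote. The names `ToyProjectionIndex`, `candidates`, `combine`, and `min_votes` are hypothetical and do not come from the paper; a real index would read one leaf from disk instead of scanning all projections.

```python
import random
from collections import Counter


def random_unit_vector(dim, rng):
    """Draw a random direction (assumption: one projection line per toy index)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def project(point, line):
    """Scalar position of a point along a line through the origin."""
    return sum(p * l for p, l in zip(point, line))


class ToyProjectionIndex:
    """Hypothetical stand-in for one index: keeps each point's 1-D projection."""

    def __init__(self, points, rng):
        self.line = random_unit_vector(len(points[0]), rng)
        self.proj = [project(p, self.line) for p in points]

    def candidates(self, query, size):
        """Ids of the `size` points whose projections lie closest to the query's.

        Points that are far apart in the full space can still project close
        together, which is one source of false positives in a single index.
        """
        q = project(query, self.line)
        order = sorted(range(len(self.proj)), key=lambda i: abs(self.proj[i] - q))
        return order[:size]


def combine(candidate_lists, min_votes):
    """Keep only the ids returned by at least `min_votes` of the indexes."""
    votes = Counter(i for lst in candidate_lists for i in set(lst))
    return {i for i, n in votes.items() if n >= min_votes}


# Usage: three toy indexes over the same 500 random 16-dimensional points.
rng = random.Random(0)
points = [[rng.random() for _ in range(16)] for _ in range(500)]
indexes = [ToyProjectionIndex(points, rng) for _ in range(3)]
query = list(points[7])  # an exact copy of a stored point
lists = [ix.candidates(query, 20) for ix in indexes]
survivors = combine(lists, min_votes=2)
```

Because the query is a copy of point 7, that point projects to the same position as the query on every line and therefore survives any vote threshold, while a point that is close on only one line is dropped once `min_votes` is 2 or more. How the actual NV-tree partitions data and aggregates results is described in the paper, not in this sketch.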