Issue No. 12 - Dec. 2012 (vol. 34)
pp. 2393-2406
Jun Wang , Bus. Analytics & Math. Sci. Dept., IBM T.J. Watson Res. Center, Yorktown Heights, NY, USA
S. Kumar , Google Res., New York, NY, USA
Shih-Fu Chang , Dept. of Electr. & Comput. Eng., Columbia Univ., New York, NY, USA
ABSTRACT
Hashing-based approximate nearest neighbor (ANN) search in huge databases has become popular because of its computational and memory efficiency. Popular hashing methods, e.g., Locality Sensitive Hashing and Spectral Hashing, construct hash functions based on random or principal projections, and the resulting hashes are either not very accurate or inefficient. Moreover, these methods are designed for a given metric similarity, whereas semantic similarity is usually given in terms of pairwise labels of samples. There exist supervised hashing methods that can handle such semantic similarity, but they are prone to overfitting when the labeled data are scarce or noisy. In this work, we propose a semi-supervised hashing (SSH) framework that minimizes empirical error over the labeled set and an information-theoretic regularizer over both labeled and unlabeled sets. Based on this framework, we present three semi-supervised hashing methods: orthogonal hashing, nonorthogonal hashing, and sequential hashing. In particular, the sequential method generates robust codes in which each hash function is designed to correct the errors made by the previous ones. We further show that the sequential learning paradigm can be extended to unsupervised settings where no labeled pairs are available. Extensive experiments on four large datasets (up to 80 million samples) demonstrate the superior performance of the proposed SSH methods over state-of-the-art supervised and unsupervised hashing techniques.
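
To make the framework described in the abstract concrete, the following Python sketch illustrates one plausible reading of semi-supervised sequential projection learning: each linear projection is taken as the leading eigenvector of a matrix that combines a fit to the pairwise-label matrix (empirical error on the labeled set) with a variance term over all data (the regularizer), and the label matrix is reweighted after each bit so that later projections concentrate on pairs the earlier bits handled poorly. This is an illustrative sketch only, not the authors' implementation; the function names (fit_ssh_sequential, hash_codes), the parameter values, and the exact reweighting step are assumptions made for illustration.

# Minimal sketch of semi-supervised sequential hashing, NOT the authors'
# reference implementation. The residual update of the pairwise-label matrix
# below is an illustrative assumption, not the paper's exact formula.
import numpy as np

def fit_ssh_sequential(X, X_labeled, S, n_bits=32, eta=0.5, alpha=0.1):
    """Learn n_bits linear hash projections sequentially.

    X          : (d, n)  all data (labeled + unlabeled), assumed zero-centered
    X_labeled  : (d, l)  labeled subset of X
    S          : (l, l)  pairwise labels: +1 similar, -1 dissimilar, 0 unknown
    eta        : weight of the variance (information-theoretic) regularizer
    alpha      : step size for the boosting-like residual update of S
    """
    d = X.shape[0]
    W = np.zeros((d, n_bits))
    S_res = S.astype(float).copy()
    for k in range(n_bits):
        # "Adjusted covariance": fit to the (residual) pairwise labels plus
        # eta times the data covariance, which pushes each bit toward a
        # balanced, high-variance partition of all points.
        M = X_labeled @ S_res @ X_labeled.T + eta * (X @ X.T)
        # The best single projection is the leading eigenvector of M.
        eigvals, eigvecs = np.linalg.eigh(M)
        w = eigvecs[:, -1]
        W[:, k] = w
        # Sequential error correction (assumed update): pairs the current bit
        # handles correctly lose weight, violated pairs gain weight, so the
        # next projection focuses on the remaining errors.
        h = np.sign(w @ X_labeled)            # +/-1 codes of labeled points
        agreement = np.outer(h, h)            # +1 if a pair is hashed together
        S_res = S_res - alpha * agreement * (np.abs(S) > 0)
    return W

def hash_codes(X, W):
    """Binary codes in {0,1}: threshold the learned projections at zero."""
    return (W.T @ X > 0).astype(np.uint8)

# Toy usage: two labeled "classes" plus unlabeled points.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 200))
X -= X.mean(axis=1, keepdims=True)
X_l = X[:, :20]
labels = np.repeat([0, 1], 10)
S = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
W = fit_ssh_sequential(X, X_l, S, n_bits=8)
codes = hash_codes(X, W)
print(codes.shape)  # (8, 200)

In this reading, setting alpha to zero recovers a non-sequential variant in which all bits are learned from the same objective, while larger alpha makes each bit depend more strongly on the errors of the previous ones.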
INDEX TERMS
learning (artificial intelligence), content-based retrieval, file organisation, image retrieval, content-based image retrieval, hashing, locality-sensitive hashing, spectral hashing, orthogonal hashing, nonorthogonal hashing, sequential hashing, semi-supervised hashing, semi-supervised learning, large-scale search, approximate nearest neighbor (ANN) search, nearest neighbor search, computational efficiency, memory efficiency, random projections, principal projections, semantic similarity, pairwise labels, SSH framework, information-theoretic regularizer, unlabeled sets, sequential learning, sequential analysis, binary codes, encoding, semantics
CITATION
Jun Wang, S. Kumar, and Shih-Fu Chang, "Semi-Supervised Hashing for Large-Scale Search," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 12, pp. 2393-2406, Dec. 2012, doi: 10.1109/TPAMI.2012.48.