Issue No.03 - March (2014 vol.36)
pp: 521-535
Jose Costa Pereira, Dept. of Electr. & Comput. Eng., Univ. of California, San Diego, La Jolla, CA, USA
Emanuele Coviello, Dept. of Electr. & Comput. Eng., Univ. of California, San Diego, La Jolla, CA, USA
Gabriel Doyle, Dept. of Linguistics, Univ. of California, San Diego, La Jolla, CA, USA
Nikhil Rasiwasia, Yahoo! Labs, Bangalore, India
Gert R. G. Lanckriet, Dept. of Electr. & Comput. Eng., Univ. of California, San Diego, La Jolla, CA, USA
Roger Levy, Dept. of Linguistics, Univ. of California, San Diego, La Jolla, CA, USA
Nuno Vasconcelos, Dept. of Electr. & Comput. Eng., Univ. of California, San Diego, La Jolla, CA, USA
ABSTRACT
The problem of cross-modal retrieval from multimedia repositories is considered. This problem addresses the design of retrieval systems that support queries across content modalities, for example, using an image to search for texts. A mathematical formulation is proposed, equating the design of cross-modal retrieval systems to that of isomorphic feature spaces for different content modalities. Two hypotheses are then investigated regarding the fundamental attributes of these spaces. The first is that low-level cross-modal correlations should be accounted for. The second is that the space should enable semantic abstraction. Three new solutions to the cross-modal retrieval problem are then derived from these hypotheses: correlation matching (CM), an unsupervised method that models cross-modal correlations; semantic matching (SM), a supervised technique that relies on semantic representation; and semantic correlation matching (SCM), which combines both. An extensive evaluation of retrieval performance is conducted to test the validity of the hypotheses. All approaches are shown to be successful for text retrieval in response to image queries and vice versa. It is concluded that both hypotheses hold, in a complementary form, although the evidence in favor of the abstraction hypothesis is stronger than that for the correlation hypothesis.
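As an illustration of the correlation matching idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes linear canonical correlation analysis as the correlation model (the abstract does not name one), uses synthetic vectors as hypothetical stand-ins for image and text features, and ranks texts by Euclidean distance in the shared subspace. The function names, dimensions, and regularization constant are all illustrative choices.

```python
import numpy as np


def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T


def fit_cca(X, Y, dim, reg=1e-3):
    """Linear CCA (assumed correlation model).

    Returns the feature means, one projection matrix per modality,
    and the leading `dim` canonical correlations.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Regularized within-modality covariances and the cross-covariance.
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten each modality, then take the SVD of the whitened cross-covariance.
    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    return mx, my, Kx @ U[:, :dim], Ky @ Vt.T[:, :dim], s[:dim]


# Synthetic paired documents: both modalities are noisy linear views of a
# shared latent vector (a hypothetical stand-in for real image/text features).
rng = np.random.default_rng(0)
n, k = 200, 4
latent = rng.normal(size=(n, k))
images = latent @ rng.normal(size=(k, 10)) + 0.01 * rng.normal(size=(n, 10))
texts = latent @ rng.normal(size=(k, 8)) + 0.01 * rng.normal(size=(n, 8))

mx, my, Wx, Wy, corrs = fit_cca(images, texts, dim=k)

# Project both modalities into the shared, maximally correlated subspace.
img_proj = (images - mx) @ Wx
txt_proj = (texts - my) @ Wy

# Image query -> rank all texts by Euclidean distance in the shared subspace.
dists = np.linalg.norm(img_proj[:, None, :] - txt_proj[None, :, :], axis=2)
ranking = np.argsort(dists, axis=1)

# Fraction of image queries whose paired text is ranked first.
top1_accuracy = float(np.mean(ranking[:, 0] == np.arange(n)))
print("canonical correlations:", corrs)
print("top-1 accuracy:", top1_accuracy)
```

Text-to-image retrieval follows from the same projections by transposing `dists`, so that each text query ranks the images.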
INDEX TERMS
Semantics, Correlation, Multimedia communication, Joints, Hidden Markov models, Vectors, Databases, logistic regression, Multimedia, content-based retrieval, multimodal, cross-modal, image and text, retrieval model, semantic spaces, kernel correlation
CITATION
Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Nikhil Rasiwasia, Gert R. G. Lanckriet, Roger Levy, Nuno Vasconcelos, "On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 36, no. 3, pp. 521-535, March 2014, doi:10.1109/TPAMI.2013.142