Reading between the Lines: Object Localization Using Implicit Cues from Image Tags
June 2012 (vol. 34, no. 6)
pp. 1145-1158
Sung Ju Hwang, Dept. of Comput. Sci., Univ. of Texas at Austin, Austin, TX, USA
K. Grauman, Dept. of Comput. Sci., Univ. of Texas at Austin, Austin, TX, USA
Current uses of tagged images typically exploit only the most explicit information: the link between the nouns named and the objects present somewhere in the image. We propose to leverage “unspoken” cues that rest within an ordered list of image tags to improve object localization. We define three novel implicit features from an image's tags: the relative prominence of each object as signified by its order of mention, the scale constraints implied by unnamed objects, and the loose spatial links hinted at by the proximity of names on the list. By learning a conditional density over the localization parameters (position and scale) given these cues, we show how to improve both accuracy and efficiency when detecting the tagged objects. Furthermore, we show how the localization density can be learned in a semantic space shared by the visual and tag-based features, which makes the technique applicable for detection in untagged input images. We validate our approach on the PASCAL VOC, LabelMe, and Flickr image data sets, and demonstrate its effectiveness relative to both traditional sliding-window search and a visual context baseline. Our algorithm improves on state-of-the-art methods, successfully translating insights about human viewing behavior (such as attention, perceived importance, or gaze) into enhanced object detection.
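To make the three implicit cues concrete, here is a minimal sketch of one plausible encoding of an ordered tag list, written in Python. The toy vocabulary, the function name, and the specific encodings (inverse rank for prominence, an unmentioned-word indicator for scale constraints, and inverse rank offsets for tag proximity) are illustrative assumptions, not the authors' implementation.

    import numpy as np

    # Toy vocabulary; assumed for illustration only.
    VOCABULARY = ["person", "car", "dog", "table", "chair"]

    def implicit_tag_features(tags, target, vocab=VOCABULARY):
        """Encode an ordered tag list as three implicit cues for `target`.

        tags   : tag words in the order the annotator listed them
        target : the object category whose position/scale we want to predict
        """
        # Cue 1: relative prominence -- earlier mention hints at a larger,
        # more central object; encoded here as the inverse rank of the target.
        rank = tags.index(target) + 1 if target in tags else 0
        prominence = 1.0 / rank if rank else 0.0

        # Cue 2: scale constraints from unnamed objects -- an indicator of
        # which vocabulary words were *not* mentioned at all.
        unmentioned = np.array([0.0 if w in tags else 1.0 for w in vocab])

        # Cue 3: loose spatial links -- nearness of two names on the list
        # hints at nearness in the image; encoded as inverse rank offsets.
        proximity = np.zeros(len(vocab))
        if target in tags:
            for i, w in enumerate(vocab):
                if w in tags and w != target:
                    proximity[i] = 1.0 / (1.0 + abs(tags.index(w) - tags.index(target)))

        return np.concatenate([[prominence], unmentioned, proximity])

    # Example: the ordered list "person, dog, table", asking about "dog".
    features = implicit_tag_features(["person", "dog", "table"], "dog")
    print(features.shape)  # (11,) = 1 prominence + 5 unmentioned + 5 proximity

In the paper, features of this kind condition a learned density over window position and scale, so a detector can evaluate the most promising windows first instead of scanning the image exhaustively.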

Index Terms:
social networking (online), feature extraction, object detection, object localization, object recognition, implicit cues, implicit features, image tags, object relative prominence, scale constraints, loose spatial links, conditional density, localization parameters, localization density, semantic space, visual-based features, tag-based features, PASCAL VOC image data sets, LabelMe image data sets, Flickr image data sets, sliding windows, visual context baseline, visualization, semantics, detectors, correlation, training, context
Citation:
Sung Ju Hwang, K. Grauman, "Reading between the Lines: Object Localization Using Implicit Cues from Image Tags," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 6, pp. 1145-1158, June 2012, doi:10.1109/TPAMI.2011.190