Using Language to Learn Structured Appearance Models for Image Annotation
January 2010 (vol. 32, no. 1)
pp. 148-164
Michael Jamieson, University of Toronto, Toronto
Afsaneh Fazly, University of Toronto, Toronto
Suzanne Stevenson, University of Toronto, Toronto
Sven Dickinson, University of Toronto, Toronto
Sven Wachsmuth, Bielefeld University, Bielefeld
Given an unstructured collection of captioned images of cluttered scenes featuring a variety of objects, our goal is to simultaneously learn the names and appearances of the objects. Only a small fraction of local features within any given image are associated with a particular caption word, and captions may contain irrelevant words not associated with any image object. We propose a novel algorithm that uses the repetition of feature neighborhoods across training images and a measure of correspondence with caption words to learn meaningful feature configurations (representing named objects). We also introduce a graph-based appearance model that captures some of the structure of an object by encoding the spatial relationships among the local visual features. In an iterative procedure, we use language (the words) to drive a perceptual grouping process that assembles an appearance model for a named object. Results of applying our method to three data sets in a variety of conditions demonstrate that, from complex, cluttered, real-world scenes with noisy captions, we can learn both the names and appearances of objects, resulting in a set of models invariant to translation, scale, orientation, occlusion, and minor changes in viewpoint or articulation. These named models, in turn, are used to automatically annotate new, uncaptioned images, thereby facilitating keyword-based image retrieval.
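
The abstract outlines an iterative procedure in which caption words guide the perceptual grouping of local features into object models. The following is a minimal, self-contained Python sketch of that word-driven grouping loop, not the authors' implementation: the integer feature ids, the toy images, the neighbors structure, and the simple correspondence score are all hypothetical simplifications (the paper's actual models are graphs encoding spatial relationships among features, scored with a probabilistic word-model correspondence measure).

from collections import defaultdict

def correspondence(model, word, images):
    # Hypothetical stand-in for the paper's correspondence measure: reward
    # images where the model's features and the word co-occur, penalize
    # images where one appears without the other.
    score = 0
    for feats, caption in images:
        detected = model <= feats      # all model features present?
        named = word in caption
        if detected and named:
            score += 1
        elif detected != named:
            score -= 1
    return score / len(images)

def learn_model(word, images, neighbors, max_size=5):
    # Grow an appearance model for `word` (here just a set of feature ids;
    # the paper uses a spatial graph) by greedily adding the neighboring
    # feature that most improves correspondence with the word.
    all_feats = set().union(*(f for f, _ in images))
    seed = max(all_feats, key=lambda f: correspondence({f}, word, images))
    model = {seed}
    while len(model) < max_size:
        candidates = set().union(*(neighbors[f] for f in model)) - model
        if not candidates:
            break
        best = max(candidates,
                   key=lambda f: correspondence(model | {f}, word, images))
        if correspondence(model | {best}, word, images) <= correspondence(model, word, images):
            break                      # no neighbor improves the score
        model.add(best)
    return model

# Toy usage: each image is (set of local feature ids, set of caption words);
# `neighbors` records which features occur in each other's spatial
# neighborhoods across the training images.
images = [
    ({1, 2},    {"cup"}),
    ({1, 3},    {"table"}),
    ({2, 4},    {"lamp"}),
    ({1, 2, 5}, {"cup"}),
]
neighbors = defaultdict(set, {1: {2}, 2: {1}})
print(learn_model("cup", images, neighbors))   # -> {1, 2}

In this toy run, neither feature 1 nor feature 2 alone predicts "cup" reliably, but their configuration does, so the loop grows the model from a single seed feature to the pair, mirroring how word-feature correspondence drives the grouping in the full method.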

Index Terms:
Language-vision integration, image annotation, perceptual grouping, appearance models, object recognition.
Citation:
Michael Jamieson, Afsaneh Fazly, Suzanne Stevenson, Sven Dickinson, Sven Wachsmuth, "Using Language to Learn Structured Appearance Models for Image Annotation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 148-164, Jan. 2010, doi:10.1109/TPAMI.2008.283