A Thousand Words in a Scene
September 2007 (vol. 29 no. 9)
pp. 1575-1589
This paper presents a novel approach for visual scene modeling and classification, investigating the combined use of text modeling methods and local invariant features. Our work attempts to elucidate (1) whether a text-like bag-of-visterms representation (histogram of quantized local visual features) is suitable for scene (rather than object) classification, (2) whether some analogies between discrete scene representations and text documents exist, and (3) whether unsupervised, latent space models can be used both as feature extractors for the classification task and to discover patterns of visual co-occurrence. Using several data sets, we validate our approach, presenting and discussing experiments on each of these issues. We first show, with extensive experiments on binary and multi-class scene classification tasks using a 9500-image data set, that the bag-of-visterms representation consistently outperforms classical scene classification approaches. On other data sets, we show that our approach competes with or outperforms other recent, more complex methods. We also show that Probabilistic Latent Semantic Analysis (PLSA) generates a compact scene representation that is discriminative enough for accurate classification and more robust than the bag-of-visterms representation when less labeled training data is available. Finally, through aspect-based image ranking experiments, we show the ability of PLSA to automatically extract visually meaningful scene patterns, making such a representation useful for browsing image collections.
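The two representations named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the nearest-center quantization, the toy vocabulary, and the EM settings (`n_iter`, `seed`) are all assumptions made for illustration; the abstract does not specify the feature detector, the descriptor, or the vocabulary size. The first function builds a bag-of-visterms histogram by assigning each local descriptor to its nearest vocabulary entry. The second fits PLSA with plain EM on a matrix of such histograms and returns the per-image aspect distribution P(z|d), which is the kind of compact representation the abstract refers to.

```python
import numpy as np


def bag_of_visterms(descriptors, vocabulary):
    """Quantize local descriptors against a vocabulary and count visterms.

    descriptors: (n, d) array of local feature vectors from one image.
    vocabulary:  (k, d) array of cluster centers (assumed given, e.g. from k-means).
    Returns a length-k histogram of visterm counts.
    """
    # Squared Euclidean distance from every descriptor to every center.
    diffs = descriptors[:, None, :] - vocabulary[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Each descriptor is replaced by the index of its nearest center.
    labels = np.argmin(dists, axis=1)
    return np.bincount(labels, minlength=vocabulary.shape[0])


def plsa(counts, n_aspects, n_iter=50, seed=0):
    """Fit PLSA by EM on a (documents x visterms) count matrix.

    Returns P(z|d) of shape (n_docs, n_aspects)
    and P(w|z) of shape (n_aspects, n_words).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random, row-normalized initialization (an illustrative choice).
    p_z_d = rng.random((n_docs, n_aspects))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_aspects, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]  # (docs, aspects, words)
        post = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
        # M-step: re-estimate both factors from expected counts n(d,w) * P(z|d,w).
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= np.maximum(p_w_z.sum(axis=1, keepdims=True), 1e-12)
        p_z_d = expected.sum(axis=2)
        p_z_d /= np.maximum(p_z_d.sum(axis=1, keepdims=True), 1e-12)
    return p_z_d, p_w_z


# Toy usage: a two-visterm vocabulary and three descriptors from one image.
toy_vocabulary = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_descriptors = np.array([[0.0, 1.0], [9.0, 9.0], [10.0, 11.0]])
toy_histogram = bag_of_visterms(toy_descriptors, toy_vocabulary)

# Toy usage: four "images" described by counts over three visterms.
toy_counts = np.array([[5, 0, 1], [4, 1, 0], [0, 6, 2], [1, 5, 3]])
toy_p_z_d, toy_p_w_z = plsa(toy_counts, n_aspects=2)
```

In this sketch, each row of `toy_p_z_d` could serve as a low-dimensional feature vector for a classifier, while each row of `toy_p_w_z` describes which visterms tend to co-occur under one aspect.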
Index Terms:
Image representation, scene classification, object recognition, quantized local descriptors, latent aspect modeling
Citation:
Pedro Quelhas, Florent Monay, Jean-Marc Odobez, Daniel Gatica-Perez, Tinne Tuytelaars, "A Thousand Words in a Scene," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 9, pp. 1575-1589, Sept. 2007, doi:10.1109/TPAMI.2007.1155