Weakly Supervised Learning of Interactions between Humans and Objects
March 2012 (vol. 34 no. 3)
pp. 601-614
C. Schmid, LEAR team, INRIA Rhône-Alpes, St. Ismier, France
A. Prest, Comput. Vision Lab., ETH Zurich, Zurich, Switzerland
V. Ferrari, Comput. Vision Lab., ETH Zurich, Zurich, Switzerland
We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: We first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated only with the action label. Our approach relies on a human detector to initialize the model learning. For robustness to various degrees of visibility, we build a detector that learns to combine a set of existing part detectors. Starting from humans detected in a set of images depicting the action, our approach determines the action object and its spatial relation to the human. Its final output is a probabilistic model of the human-object interaction, i.e., the spatial relation between the human and the object. We present an extensive experimental evaluation on the sports action data set from [1], the PASCAL Action 2010 data set [2], and a new human-object interaction data set.
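As a rough illustration of what a probabilistic model of the human-object spatial relation could look like, the sketch below encodes each detected human-object pair as a small feature vector and fits a diagonal Gaussian to those vectors. Everything in it is an assumption made for illustration: the abstract does not specify the features or the density model, and the names `relative_offset`, `fit_gaussian`, and `log_density` are hypothetical.

```python
import math
from statistics import mean, pvariance

# Boxes are (x_min, y_min, x_max, y_max) in pixels. This layout is an assumption.

def relative_offset(human, obj):
    """Describe the object's position and size relative to the human box.

    Returns (dx, dy, log_scale): the offset between box centres measured in
    units of human height, and the log ratio of object area to human area.
    These three features are illustrative, not the features of the paper.
    """
    hx, hy = (human[0] + human[2]) / 2, (human[1] + human[3]) / 2
    ox, oy = (obj[0] + obj[2]) / 2, (obj[1] + obj[3]) / 2
    human_height = human[3] - human[1]
    human_area = (human[2] - human[0]) * human_height
    obj_area = (obj[2] - obj[0]) * (obj[3] - obj[1])
    return ((ox - hx) / human_height,
            (oy - hy) / human_height,
            math.log(obj_area / human_area))

def fit_gaussian(samples, eps=1e-6):
    """Fit an independent (diagonal) Gaussian to a list of feature vectors."""
    columns = list(zip(*samples))
    means = [mean(col) for col in columns]
    # eps keeps the variance positive when a feature is constant.
    variances = [pvariance(col) + eps for col in columns]
    return means, variances

def log_density(model, x):
    """Log-likelihood of feature vector x under the diagonal Gaussian."""
    means, variances = model
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, means, variances))

# Toy training pairs for one action class (made-up boxes).
pairs = [
    ((0, 0, 100, 200), (150, 50, 170, 90)),
    ((10, 10, 110, 210), (155, 65, 180, 100)),
    ((0, 0, 80, 160), (120, 40, 140, 70)),
]
model = fit_gaussian([relative_offset(h, o) for h, o in pairs])
```

A new human-object pair can then be scored with `log_density(model, relative_offset(human, obj))`; configurations close to those seen in training receive a higher score than distant ones.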

[1] A. Gupta, A. Kembhavi, and L. Davis, “Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1775-1789, Oct. 2009.
[2] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results,” http://www.pascal-network.org/challenges/VOC/voc2010/workshopindex.html, 2010.
[3] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing Human Actions: A Local SVM Approach,” Proc. 17th Int'l Conf. Pattern Recognition, 2004.
[4] I. Laptev and P. Perez, “Retrieving Actions in Movies,” Proc. 11th IEEE Int'l Conf. Computer Vision, 2007.
[5] K. Mikolajczyk and H. Uemura, “Action Recognition with Motion-Appearance Vocabulary Forest,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[6] J. Sullivan and S. Carlsson, “Recognizing and Tracking Human Action,” Proc. Seventh European Conf. Computer Vision, 2002.
[7] N. Ikizler-Cinbis, G. Cinbis, and S. Sclaroff, “Learning Actions from the Web,” Proc. 12th IEEE Int'l Conf. Computer Vision, 2009.
[8] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior Recognition via Sparse Spatio-Temporal Features,” Proc. Second IEEE Joint Int'l Workshop Visual Surveillance and Performance Evaluation of Tracking and Surveillance, 2005.
[9] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, “Learning Realistic Human Actions from Movies,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[10] G. Willems, J.H. Becker, T. Tuytelaars, and L. van Gool, “Exemplar-Based Action Recognition in Video,” Proc. British Machine Vision Conf., 2009.
[11] C. Thurau and V. Hlavac, “Pose Primitive Based Human Action Recognition in Videos or Still Images,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[12] B. Yao and L. Fei-Fei, “Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[13] B. Yao and L. Fei-Fei, “Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[14] C. Desai, D. Ramanan, and C. Fowlkes, “Discriminative Models for Static Human-Object Interactions,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition Workshops, 2010.
[15] C. Desai, D. Ramanan, and C. Fowlkes, “Discriminative Models for Multi-Class Object Layout,” Proc. 12th IEEE Int'l Conf. Computer Vision, 2009.
[16] A. Gupta, T. Chen, F. Chen, D. Kimber, and L. Davis, “Context and Observation Driven Latent Variable Model for Human Pose Estimation,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[17] N. Ikizler-Cinbis and S. Sclaroff, “Object, Scene and Actions: Combining Multiple Features for Human Action Recognition,” Proc. 11th European Conf. Computer Vision, 2010.
[18] R. Fergus, P. Perona, and A. Zisserman, “Object Class Recognition by Unsupervised Scale-Invariant Learning,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2003.
[19] J. Winn, A. Criminisi, and T. Minka, “Object Categorization by Learned Universal Visual Dictionary,” Proc. 10th IEEE Int'l Conf. Computer Vision, 2005.
[20] T. Deselaers, B. Alexe, and V. Ferrari, “Localizing Objects While Learning Their Appearance,” Proc. 11th European Conf. Computer Vision, 2010.
[21] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, and D. Ramanan, “Object Detection with Discriminatively Trained Part Based Models,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, Sept. 2010.
[22] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” http://www.pascal-network.org/challenges/VOC/voc2007/workshopindex.html, 2007.
[23] V. Ferrari, M. Marin-Jimenez, and A. Zisserman, “Progressive Search Space Reduction for Human Pose Estimation,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[24] Y. Rodriguez, “Face Detection and Verification Using Local Binary Patterns,” PhD thesis, EPF Lausanne, 2006.
[25] P. Viola and M. Jones, “Rapid Object Detection Using a Boosted Cascade of Simple Features,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2001.
[26] G. Heusch, Y. Rodriguez, and S. Marcel, “Local Binary Patterns As an Image Preprocessing for Face Authentication,” Proc. Seventh IEEE Int'l Conf. Automatic Face and Gesture Recognition, 2006.
[27] D. Comaniciu, V. Ramesh, and P. Meer, “The Variable Bandwidth Mean Shift and Data-Driven Scale Selection,” Proc. Eighth IEEE Int'l Conf. Computer Vision, 2001.
[28] M. Eichner and V. Ferrari, “Better Appearance Models for Pictorial Structures,” Proc. British Machine Vision Conf., 2009.
[29] R. Fergus and P. Perona, “Caltech Object Category Datasets,” http://www.vision.caltech.edu/html-files/archive.html, 2003.
[30] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2008 (VOC2008) Results,” http://www.pascal-network.org/challenges/VOC/voc2008/workshopindex.html, 2008.
[31] B. Alexe, T. Deselaers, and V. Ferrari, “What Is an Object?” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[32] V. Kolmogorov, “Convergent Tree-Reweighted Message Passing for Energy Minimization,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1568-1583, Oct. 2006.
[33] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool, “SURF: Speeded Up Robust Features,” Computer Vision and Image Understanding, vol. 110, pp. 346-359, 2008.
[34] Z. Botev, “Nonparametric Density Estimation via Diffusion Mixing,” The Univ. of Queensland, Postgraduate Series, Nov. 2007.
[35] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local Features and Kernels for Classification of Texture and Object Categories: A Comprehensive Study,” Int'l J. Computer Vision, vol. 73, pp. 213-238, 2007.
[36] A. Oliva and A. Torralba, “Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope,” Int'l J. Computer Vision, vol. 42, pp. 145-175, 2001.
[37] L.J. Li and L. Fei-Fei, “What, Where and Who? Classifying Event by Scene and Object Recognition,” Proc. 11th IEEE Int'l Conf. Computer Vision, 2007.
[38] P. Gehler and S. Nowozin, “On Feature Combination for Multiclass Object Classification,” Proc. 12th IEEE Int'l Conf. Computer Vision, 2009.
[39] M. Grubinger, P.D. Clough, H. Müller, and T. Deselaers, “The IAPR Benchmark: A New Evaluation Resource for Visual Information Systems,” Proc. Int'l Conf. Language Resources and Evaluation, 2006.
[40] R. Johansson and P. Nugues, “Dependency-Based Syntactic-Semantic Analysis with Propbank and Nombank,” Proc. 12th Conf. Computational Natural Language Learning, 2008.

Index Terms:
probability, gesture recognition, learning (artificial intelligence), object detection, action recognition, weakly supervised learning, still images, model learning, probabilistic model, human-object interaction, Humans, Detectors, Training, Face, Context modeling, Computational modeling, Support vector machines
Citation:
C. Schmid, A. Prest, V. Ferrari, "Weakly Supervised Learning of Interactions between Humans and Objects," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 601-614, March 2012, doi:10.1109/TPAMI.2011.158