Issue No.09 - Sept. (2012 vol.34)
pp: 1691-1703
Bangpeng Yao, Comput. Sci. Dept., Stanford Univ., Stanford, CA, USA
Li Fei-Fei, Comput. Sci. Dept., Stanford Univ., Stanford, CA, USA
ABSTRACT
Detecting objects in cluttered scenes and estimating articulated human body parts from 2D images are two challenging problems in computer vision. The difficulty is particularly pronounced in activities involving human-object interactions (e.g., playing tennis), where the relevant objects tend to be small or only partially visible and the human body parts are often self-occluded. We observe, however, that objects and human poses can serve as mutual context for each other: recognizing one facilitates the recognition of the other. In this paper, we propose a mutual context model to jointly model objects and human poses in human-object interaction activities. In our approach, object detection provides a strong prior for better human pose estimation, while human pose estimation improves the accuracy of detecting the objects that interact with the human. On a six-class sports data set and a 24-class data set of people interacting with musical instruments, we show that our mutual context model outperforms the state of the art in detecting very difficult objects and estimating human poses, as well as in classifying human-object interaction activities.
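The idea of mutual context can be illustrated with a toy sketch. This is not the model proposed in the paper (which the index terms describe as a conditional random field); it is a minimal illustration under assumed values. The candidate lists, the scores, and the `compatibility` function are all hypothetical. The sketch shows how a pairwise term that rewards an object lying near the hand can overturn the choice that independent detectors would make.

```python
import math
from itertools import product

# Hypothetical candidates: (label, (x, y) location, unary detector score).
# All numbers are invented for illustration only.
object_candidates = [
    ("racket_a", (9.0, 9.0), 0.60),  # strongest alone, but far from any hand
    ("racket_b", (2.0, 1.0), 0.50),  # weaker alone, but next to a hand
]
pose_candidates = [
    ("pose_a", (5.0, 5.0), 0.55),    # location = estimated hand position
    ("pose_b", (2.0, 1.5), 0.50),
]


def compatibility(obj_xy, hand_xy, scale=2.0):
    """Assumed pairwise term: larger when the object is close to the hand."""
    return math.exp(-math.dist(obj_xy, hand_xy) / scale)


def independent_choice():
    """Pick the best object and the best pose separately, ignoring context."""
    best_obj = max(object_candidates, key=lambda c: c[2])
    best_pose = max(pose_candidates, key=lambda c: c[2])
    return best_obj[0], best_pose[0]


def joint_choice():
    """Pick the (object, pose) pair that maximizes unary + pairwise scores."""
    def score(pair):
        obj, pose = pair
        return obj[2] + pose[2] + compatibility(obj[1], pose[1])

    obj, pose = max(product(object_candidates, pose_candidates), key=score)
    return obj[0], pose[0]


if __name__ == "__main__":
    print("independent:", independent_choice())
    print("joint:      ", joint_choice())
```

With these invented numbers, the independent choice keeps the two highest-scoring candidates even though they are spatially inconsistent, while the joint choice prefers the weaker object and pose candidates that agree with each other.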
INDEX TERMS
Humans, Context, Estimation, Context modeling, Object detection, Biological system modeling, Sports equipment, conditional random field, Mutual context, action recognition, human pose estimation, object detection
CITATION
Bangpeng Yao, Li Fei-Fei, "Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 9, pp. 1691-1703, Sept. 2012, doi:10.1109/TPAMI.2012.67