Issue No. 4 - April 2013 (vol. 35)
pp. 835-848
A. Prest , Comput. Vision Lab., ETH Zurich, Zurich, Switzerland
V. Ferrari , IPAB Inst., Univ. of Edinburgh, Edinburgh, UK
C. Schmid , LEAR Team, INRIA Rhone-Alpes, St. Ismier, France
ABSTRACT
We introduce an approach for learning human actions as interactions between persons and objects in realistic videos. Previous work typically represents actions with low-level features such as image gradients or optical flow. In contrast, we explicitly localize in space and track over time both the object and the person, and represent an action as the trajectory of the object w.r.t. the person position. Our approach relies on state-of-the-art techniques for human detection [32], object detection [10], and tracking [39]. We show that this results in human and object tracks of sufficient quality to model and localize human-object interactions in realistic videos. Our human-object interaction features capture the relative trajectory of the object w.r.t. the human. Experimental results on the Coffee and Cigarettes dataset [25], the video dataset of [19], and the Rochester Daily Activities dataset [29] show that 1) our explicit human-object model is an informative cue for action recognition; 2) it is complementary to traditional low-level descriptors such as 3D-HOG [23] extracted over human tracks. We show that combining our human-object interaction features with 3D-HOG improves over their individual performance as well as over the state of the art [23], [29].
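The abstract's core idea of representing an action as the trajectory of an object relative to the tracked person can be sketched as follows. This is a hypothetical illustration, not the paper's exact descriptor: it assumes per-frame bounding boxes in (x, y, w, h) form for both tracks, and uses person height as the normalization factor.

```python
# Hypothetical sketch of a relative-trajectory feature (an assumption, not the
# paper's exact formulation): given per-frame person and object boxes, encode
# the object's position relative to the person, normalized by person size.

def box_center(box):
    """Center (cx, cy) of a box given as (x, y, w, h)."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def relative_trajectory(person_boxes, object_boxes):
    """Per-frame object offset w.r.t. the person, scale-normalized.

    person_boxes, object_boxes: equal-length lists of (x, y, w, h) boxes.
    Returns a list of (dx, dy) offsets in person-height units, so the
    feature is invariant to the person's image position and scale.
    """
    feats = []
    for p, o in zip(person_boxes, object_boxes):
        pcx, pcy = box_center(p)
        ocx, ocy = box_center(o)
        scale = p[3]  # person height as the normalization factor (an assumption)
        feats.append(((ocx - pcx) / scale, (ocy - pcy) / scale))
    return feats

# Example: a cup rising toward the head over three frames, the kind of
# relative motion that distinguishes "drinking" from other actions.
person = [(100, 50, 40, 120)] * 3
cup = [(130, 150, 15, 15), (128, 110, 15, 15), (126, 75, 15, 15)]
print(relative_trajectory(person, cup))
```

A classifier would then operate on such relative trajectories (possibly quantized or concatenated over a temporal window) rather than on raw image-level motion features.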
INDEX TERMS
Humans, Videos, Detectors, Training, Target tracking, Feature extraction, Video analysis, Action recognition, Human-object interaction
CITATION
A. Prest, V. Ferrari, C. Schmid, "Explicit Modeling of Human-Object Interactions in Realistic Videos", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.35, no. 4, pp. 835-848, April 2013, doi:10.1109/TPAMI.2012.175
REFERENCES
[1] A. Bobick and J. Davis, "The Recognition of Human Movement Using Temporal Templates," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, Mar. 2001.
[2] M. Breitenstein, F. Reichlin, and L. Van Gool, "Robust Tracking-by-Detection Using a Detector Confidence Particle Filter," Proc. IEEE Int'l Conf. Computer Vision, 2009.
[3] T. Brox and J. Malik, "Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 3, pp. 500-513, Mar. 2011.
[4] N. Dalal and B. Triggs, "Histogram of Oriented Gradients for Human Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[5] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior Recognition via Sparse Spatio-Temporal Features," Proc. IEEE Int'l Workshop Visual Surveillance and Performance Evaluation of Tracking Surveillance, 2005.
[6] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce, "Automatic Annotation of Human Actions in Video," Proc. IEEE Int'l Conf. Computer Vision, 2009.
[7] A. Efros, A. Berg, G. Mori, and J. Malik, "Recognizing Action at a Distance," Proc. IEEE Int'l Conf. Computer Vision, 2003.
[8] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results," http://www.pascal-network.org/challenges/VOC/voc2007/workshopindex.html, 2007.
[9] A. Fathi and G. Mori, "Action Recognition by Learning Mid-Level Motion Features," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[10] P.F. Felzenszwalb, R.B. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part Based Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, Sept. 2010.
[11] R. Fergus and P. Perona, "Caltech Object Category Datasets," http://www.vision.caltech.edu/html-files archive.html, 2003.
[12] V. Ferrari, M. Marin-Jimenez, and A. Zisserman, "Progressive Search Space Reduction for Human Pose Estimation," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[13] R. Filipovych and E. Ribeiro, "Recognizing Primitive Interactions by Exploring Actor-Object States," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[14] R. Filipovych and E. Ribeiro, "Robust Sequence Alignment for Actor-Object Interaction Recognition: Discovering Actor-Object States," Computer Vision and Image Understanding, vol. 115, pp. 177-193, 2011.
[15] A. Gaidon, Z. Harchaoui, and C. Schmid, "Actom Sequence Models for Efficient Action Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011.
[16] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as Space-Time Shapes," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247-2253, Dec. 2007.
[17] H. Grabner and H. Bischof, "On-Line Boosting and Vision," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[18] H. Grabner, C. Leistner, and H. Bischof, "Semi-Supervised On-Line Boosting for Robust Tracking," Proc. European Conf. Computer Vision, 2008.
[19] A. Gupta, A. Kembhavi, and L. Davis, "Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1775-1789, Oct. 2009.
[20] N. Ikizler and D.A. Forsyth, "Searching for Complex Human Activities with No Visual Examples," Int'l J. Computer Vision, vol. 80, pp. 337-357, 2008.
[21] N. Ikizler-Cinbis, G. Cinbis, and S. Sclaroff, "Learning Actions from the Web," Proc. IEEE Int'l Conf. Computer Vision, 2009.
[22] Z. Kalal, J. Matas, and K. Mikolajczyk, "P-N Learning: Bootstrapping Binary Classifiers by Structural Constraints," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[23] A. Kläser, M. Marszałek, C. Schmid, and A. Zisserman, "Human Focused Action Localization in Video," Proc. Int'l Workshop Sign, Gesture, and Activity in Conjunction with European Conf. Computer Vision, 2010.
[24] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld, "Learning Realistic Human Actions from Movies," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[25] I. Laptev and P. Perez, "Retrieving Actions in Movies," Proc. IEEE Int'l Conf. Computer Vision, 2007.
[26] J. Liu, J. Luo, and M. Shah, "Recognizing Realistic Actions from Videos in the Wild," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[27] S. Maji, A. Berg, and J. Malik, "Classification Using Intersection Kernel Support Vector Machines Is Efficient," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[28] P. Matikainen, M. Hebert, and R. Sukthankar, "Representing Pairwise Spatial and Temporal Relations for Action Recognition," Proc. European Conf. Computer Vision, 2010.
[29] R. Messing, C. Pal, and H. Kautz, "Activity Recognition Using the Velocity Histories of Tracked Keypoints," Proc. IEEE Int'l Conf. Computer Vision, 2009.
[30] K. Mikolajczyk and H. Uemura, "Action Recognition with Motion-Appearance Vocabulary Forest," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[31] J.C. Niebles, C.-W. Chen, and L. Fei-Fei, "Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification," Proc. European Conf. Computer Vision, 2010.
[32] A. Prest, C. Schmid, and V. Ferrari, "Weakly Supervised Learning of Interactions between Humans and Objects," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 601-614, Mar. 2012.
[33] D. Ramanan, D.A. Forsyth, and A. Zisserman, "Tracking People by Learning Their Appearance," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 65-81, Jan. 2007.
[34] M.D. Rodriguez, J. Ahmed, and M. Shah, "Action Mach: A Spatio-Temporal Maximum Average Correlation Height Filter for Action Recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[35] S. Satkin and M. Hebert, "Modeling the Temporal Extent of Actions," Proc. European Conf. Computer Vision, 2010.
[36] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing Human Actions: A Local SVM Approach," Proc. Int'l Conf. Pattern Recognition, 2004.
[37] J. Sivic, M. Everingham, and A. Zisserman, "Person Spotting: Video Shot Retrieval for Face Sets," Proc. Int'l Conf. Image and Video Retrieval, 2005.
[38] J. Sivic, M. Everingham, and A. Zisserman, ""Who Are You?"— Learning Person Specific Classifiers from Video," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[39] N. Sundaram, T. Brox, and K. Keutzer, "Dense Point Trajectories by GPU-Accelerated Large Displacement Optical Flow," Proc. European Conf. Computer Vision, 2010.
[40] G. Willems, J.H. Becker, T. Tuytelaars, and L. Van Gool, "Exemplar-Based Action Recognition in Video," Proc. British Machine Vision Conf., 2009.
[41] S. Wu, B.E. Moore, and M. Shah, "Chaotic Invariants of Lagrangian Particle Trajectories for Anomaly Detection in Crowded Scenes," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[42] Z. Wu, M. Betke, J. Wang, and V. Athitsos, "Tracking with Dynamic Hidden-State Shape Models," Proc. European Conf. Computer Vision, 2008.
[43] C. Yang, R. Duraiswami, and L. Davis, "Efficient Mean-Shift Tracking via a New Similarity Measure," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[44] B. Yao and L. Fei-Fei, "Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[45] B. Yao and L. Fei-Fei, "Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[46] A. Yilmaz and M. Shah, "Actions Sketch: A Novel Action Representation," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[47] L. Zelnik-Manor and M. Irani, "Event-Based Analysis of Video," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001.