Animated Pose Templates for Modeling and Detecting Human Actions
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, March 2014, pp. 436-452
Benjamin Z. Yao , Beijing Univ. of Posts & Telecommun., Beijing, China
Bruce X. Nie , Stat. Dept., Univ. of California at Los Angeles, Los Angeles, CA, USA
Zicheng Liu , Microsoft Res., Redmond, WA, USA
Song-Chun Zhu , Stat. Dept., Univ. of California at Los Angeles, Los Angeles, CA, USA
ABSTRACT
This paper presents animated pose templates (APTs) for detecting short-term, long-term, and contextual actions from cluttered scenes in videos. Each pose template consists of two components: 1) a shape template with deformable parts represented in an And-node, whose appearances are represented by Histogram of Oriented Gradients (HOG) features, and 2) a motion template specifying the motion of the parts by Histogram of Optical Flow (HOF) features. A shape template may have more than one motion template, represented by an Or-node. Therefore, each action is defined as a mixture (Or-node) of pose templates in an And-Or tree structure. While this pose template is suitable for detecting short-term action snippets in two to five frames, we extend it in two ways: 1) for long-term actions, we animate the pose templates by adding temporal constraints in a Hidden Markov Model (HMM), and 2) for contextual actions, we treat contextual objects as additional parts of the pose templates and add constraints that encode spatial correlations between parts. To train the model, we manually annotate part locations on several keyframes of each video and cluster them into pose templates using EM. This leaves the unknown parameters for our learning algorithm in two groups: 1) latent variables for the unannotated frames, including pose IDs and part locations, and 2) model parameters shared by all training samples, such as weights for the HOG and HOF features, canonical part locations of each pose, and coefficients penalizing pose transitions and part deformations. To learn these parameters, we introduce a semi-supervised structural SVM algorithm that iterates between two steps: 1) learning (updating) model parameters from the labeled data by solving a structural SVM optimization, and 2) imputing the missing variables (i.e., detecting actions on unlabeled frames) with the parameters learned in the previous step and progressively accepting high-scoring frames as newly labeled examples.
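As a rough illustration of the scoring structure just described, the sketch below evaluates one pose template as a sum of HOG appearance terms, HOF motion terms, and quadratic deformation penalties on each part's displacement from its canonical location, with an Or-node taking the best of several motion templates. All names, array shapes, and the toy linear form are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def pose_template_score(w_hog, w_hof, anchors, defo, hog_feats, hof_feats, positions):
    """Score one pose template (illustrative sketch): per-part HOG
    appearance score + HOF motion score, minus a quadratic penalty on
    the displacement of each part from its canonical (anchor) location."""
    score = 0.0
    for i, (px, py) in enumerate(positions):
        score += w_hog[i] @ hog_feats[i]           # shape (appearance) term
        score += w_hof[i] @ hof_feats[i]           # motion term
        dx, dy = px - anchors[i][0], py - anchors[i][1]
        score -= defo[i][0] * dx * dx + defo[i][1] * dy * dy  # deformation penalty
    return score

def action_score(motion_templates, hog_feats, hof_feats, positions):
    """Or-node: a shape template may carry several motion templates;
    the action takes the best-scoring one."""
    return max(pose_template_score(*mt, hog_feats, hof_feats, positions)
               for mt in motion_templates)
```

A full detector would additionally maximize over part positions (the deformation term makes that a distance-transform-style dynamic program), which is omitted here for brevity.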
This algorithm belongs to the family of optimization methods known as the Concave-Convex Procedure (CCCP), which converges to a locally optimal solution. The inference algorithm consists of two components: 1) detecting top candidates for the pose templates, and 2) computing the optimal sequence of pose templates. Both are done by dynamic programming or, more precisely, beam search. In experiments, we demonstrate that this method is capable of discovering salient poses of actions as well as interactions with contextual objects. We test our method on several public action data sets and on a challenging outdoor contextual action data set that we collected ourselves. The results show that our model achieves performance comparable to or better than state-of-the-art methods.
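The beam-search decoding of a pose-template sequence can be sketched roughly as follows; `emission[t][k]` (the detection score of pose template k at frame t) and the `transition` penalty matrix are hypothetical stand-ins for the model's actual per-frame scores and pose-transition coefficients.

```python
def decode_pose_sequence(emission, transition, beam_width=3):
    """Dynamic programming with a beam: at each frame keep only the
    beam_width best partial sequences of pose-template IDs, extending
    each candidate by every pose and subtracting the transition penalty.
    Returns the best (total score, pose-ID sequence) pair."""
    T, K = len(emission), len(emission[0])
    # Beam entries: (accumulated score, sequence of pose IDs so far).
    beam = sorted(((emission[0][k], [k]) for k in range(K)), reverse=True)
    beam = beam[:beam_width]
    for t in range(1, T):
        candidates = []
        for score, seq in beam:
            for k in range(K):
                s = score + emission[t][k] - transition[seq[-1]][k]
                candidates.append((s, seq + [k]))
        candidates.sort(reverse=True)   # keep the beam_width best extensions
        beam = candidates[:beam_width]
    return beam[0]
```

With `beam_width >= K` this reduces to exact Viterbi decoding of the HMM; a smaller beam trades exactness for speed, which matters when each emission score itself requires a part-based detection pass.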
INDEX TERMS
Hidden Markov models, videos, shape, optical imaging, support vector machines, feature extraction, complexity theory, animated pose templates, action detection, action recognition, structural SVM
CITATION
Benjamin Z. Yao, Bruce X. Nie, Zicheng Liu, and Song-Chun Zhu, "Animated Pose Templates for Modeling and Detecting Human Actions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 436-452, March 2014, doi:10.1109/TPAMI.2013.144