Recognizing Human Actions by Learning and Matching Shape-Motion Prototype Trees
March 2012 (vol. 34, no. 3)
pp. 533-547
Zhe Lin, Advanced Technology Labs, Adobe Systems Inc., San Jose, CA, USA
Zhuolin Jiang, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA
Larry S. Davis, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA
A shape-motion prototype-based approach is introduced for action recognition. The approach represents an action as a sequence of prototypes for efficient and flexible action matching in long video sequences. During training, an action prototype tree is learned in a joint shape and motion space via hierarchical K-means clustering, and each training sequence is represented as a labeled prototype sequence; a look-up table of prototype-to-prototype distances is then generated. During testing, based on a joint probability model of the actor location and action prototype, the actor is tracked while a frame-to-prototype correspondence is established by maximizing the joint probability, which is performed efficiently by searching the learned prototype tree; actions are then recognized using dynamic prototype sequence matching. Distance measures used for sequence matching are obtained rapidly by look-up table indexing, which is an order of magnitude faster than brute-force computation of frame-to-frame distances. Our approach enables robust action matching in challenging situations (such as moving cameras and dynamic backgrounds) and allows automatic alignment of action sequences. Experimental results demonstrate that our approach achieves recognition rates of 92.86 percent on a large gesture data set (with dynamic backgrounds), 100 percent on the Weizmann action data set, 95.77 percent on the KTH action data set, 88 percent on the UCF sports data set, and 87.27 percent on the CMU action data set.
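For readers who want a concrete picture of the pipeline the abstract describes, below is a minimal Python sketch, not the authors' implementation, of three of its ingredients: hierarchical K-means prototype learning, the precomputed prototype-to-prototype distance look-up table, and dynamic-time-warping sequence matching whose costs come from table indexing. All function names, the 16-dimensional random stand-in descriptors, the branching/depth parameters, and the brute-force nearest-prototype labeling (the paper instead searches the learned tree) are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means (Lloyd's algorithm) on the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign each descriptor to its nearest center.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def build_prototypes(X, branching=3, depth=2):
    """Hierarchical K-means: recursively split the descriptors and return
    the leaf centers, which play the role of action prototypes."""
    if depth == 0 or len(X) <= branching:
        return [X.mean(axis=0)]
    _, labels = kmeans(X, branching)
    leaves = []
    for j in range(branching):
        members = X[labels == j]
        if len(members) > 0:
            leaves += build_prototypes(members, branching, depth - 1)
    return leaves

def quantize(frames, prototypes):
    """Label each frame descriptor with its nearest prototype index
    (brute force here for brevity; the paper searches the prototype tree)."""
    d = np.linalg.norm(frames[:, None, :] - prototypes[None, :, :], axis=-1)
    return np.argmin(d, axis=1)

def dtw(a, b, lut):
    """Dynamic time warping between two prototype-label sequences, with
    every frame-pair cost read from the precomputed look-up table."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = lut[a[i - 1], b[j - 1]]
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy demo on random stand-ins for joint shape-motion descriptors.
rng = np.random.default_rng(1)
train = rng.normal(size=(300, 16))            # 300 training frames, 16-D features
prototypes = np.array(build_prototypes(train))
# Prototype-to-prototype distances are computed once; matching then costs
# only table look-ups instead of frame-to-frame distance evaluations.
lut = np.linalg.norm(prototypes[:, None, :] - prototypes[None, :, :], axis=-1)
query = quantize(rng.normal(size=(20, 16)), prototypes)
model = quantize(rng.normal(size=(25, 16)), prototypes)
print("DTW distance:", dtw(query, model, lut))
```

On this toy data the distances are meaningless, but the structure mirrors the abstract: clustering and the look-up table are built once at training time, and every subsequent sequence comparison reduces to integer indexing plus a standard DTW recursion.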

Index Terms:
video signal processing, image matching, image recognition, image sequences, learning (artificial intelligence), pattern clustering, table lookup, human action recognition, shape-motion prototype-based approach, flexible action matching, video sequences, joint shape and motion space, hierarchical K-means clustering, training sequences, prototype-to-prototype distances, joint probability model, actor location, action prototype, frame-to-prototype correspondence, dynamic prototype sequence matching, distance measures, look-up table indexing, brute-force computation, frame-to-frame distances, moving cameras, dynamic backgrounds, large gesture data set, Weizmann action data set, KTH action data set, UCF sports data set, CMU action data set, prototypes, shape, feature extraction, humans, hidden Markov models, joints, training, dynamic time warping, action recognition, shape-motion prototype tree, joint probability
Citation:
Zhe Lin, Zhuolin Jiang, Larry S. Davis, "Recognizing Human Actions by Learning and Matching Shape-Motion Prototype Trees," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 533-547, March 2012, doi:10.1109/TPAMI.2011.147