Learning Sparse Representations for Human Action Recognition
Aug. 2012 (vol. 34, no. 8)
pp. 1576-1588
T. Guha, Dept. of Electr. & Comput. Eng., Univ. of British Columbia, Vancouver, BC, Canada
R. K. Ward, Dept. of Electr. & Comput. Eng., Univ. of British Columbia, Vancouver, BC, Canada
This paper explores the effectiveness of sparse representations obtained by learning a set of overcomplete bases (a dictionary) in the context of action recognition in videos. Although this work concentrates on recognizing human movements (physical actions as well as facial expressions), the proposed approach is fairly general and can be used to address other classification problems. To model human actions, three overcomplete dictionary learning frameworks are investigated. An overcomplete dictionary is constructed from a set of spatio-temporal descriptors (extracted from the video sequences) such that each descriptor is represented by a linear combination of a small number of dictionary elements. This leads to a more compact and richer representation of the video sequences than existing methods that rely on clustering and vector quantization. For each framework, a novel classification algorithm is proposed. This work also introduces a new local spatio-temporal feature that is distinctive, scale invariant, and fast to compute. The proposed approach consistently achieves state-of-the-art results on several public data sets containing various physical actions and facial expressions.
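The pipeline the abstract describes, learning an overcomplete dictionary from spatio-temporal descriptors and sparse-coding each descriptor with only a few atoms, can be illustrated with a minimal sketch. This is not the authors' implementation: descriptor extraction is assumed to happen elsewhere, scikit-learn's MiniBatchDictionaryLearning with orthogonal matching pursuit stands in for the paper's three dictionary learning frameworks, and max-pooling followed by a linear SVM stands in for the framework-specific classifiers; all names and parameter values below are illustrative.

```python
# Minimal sketch (assumed components, not the paper's code): learn an
# overcomplete dictionary from spatio-temporal descriptors, sparse-code each
# descriptor via OMP, pool the codes per video, and classify the pooled codes.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.svm import LinearSVC

def pool_sparse_codes(codes):
    """Max-pool absolute sparse coefficients over one video's descriptors."""
    return np.abs(codes).max(axis=0)

def train(X_train, y_train, n_atoms=512, sparsity=5):
    # X_train: list of (n_descriptors_i, d) arrays, one per training video
    # y_train: one class label per video (prepared beforehand)
    all_desc = np.vstack(X_train)               # pool descriptors from all videos
    dico = MiniBatchDictionaryLearning(
        n_components=n_atoms,                   # overcomplete if n_atoms > d
        transform_algorithm="omp",              # sparse coding by orthogonal matching pursuit
        transform_n_nonzero_coefs=sparsity,     # few dictionary elements per descriptor
        batch_size=256,
    ).fit(all_desc)
    feats = np.array([pool_sparse_codes(dico.transform(x)) for x in X_train])
    clf = LinearSVC().fit(feats, y_train)       # illustrative classifier only
    return dico, clf

def predict(dico, clf, X_test):
    feats = np.array([pool_sparse_codes(dico.transform(x)) for x in X_test])
    return clf.predict(feats)
```

The key contrast with clustering or vector quantization is visible in the transform step: each descriptor is approximated by a linear combination of a handful of atoms rather than assigned to a single codeword, which yields the more compact and richer per-video representation the abstract claims.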

Index Terms:
video signal processing, dictionaries, face recognition, gesture recognition, image classification, image representation, image sequences, learning (artificial intelligence), pattern clustering, vector quantization, sparse representation, human action recognition, human movement recognition, physical action, facial expression, classification problem, human action model, overcomplete dictionary learning framework, spatio-temporal descriptor, video sequence representation, dictionary element, clustering, vectors, feature extraction, videos, detectors, video sequences, humans, spatio-temporal descriptors, action recognition, dictionary learning, expression recognition, overcomplete, orthogonal matching pursuit
Citation:
T. Guha, R. K. Ward, "Learning Sparse Representations for Human Action Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1576-1588, Aug. 2012, doi:10.1109/TPAMI.2011.253