Slow Feature Analysis for Human Action Recognition
March 2012 (vol. 34, no. 3)
pp. 436-450
Zhang Zhang, National Laboratory of Pattern Recognition, Institute of Automation, Beijing, China
Dacheng Tao, Centre for Quantum Computation & Intelligent Systems, University of Technology, Sydney, NSW, Australia
Slow Feature Analysis (SFA) extracts slowly varying features from a quickly varying input signal [1]. It has been successfully applied to modeling the visual receptive fields of cortical neurons. Substantial experimental evidence in neuroscience suggests that the temporal slowness principle is a general learning principle in visual perception. In this paper, we introduce the SFA framework to the problem of human action recognition by incorporating discriminative information into SFA learning and considering the spatial relationship of body parts. In particular, we consider four SFA learning strategies, namely the original unsupervised SFA (U-SFA), the supervised SFA (S-SFA), the discriminative SFA (D-SFA), and the spatial discriminative SFA (SD-SFA), to extract slow feature functions from a large number of training cuboids obtained by random sampling within motion boundaries. Afterward, to represent an action sequence, the squared first-order temporal derivatives are accumulated over all transformed cuboids into one feature vector, termed the Accumulated Squared Derivative (ASD) feature. The ASD feature encodes the statistical distribution of slow features in an action sequence. Finally, a linear support vector machine (SVM) is trained to classify actions represented by ASD features. We conduct extensive experiments, including two sets of control experiments, two sets of large-scale experiments on the KTH and Weizmann databases, and two sets of experiments on the CASIA and UT-Interaction databases, to demonstrate the effectiveness of SFA for human action recognition. Experimental results suggest that the SFA-based approach (1) extracts useful motion patterns and improves recognition performance, (2) requires fewer intermediate processing steps while achieving comparable or even better performance, and (3) has good potential for recognizing complex multiperson activities.
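To make the pipeline concrete, below is a minimal Python sketch of the two core steps described in the abstract: linear SFA learning and a simplified ASD representation fed to a linear SVM. This is an illustration under stated assumptions, not the authors' implementation: SFA is solved as a generalized eigenvalue problem on the signal and derivative covariances (following the formulation in [1]), each cuboid is treated as one vectorized sample ordered in time, and the function names (train_sfa, asd_feature) and toy data are hypothetical.

```python
# Minimal sketch (not the authors' code): linear SFA learned as a
# generalized eigenvalue problem, plus a simplified ASD feature.
import numpy as np
from scipy.linalg import eigh
from sklearn.svm import LinearSVC

def train_sfa(X, n_slow=10):
    """Learn linear slow feature functions from training data.

    X: (n_samples, n_dims) array of vectorized cuboids, ordered in time.
    Returns the data mean and a projection matrix whose columns are the
    n_slow slowest feature functions.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    dX = np.diff(Xc, axis=0)            # first-order temporal derivative
    C = np.cov(Xc, rowvar=False)        # covariance of the signal
    C_dot = np.cov(dX, rowvar=False)    # covariance of the derivative
    # Slowness: minimize derivative variance subject to unit signal
    # variance, i.e. the generalized eigenproblem C_dot w = lambda C w.
    _, W = eigh(C_dot, C)               # eigenvalues ascend, so the
    return mean, W[:, :n_slow]          # first columns are the slowest

def asd_feature(cuboids, mean, W):
    """Simplified Accumulated Squared Derivative (ASD) feature.

    cuboids: (n_cuboids, n_dims) time-ordered cuboids from one sequence.
    Squared first-order derivatives of the slow feature outputs are
    accumulated into a single vector (here across sampled cuboids; the
    paper accumulates them over all transformed cuboids).
    """
    Y = (cuboids - mean) @ W            # slow feature outputs
    dY = np.diff(Y, axis=0)
    return (dY ** 2).sum(axis=0)

# Hypothetical usage with toy data: ASD vectors feed a linear SVM.
rng = np.random.default_rng(0)
train_cuboids = rng.standard_normal((500, 64))   # stand-in for sampled cuboids
mean, W = train_sfa(train_cuboids)

sequences = [rng.standard_normal((30, 64)) for _ in range(20)]
labels = rng.integers(0, 2, size=20)             # two toy action classes
features = np.stack([asd_feature(s, mean, W) for s in sequences])
clf = LinearSVC().fit(features, labels)
```

The sketch covers only a linear, unsupervised variant (akin to U-SFA); the supervised and discriminative strategies in the paper differ in how training samples are grouped when the slow feature functions are learned.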

[1] L. Wiskott and T. Sejnowski, “Slow Feature Analysis: Unsupervised Learning of Invariances,” Neural Computation, vol. 14, no. 4, pp. 715-770, Apr. 2002.
[2] P. Berkes and L. Wiskott, “Slow Feature Analysis Yields a Rich Repertoire of Complex Cell Properties,” J. Vision, vol. 5, no. 6, pp. 579-602, June 2005.
[3] M. Franzius, H. Sprekeler, and L. Wiskott, “Slowness and Sparseness Lead to Place, Head-Direction, and Spatial-View Cells,” PLoS Computational Biology, vol. 3, no. 8, pp. 1605-1622, Aug. 2007.
[4] M. Franzius, N. Wilbert, and L. Wiskott, “Invariant Object Recognition with Slow Feature Analysis,” Proc. 18th Int'l Conf. Artificial Neural Networks, pp. 961-970, 2008.
[5] P. Turaga, R. Chellappa, V.S. Subrahmanian, and O. Udrea, “Machine Recognition of Human Activities: A Survey,” IEEE Trans. Circuits and Systems for Video Technology, vol. 18, no. 11, pp. 1473-1488, Nov. 2008.
[6] B.A. Olshausen and D.J. Field, “Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images,” Nature, vol. 381, pp. 607-609, June 1996.
[7] P.O. Hoyer, “Modeling Receptive Fields with Non-Negative Sparse Coding,” Computational Neuroscience: Trends in Research, E. De Schutter, ed., Elsevier, 2003.
[8] C. Chennubhotla and A. Jepson, “Sparse Coding in Practice,” Proc. Int'l Workshop Statistical and Computational Theories of Vision, 2001.
[9] L. Shang and D. Huang, “Image Denoising Using Non-Negative Sparse Coding Shrinkage Algorithm,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, vol. 1, pp. 1017-1022, 2005.
[10] B.J. Shastri and M.D. Levine, “Face Recognition Using Localized Features Based on Non-Negative Sparse Coding,” Machine Vision and Applications, vol. 18, no. 2, pp. 107-122, Apr. 2007.
[11] I. Laptev and T. Lindeberg, “Space-Time Interest Points,” Proc. IEEE Int'l Conf. Computer Vision, pp. 432-439, 2003.
[12] J.C. Niebles, H. Wang, and L. Fei-Fei, “Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words,” Int'l J. Computer Vision, vol. 79, no. 3, pp. 299-318, Sept. 2008.
[13] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, “Actions as Space-Time Shapes,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247-2253, Dec. 2007.
[14] D. Weinland and E. Boyer, “Action Recognition Using Exemplar-Based Embedding,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 1-7, 2008.
[15] A. Bobick and J. Davis, “The Recognition of Human Movement Using Temporal Templates,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, Mar. 2001.
[16] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, “Behavior Recognition via Sparse Spatio-Temporal Features,” Proc. IEEE Int'l Workshop Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65-72, 2005.
[17] A. Oikonomopoulos, I. Patras, and M. Pantic, “Human Action Recognition with Spatiotemporal Salient Points,” IEEE Trans. Systems, Man, and Cybernetics—Part B: Cybernetics, vol. 36, no. 3, pp. 710-719, June 2006.
[18] K. Rapantzikos, Y. Avrithis, and S. Kollias, “Dense Saliency-Based Spatiotemporal Feature Points for Action Recognition,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 1454-1461, 2009.
[19] W. Lee and H. Chen, “Histogram-Based Interest Point Detectors,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 1590-1596, 2009.
[20] Y. Ke, R. Sukthankar, and M. Hebert, “Efficient Visual Event Detection Using Volumetric Features,” Proc. IEEE Int'l Conf. Computer Vision, pp. 166-173, 2005.
[21] H. Wang, M.M. Ullah, A. Kläser, I. Laptev, and C. Schmid, “Evaluation of Local Spatio-Temporal Features for Action Recognition,” Proc. British Machine Vision Conf., 2009.
[22] I. Laptev and T. Lindeberg, “Local Descriptors for Spatio-Temporal Recognition,” Proc. ECCV Workshop Spatial Coherence for Visual Motion Analysis, 2004.
[23] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning Realistic Human Actions from Movies,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2008.
[24] P. Scovanner, S. Ali, and M. Shah, “A 3-Dimensional SIFT Descriptor and Its Application to Action Recognition,” Proc. ACM Int'l Conf. Multimedia, pp. 357-360, 2007.
[25] A. Kläser, M. Marszalek, and C. Schmid, “A Spatio-Temporal Descriptor Based on 3D-Gradients,” Proc. British Machine Vision Conf., 2008.
[26] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing Human Actions: A Local SVM Approach,” Proc. IEEE Int'l Conf. Pattern Recognition, vol. 3, pp. 32-36, 2004.
[27] Z. Zhang, Y. Hu, S. Chan, and L. Chia, “Motion Context: A New Representation for Human Action Recognition,” Proc. European Conf. Computer Vision, pp. 817-829, 2008.
[28] Y. Wang and G. Mori, “Human Action Recognition by Semi-Latent Topic Models,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1762-1774, Oct. 2009.
[29] J. Liu and M. Shah, “Learning Human Actions via Information Maximization,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2008.
[30] J. Liu, S. Ali, and M. Shah, “Recognizing Human Actions Using Multiple Features,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2008.
[31] S. Ali, A. Basharat, and M. Shah, “Chaotic Invariants for Human Action Recognition,” Proc. IEEE Int'l Conf. Computer Vision, 2007.
[32] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, “A Biologically Inspired System for Action Recognition,” Proc. IEEE Int'l Conf. Computer Vision, 2007.
[33] K. Schindler and L. Van Gool, “Action Snippets: How Many Frames Does Human Action Recognition Require?” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2008.
[34] M. Bregonzio, S. Gong, and T. Xiang, “Recognising Action as Clouds of Space-Time Interest Points,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2009.
[35] S.-F. Wong, T.-K. Kim, and R. Cipolla, “Learning Motion Categories Using Both Semantic and Structural Information,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2007.
[36] S. Savarese, A. DelPozo, J. Niebles, and L. Fei-Fei, “Spatial-Temporal Correlatons for Unsupervised Action Classification,” Proc. IEEE Workshop Motion and Video Computing, 2008.
[37] M.S. Ryoo and J.K. Aggarwal, “Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities,” Proc. IEEE Int'l Conf. Computer Vision, 2009.
[38] C.M. Bishop, Neural Networks for Pattern Recognition, second ed. Oxford Univ. Press, 1995.
[39] T. Kadir and M. Brady, “Scale Saliency: A Novel Approach to Salient Feature and Scale Selection,” Proc. Int'l Conf. Visual Information Eng., pp. 25-28, 2003.
[40] T. Hofmann, “Probabilistic Latent Semantic Indexing,” Proc. Ann. Int'l Conf. Research and Development in Information Retrieval, pp. 50-57, 1999.
[41] D.M. Blei, A.Y. Ng, and M.I. Jordan, “Latent Dirichlet Allocation,” J. Machine Learning Research, vol. 3, pp. 993-1022, Jan. 2003.
[42] M. Giese and T. Poggio, “Neural Mechanisms for the Recognition of Biological Movements and Action,” Nature Rev. Neuroscience, vol. 4, pp. 179-192, 2003.
[43] CASIA Action Database, http://www.cbsr.ia.ac.cn/english/Action%20Databases%20EN.asp, 2010.
[44] J. Wright, A. Ganesh, A. Yang, and Y. Ma, “Robust Face Recognition via Sparse Representation,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 2, pp. 210-227, Feb. 2009.
[45] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Discriminative Learned Dictionaries for Local Image Analysis,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2008.
[46] P. Berkes and L. Wiskott, “On the Analysis and Interpretation of Inhomogeneous Quadratic Forms as Receptive Fields,” Neural Computation, vol. 18, no. 8, pp. 1868-1895, Aug. 2006.
[47] N. Dalal and B. Triggs, “Histogram of Oriented Gradients for Human Detection,” Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 886-893, 2005.
[48] C. Chang and C. Lin, LIBSVM: A Library for Support Vector Machines, Software, http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[49] M.S. Ryoo and J.K. Aggarwal, An Overview of Contest on Semantic Description of Human Activities (SDHA), Data Set, http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html, 2010.

Index Terms:
visual databases, feature extraction, image coding, image motion analysis, image recognition, learning (artificial intelligence), statistical distributions, support vector machines, complex multiperson activity recognition, human action recognition performance, slowly varying feature analysis, visual receptive field, cortical neuron, temporal slowness principle, learning principle, visual perception, spatial relationship, original unsupervised SFA learning strategy, supervised SFA-based approach, spatial discriminative SFA, feature function extraction, motion boundary, accumulated squared derivative feature vector, ASD feature encoding, statistical distribution, action sequence, linear support vector machine, Weizmann database, UT-Interaction database, motion pattern, humans, visualization, neurons, vectors, spatiotemporal phenomena, pattern recognition, slow feature analysis, human action recognition
Citation:
Zhang Zhang, Dacheng Tao, "Slow Feature Analysis for Human Action Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 436-450, March 2012, doi:10.1109/TPAMI.2011.157