Issue No.01 - Jan. (2013 vol.35)
pp: 221-231
Shuiwang Ji , Dept. of Comput. Sci., Old Dominion Univ., Norfolk, VA, USA
Wei Xu , Facebook, Inc., Menlo Park, CA, USA
Ming Yang , NEC Labs. America, Inc., Cupertino, CA, USA
Kai Yu , Baidu Inc., Beijing, China
ABSTRACT
We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance, we propose regularizing the outputs with high-level features and combining the predictions of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.
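As a rough illustration of the 3D convolution idea described in the abstract, the following is a minimal sketch in plain Python, not the authors' implementation. It slides a small kernel over a stack of frames in time as well as space, so each output value depends on several adjacent frames. The function name `conv3d_valid`, the toy clip, and the temporal-difference kernel are all assumptions made for this example; bias terms, nonlinearities, and multiple channels are left out.

```python
def conv3d_valid(volume, kernel):
    """Valid 3D cross-correlation of a (T, H, W) volume with a (t, h, w) kernel.

    Both arguments are nested lists indexed as [frame][row][col].
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    t, h, w = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for z in range(T - t + 1):          # slide along the temporal axis
        plane = []
        for y in range(H - h + 1):      # slide along rows
            row = []
            for x in range(W - w + 1):  # slide along columns
                acc = 0.0
                for dz in range(t):
                    for dy in range(h):
                        for dx in range(w):
                            acc += kernel[dz][dy][dx] * volume[z + dz][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def make_clip():
    """Hypothetical toy clip: 3 frames of 4x4 pixels.

    A single bright pixel in row 1 moves one column to the right per frame.
    """
    clip = [[[0.0] * 4 for _ in range(4)] for _ in range(3)]
    for frame in range(3):
        clip[frame][1][frame] = 1.0
    return clip


# Assumed kernel for illustration: spans 2 frames and 1x1 pixels,
# computing (next frame) minus (current frame) at each location.
temporal_diff = [[[-1.0]], [[1.0]]]

response = conv3d_valid(make_clip(), temporal_diff)
```

With this kernel the response is nonzero only where the pixel leaves or arrives between consecutive frames, which is the kind of motion cue a purely 2D, per-frame convolution cannot produce. In practice a library routine such as a framework's 3D convolution layer would replace the nested loops.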
INDEX TERMS
Three dimensional displays, Solid modeling, Feature extraction, Computer architecture, Videos, Kernel, Computational modeling, Action recognition, Deep learning, Convolutional neural networks, 3D convolution, Model combination
CITATION
Shuiwang Ji, Wei Xu, Ming Yang, Kai Yu, "3D Convolutional Neural Networks for Human Action Recognition", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.35, no. 1, pp. 221-231, Jan. 2013, doi:10.1109/TPAMI.2012.59