Issue No. 10, October 2009 (vol. 31), pp. 1762-1774
Yang Wang , Simon Fraser University, Burnaby
Greg Mori , Simon Fraser University, Burnaby
ABSTRACT
We propose two new models for human action recognition from video sequences using topic models. Video sequences are represented by a novel "bag-of-words" representation, where each frame corresponds to a "word." Our models differ from previous latent topic models for visual recognition in two major aspects: first, the latent topics in our models directly correspond to class labels; second, some of the latent variables in previous topic models become observed in our case. Our models have several advantages over other latent topic models used in visual recognition. First, training is much easier due to the decoupling of the model parameters. Second, they alleviate the issue of choosing the appropriate number of latent topics. Third, they achieve much better performance by utilizing the information provided by the class labels in the training set. We present action classification results on five different data sets. Our results are either comparable to, or significantly better than, previously published results on these data sets.
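The "bag-of-words" representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the general idea under the assumption (common in this line of work) that each frame is reduced to a feature vector, quantized to its nearest entry in a learned codebook, and a video is then summarized as the histogram of its frames' codewords. All names and the toy codebook below are hypothetical.

```python
from collections import Counter

def nearest_codeword(feature, codebook):
    """Index of the codebook entry closest to `feature` (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda k: sum((f - c) ** 2
                                 for f, c in zip(feature, codebook[k])))

def video_to_bag(frame_features, codebook):
    """Map a sequence of per-frame features to a codeword histogram:
    the video's "bag-of-words", where each frame contributes one "word"."""
    return Counter(nearest_codeword(f, codebook) for f in frame_features)

# Toy example: 2-D frame features and a 2-entry codebook.
codebook = [(0.0, 0.0), (1.0, 1.0)]
frames = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (0.05, 0.1)]
bag = video_to_bag(frames, codebook)
# bag == Counter({0: 2, 1: 2})
```

A topic model is then fit on these histograms; in the semilatent variant described here, each topic is tied to an action class label during training rather than discovered in a fully unsupervised way.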
INDEX TERMS
Human action recognition, video analysis, bag-of-words, probabilistic graphical models, event and activity understanding
CITATION
Yang Wang, Greg Mori, "Human Action Recognition by Semilatent Topic Models", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.31, no. 10, pp. 1762-1774, October 2009, doi:10.1109/TPAMI.2009.43