Discriminative Video Pattern Search for Efficient Action Detection
September 2011 (vol. 33 no. 9)
pp. 1728-1743
Junsong Yuan, Nanyang Technological University, Singapore
Zicheng Liu, Microsoft Research, Redmond
Ying Wu, Northwestern University, Evanston
Actions are spatiotemporal patterns. Similar to sliding-window-based object detection, action detection finds the reoccurrences of such spatiotemporal patterns through pattern matching, while handling cluttered and dynamic backgrounds and other types of action variations. We address two critical issues in pattern matching-based action detection: 1) the intrapattern variations in actions, and 2) the computational efficiency of action pattern search in cluttered scenes. First, we propose a discriminative pattern matching criterion for action classification, called naive Bayes mutual information maximization (NBMIM). Each action is characterized by a collection of spatiotemporal invariant features, and we match it with an action class by measuring the mutual information between them. Based on this matching criterion, action detection amounts to localizing the subvolume in the volumetric video space that has the maximum mutual information toward a specific action class. A novel spatiotemporal branch-and-bound (STBB) search algorithm is designed to efficiently find the optimal solution. Our proposed action detection method does not rely on the results of human detection, tracking, or background subtraction. It handles action variations such as differences in performing speed and style, as well as scale changes. It is also insensitive to dynamic and cluttered backgrounds and even to partial occlusions. Cross-data-set experiments on action detection, using the KTH and CMU action data sets and a new MSR action data set, demonstrate the effectiveness and efficiency of the proposed multiclass multiple-instance action detection method.
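
To make the subvolume localization problem concrete, the following is a minimal sketch under stated assumptions, not the STBB algorithm of the paper. It assumes that every voxel of the video carries a signed score toward one action class and that the score of a subvolume is the sum of the scores inside it. The function name max_subvolume and the dense score array are hypothetical. The sketch enumerates every spatial window exhaustively and runs a one-dimensional maximum-subarray scan (Kadane's algorithm) over time, so it is a naive baseline of the kind that a branch-and-bound search is meant to avoid.

```python
import numpy as np

def max_subvolume(scores):
    """Naive baseline: best axis-aligned subvolume of a (T, H, W) score array.

    Assumes (hypothetically) one signed score per voxel and an additive
    subvolume score. Returns (best_sum, (t0, t1, y0, y1, x0, x1)) with
    inclusive bounds. This is NOT the STBB search from the paper.
    """
    T, H, W = scores.shape
    # Per-frame integral image: ii[t, y, x] = sum of scores[t, :y, :x].
    ii = np.zeros((T, H + 1, W + 1))
    ii[:, 1:, 1:] = scores.cumsum(axis=1).cumsum(axis=2)

    best_sum, best_box = -np.inf, None
    for y0 in range(H):
        for y1 in range(y0, H):
            for x0 in range(W):
                for x1 in range(x0, W):
                    # Sum of the spatial window in every frame, in O(T).
                    f = (ii[:, y1 + 1, x1 + 1] - ii[:, y0, x1 + 1]
                         - ii[:, y1 + 1, x0] + ii[:, y0, x0])
                    # Kadane's scan over time for this fixed window.
                    run, start = 0.0, 0
                    for t in range(T):
                        if run <= 0:
                            run, start = f[t], t  # restart the segment here
                        else:
                            run += f[t]           # extend the segment
                        if run > best_sum:
                            best_sum = run
                            best_box = (start, t, y0, y1, x0, x1)
    return best_sum, best_box

# Toy example: background voxels score -1, a 2x2x2 block scores +2.
scores = -np.ones((4, 3, 3))
scores[1:3, 0:2, 1:3] = 2.0
best_sum, best_box = max_subvolume(scores)
```

The exhaustive loop visits every one of the O(H^2 W^2) spatial windows, which is what makes this baseline impractical for real video and why an efficient search over the spatial windows matters.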

Index Terms:
Video pattern search, action detection, spatiotemporal branch-and-bound search.
Citation:
Junsong Yuan, Zicheng Liu, Ying Wu, "Discriminative Video Pattern Search for Efficient Action Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 9, pp. 1728-1743, Sept. 2011, doi:10.1109/TPAMI.2011.38