An Extended Grammar System for Learning and Recognizing Complex Visual Events
February 2011 (vol. 33 no. 2)
pp. 240-255
Zhang Zhang, Chinese Academy of Sciences, Beijing
Tieniu Tan, Chinese Academy of Sciences, Beijing
Kaiqi Huang, Chinese Academy of Sciences, Beijing
Grammar-based approaches to visual event recognition have two major limitations that hinder their use in real applications. First, the event rules are predefined by domain experts, which incurs a high manual cost. Second, the commonly used grammars can only handle sequential relations between subevents, which is inadequate for recognizing more complex events that involve parallel subevents. To solve these problems, we propose an extended grammar approach to modeling and recognizing complex visual events. First, motion trajectories, the original features, are transformed into a set of basic motion patterns of a single moving object, namely, the primitives (terminals) of the grammar system. Then, a Minimum Description Length (MDL) based rule induction algorithm is applied to discover the hidden temporal structures in the primitive stream, where the Stochastic Context-Free Grammar (SCFG) is extended with Allen's temporal logic to model the complex temporal relations between subevents. Finally, a Multithread Parsing (MTP) algorithm is adopted to recognize complex events of interest in a given primitive stream, and a Viterbi-like error recovery strategy is proposed to handle large-scale errors, e.g., insertion and deletion errors. Extensive experiments on gymnastic exercises, traffic light events, and multi-agent interactions validate the effectiveness of the proposed approach.
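
To illustrate the distinction between sequential and parallel subevents that Allen's temporal logic captures, the following is a minimal sketch, not the authors' implementation. It classifies the relation between two time intervals into one of Allen's thirteen relations. The names `Interval` and `allen_relation` are hypothetical, and the sketch assumes every interval satisfies start < end; how the paper attaches such relations to grammar rules is not stated in the abstract.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Time span of one subevent (hypothetical type; assumes start < end)."""
    start: float
    end: float


def allen_relation(a: Interval, b: Interval) -> str:
    """Return the Allen relation that holds for interval a relative to b."""
    if a.end < b.start:
        return "before"
    if a.end == b.start:
        return "meets"
    if a.start == b.start and a.end == b.end:
        return "equals"
    if a.start == b.start:
        return "starts" if a.end < b.end else "started-by"
    if a.end == b.end:
        return "finishes" if a.start > b.start else "finished-by"
    if b.start < a.start and a.end < b.end:
        return "during"
    if a.start < b.start and b.end < a.end:
        return "contains"
    if a.start < b.start < a.end < b.end:
        return "overlaps"
    # Only the inverses of before / meets / overlaps remain at this point,
    # so swapping the arguments resolves in a single further call.
    inverse = {"before": "after", "meets": "met-by", "overlaps": "overlapped-by"}
    return inverse[allen_relation(b, a)]


# Sequential subevents: the second starts exactly when the first ends.
sequential = allen_relation(Interval(0, 5), Interval(5, 9))
# Parallel subevents: the two intervals share part of their duration.
parallel = allen_relation(Interval(0, 5), Interval(3, 8))
print(sequential, parallel)
```

A purely sequential grammar can only express the "before" and "meets" cases; relations such as "overlaps" and "during" are the parallel cases that the extended grammar is meant to cover.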

Index Terms:
Rule induction, parsing, event recognition.
Citation:
Zhang Zhang, Tieniu Tan, Kaiqi Huang, "An Extended Grammar System for Learning and Recognizing Complex Visual Events," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 240-255, Feb. 2011, doi:10.1109/TPAMI.2010.60