Issue No.08 - August (2009 vol.31)
pp: 1415-1428
Tae-Kyun Kim, University of Cambridge, Cambridge
Roberto Cipolla, University of Cambridge, Cambridge
ABSTRACT
This paper addresses a spatiotemporal pattern recognition problem: finding a suitable representation and matching scheme for action video volumes so that they can be categorized. We propose a novel method that measures video-to-video volume similarity by extending Canonical Correlation Analysis (CCA), a principled tool for inspecting linear relations between two sets of vectors, to two multiway data arrays (tensors). The method takes video volumes directly as input, which avoids the difficult problem of explicit motion estimation required by traditional methods, and yields spatiotemporal pattern matching that is robust to intraclass variations of actions. The proposed matching is demonstrated for action classification with a simple Nearest Neighbor classifier. We also propose an automatic action detection method that performs a 3D window search over an input video using action exemplars; the search is sped up by dynamic learning of subspaces in the proposed CCA. Experiments on a public action data set (KTH) and a self-recorded hand gesture data set showed that the proposed method is significantly more accurate than various state-of-the-art methods. The method has low time complexity and does not require any major tuning parameters.
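As a rough illustration of the vector-set CCA that the abstract takes as its starting point, the sketch below computes canonical correlations between two sets of vectors as the cosines of the principal angles between their spans (orthonormalize each set with QR, then take the singular values of the product of the bases). It is not the tensor extension proposed in the paper. The function name `canonical_correlations`, the use of NumPy, and the assumption that each input matrix has full column rank are choices made for this example only.

```python
import numpy as np


def canonical_correlations(X, Y):
    """Canonical correlations between the column spaces of X and Y.

    X: (n, p) array, Y: (n, q) array; columns are the vectors of each set.
    Assumption for this sketch: both matrices have full column rank.
    Returns min(p, q) values in [0, 1], sorted in descending order.
    """
    # Orthonormal bases for the two column spaces (reduced QR).
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the cosines of the principal angles.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    # Clip to guard against values slightly outside [0, 1] from round-off.
    return np.clip(s, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 3))
    Y = rng.standard_normal((50, 3))
    # A similarity score could be, for example, the mean canonical correlation.
    print(canonical_correlations(X, Y).mean())
```

Two sets that span the same subspace give correlations of 1, and mutually orthogonal subspaces give 0, so a score such as the mean correlation could feed a Nearest Neighbor classifier of the kind the abstract mentions.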
INDEX TERMS
Action categorization, gesture recognition, canonical correlation analysis, tensor, action detection, incremental subspace learning, spatiotemporal pattern classification.
CITATION
Tae-Kyun Kim, Roberto Cipolla, "Canonical Correlation Analysis of Video Volume Tensors for Action Categorization and Detection", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 31, no. 8, pp. 1415-1428, August 2009, doi:10.1109/TPAMI.2008.167