Issue No. 5 - May 2011 (vol. 33)
pp. 883-897
Andrew Gilbert , University of Surrey, Guildford
John Illingworth , University of Surrey, Guildford
Richard Bowden , University of Surrey, Guildford
ABSTRACT
The field of action recognition has seen a large increase in activity in recent years. Much of the progress has come from incorporating ideas from single-frame object recognition and adapting them to temporal action recognition. Inspired by the success of interest points in the 2D spatial domain, their 3D (space-time) counterparts typically form the basic components used to describe actions, and the features used in action recognition are often engineered to fire sparsely. This keeps the problem tractable, but it can sacrifice recognition accuracy, since there is no guarantee that the features most discriminative between classes survive such sparse detection. In contrast, we propose to start from an overcomplete set of simple 2D corners in both space and time. These are grouped spatially and temporally by a hierarchical process with an increasing search area. At each stage of the hierarchy, the most distinctive and descriptive features are learned efficiently through data mining, which allows large amounts of data to be searched for frequently reoccurring patterns of features. At each level of the hierarchy, the mined compound features become more complex, discriminative, and sparse. Because the compound features are constructed and selected for their discriminative power, both speed and accuracy increase at each level, yielding fast, accurate recognition with real-time performance on high-resolution video. The approach is tested on four data sets: the popular KTH data set, to provide a comparison with other state-of-the-art approaches; the Multi-KTH data set, to illustrate simultaneous multiaction classification even though no explicit localization information is provided during training; and the recent Hollywood and Hollywood2 data sets, which provide challenging complex actions taken from commercial movie sequences.
On all four data sets, the proposed hierarchical approach outperforms all other methods reported thus far in the literature and can achieve real-time operation.
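As a rough illustration of the dense 2D corner front end the abstract describes, the sketch below computes a Harris-style corner response with NumPy on a synthetic frame. This is an illustrative toy under stated assumptions (central-difference gradients, a 3x3 box filter, k = 0.04), not the authors' detector; the paper applies such simple corners in the temporal planes as well as the spatial one.

```python
# Illustrative sketch, not the authors' code: Harris-style corner
# response on a single 2D frame. All helper names are assumptions.
import numpy as np

def harris_response(img, k=0.04):
    """Return the Harris corner response map for a 2D float image."""
    # Image gradients via central differences.
    Ix = np.gradient(img, axis=1)
    Iy = np.gradient(img, axis=0)

    def box3(a):
        # 3x3 box filter built from shifted views (edge padding).
        p = np.pad(a, 1, mode="edge")
        h, w = a.shape
        return sum(p[i:i + h, j:j + w]
                   for i in range(3) for j in range(3)) / 9.0

    # Smoothed structure-tensor entries.
    Sxx, Syy, Sxy = box3(Ix * Ix), box3(Iy * Iy), box3(Ix * Iy)
    # Harris response: det(M) - k * trace(M)^2.
    return (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2

# Demo on a synthetic bright square: the response is positive at the
# square's corners, negative along its edges, and ~0 in flat regions.
frame = np.zeros((20, 20))
frame[5:15, 5:15] = 1.0
R = harris_response(frame)
```

An overcomplete detector in this spirit simply keeps every pixel whose response exceeds a low threshold, rather than engineering the detector to fire sparsely.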
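The "frequently reoccurring patterns of features" are found with Apriori-style association-rule mining. The following is a minimal sketch of the frequent-itemset step only, run on toy transactions of quantized corner labels; the function name, the example data, and the support threshold are illustrative assumptions, not the paper's configuration or scale.

```python
# Illustrative sketch, not the authors' implementation: Apriori-style
# frequent-itemset mining over toy "transactions" (sets of quantized
# corner labels seen in a neighbourhood). Frequent itemsets stand in
# for the mined compound features.
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]

    def support(c):
        return sum(c <= t for t in sets) / n

    # Level 1: frequent single items.
    items = sorted({i for t in sets for i in t})
    current = [frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support]
    frequent = {c: support(c) for c in current}

    k = 2
    while current:
        # Join frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent
```

In the paper this style of mining is what lets an overcomplete, individually weak feature set be searched efficiently: each hierarchy level mines compound features from the level below, so the surviving patterns grow more complex and more discriminative.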
INDEX TERMS
Action recognition, data mining, real-time, learning, spatiotemporal.
CITATION
Andrew Gilbert, John Illingworth, Richard Bowden, "Action Recognition Using Mined Hierarchical Compound Features," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 33, no. 5, pp. 883-897, May 2011, doi:10.1109/TPAMI.2010.144