Issue No.11 - Nov. (2013 vol.35)
pp: 2782-2795
A. Gaidon , Xerox Res. Centre Eur., Meylan, France
Z. Harchaoui , INRIA Grenoble Rhone-Alpes, Montbonnot, France
C. Schmid , INRIA Grenoble Rhone-Alpes, Montbonnot, France
ABSTRACT
We address the problem of localizing actions, such as opening a door, in hours of challenging video data. We propose a model based on a sequence of atomic action units, termed "actoms," that are semantically meaningful and characteristic of the action. Our actom sequence model (ASM) represents an action as a sequence of histograms of actom-anchored visual features, which can be seen as a temporally structured extension of the bag-of-features. Training requires the annotation of actoms for action examples. At test time, actoms are localized automatically based on a nonparametric model of the distribution of actoms, which also acts as a prior on an action's temporal structure. We present experimental results on two recent benchmarks for action localization: the "Coffee and Cigarettes" and "DLSBP" datasets. We also adapt our approach to a classification-by-localization setup and demonstrate its applicability on the challenging "Hollywood 2" dataset. We show that our ASM method outperforms the current state of the art in temporal action localization, as well as baselines that localize actions with a sliding-window method.
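The idea of "a sequence of histograms of actom-anchored visual features" can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `actom_sequence_descriptor`, the fixed `radius` window around each actom, the hard assignment of features to actoms, and the L1 normalization are all assumptions made for illustration. Each local feature is assumed to be already quantized into a visual-word index and tagged with a timestamp.

```python
from typing import List, Sequence, Tuple


def actom_sequence_descriptor(
    features: Sequence[Tuple[float, int]],  # (timestamp, visual-word index)
    actom_times: Sequence[float],           # temporal centers of the actoms, in order
    radius: float,                          # hypothetical half-width of each actom window
    vocab_size: int,                        # number of visual words
) -> List[float]:
    """Concatenate one L1-normalized bag-of-features histogram per actom.

    The order of the concatenated histograms follows the order of the actoms,
    which is what adds temporal structure to a plain bag-of-features.
    """
    descriptor: List[float] = []
    for center in actom_times:
        hist = [0.0] * vocab_size
        # Hard assignment (an assumption): count features inside the window.
        for t, word in features:
            if abs(t - center) <= radius:
                hist[word] += 1.0
        total = sum(hist)
        if total > 0:
            hist = [h / total for h in hist]
        descriptor.extend(hist)
    return descriptor


# Toy example: three actoms, a vocabulary of three visual words.
toy_features = [(0.0, 0), (1.0, 1), (5.0, 2), (5.5, 2), (9.0, 0)]
toy_descriptor = actom_sequence_descriptor(
    toy_features, actom_times=[0.5, 5.0, 9.0], radius=1.0, vocab_size=3
)
print(toy_descriptor)
```

With a single actom whose window covers the whole clip, the descriptor reduces to an ordinary bag-of-features histogram, which is the sense in which the sequence form extends it.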
INDEX TERMS
Training, Hidden Markov models, Visualization, Spatiotemporal phenomena, Adaptation models, Support vector machines, Histograms, actoms, Action recognition, video analysis, temporal localization
CITATION
A. Gaidon, Z. Harchaoui, C. Schmid, "Temporal Localization of Actions with Actoms", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.35, no. 11, pp. 2782-2795, Nov. 2013, doi:10.1109/TPAMI.2013.65