Issue No. 12 - Dec. 2012 (vol. 34)
pp. 2441-2453
A. Patron-Perez , Dept. of Comput. Sci., George Washington Univ., Washington, DC, USA
M. Marszalek , Google, Inc., Adliswil, Switzerland
I. Reid , Dept. of Eng. Sci., Univ. of Oxford, Oxford, UK
A. Zisserman , Dept. of Eng. Sci., Univ. of Oxford, Oxford, UK
ABSTRACT
The objective of this work is recognition and spatiotemporal localization of two-person interactions in video. Our approach is person-centric. As a first stage we track all upper bodies and heads in a video using a tracking-by-detection approach that combines detections with KLT tracking and clique partitioning, together with occlusion detection, to yield robust person tracks. We develop local descriptors of activity based on the head orientation (estimated using a set of pose-specific classifiers) and the local spatiotemporal region around each head, together with global descriptors that encode the relative positions of people as a function of interaction type. Learning and inference on the model use a structured output SVM which combines the local and global descriptors in a principled manner. Inference using the model yields information about which pairs of people are interacting, their interaction class, and their head orientation (which is also treated as a variable, enabling mistakes made by the classifier to be corrected using global context). We show that inference can be carried out with polynomial complexity in the number of people, and describe an efficient algorithm for this. The method is evaluated on a new dataset comprising 300 video clips acquired from 23 different TV shows and on the benchmark UT-Interaction dataset.
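The abstract outlines a linear structured-output score that combines per-person local descriptors (head orientation plus spatiotemporal features) with pairwise global descriptors (relative position as a function of interaction class), and inference over pairs of people that is polynomial in the number of people. The following is a minimal, hypothetical sketch of that scoring-and-pairing idea only, not the authors' algorithm; all names (local_desc, global_desc, w_local, w_global, N_CLASSES, N_ORIENTS) are illustrative placeholders, and the real descriptors, learned weights, and inference procedure are those defined in the paper.

```python
# Illustrative sketch of structured pairwise scoring and exhaustive inference.
# Placeholder descriptors and weights; not the authors' implementation.
import itertools
import numpy as np

N_CLASSES = 4   # hypothetical number of interaction classes
N_ORIENTS = 5   # hypothetical number of discretised head orientations

def local_desc(track, orient):
    """Placeholder local descriptor for one person: head-orientation and
    local spatiotemporal features around the head."""
    return np.random.rand(16)

def global_desc(track_a, track_b, cls):
    """Placeholder global descriptor: relative position of the two people,
    encoded as a function of the hypothesised interaction class."""
    return np.random.rand(8)

def pair_score(w_local, w_global, ta, tb, cls):
    """Best linear score for a candidate pair under interaction class `cls`,
    maximising over the two (latent) head orientations."""
    best = -np.inf
    for oa, ob in itertools.product(range(N_ORIENTS), repeat=2):
        s = (w_local[cls] @ (local_desc(ta, oa) + local_desc(tb, ob))
             + w_global[cls] @ global_desc(ta, tb, cls))
        best = max(best, s)
    return best

def infer_interactions(tracks, w_local, w_global):
    """Score every person pair and class: O(n^2) in the number of people,
    i.e. polynomial as the abstract states (the paper gives a more refined
    algorithm). Returns (pair, best class, score) sorted by score."""
    results = []
    for (i, ta), (j, tb) in itertools.combinations(enumerate(tracks), 2):
        scores = [pair_score(w_local, w_global, ta, tb, c) for c in range(N_CLASSES)]
        c_best = int(np.argmax(scores))
        results.append(((i, j), c_best, scores[c_best]))
    return sorted(results, key=lambda r: -r[2])

# Toy usage with dummy tracks and random weights.
tracks = [object() for _ in range(4)]
w_local = np.random.rand(N_CLASSES, 16)
w_global = np.random.rand(N_CLASSES, 8)
print(infer_interactions(tracks, w_local, w_global)[:2])
```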
INDEX TERMS
Magnetic heads, Context awareness, Support vector machines, Video retrieval, Human factors, Tracking, Spatiotemporal phenomena, Structured SVM, Human interaction recognition
CITATION
A. Patron-Perez, M. Marszalek, I. Reid, A. Zisserman, "Structured Learning of Human Interactions in TV Shows," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 12, pp. 2441-2453, Dec. 2012, doi:10.1109/TPAMI.2012.24
REFERENCES
[1] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[2] V. Ferrari, M. Marin-Jimenez, and A. Zisserman, "Progressive Search Space Reduction for Human Pose Estimation," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[3] A. Kläser, M. Marszalek, C. Schmid, and A. Zisserman, "Human Focused Action Localization in Video," Proc. Int'l Workshop Sign, Gesture, and Activity, 2010.
[4] B. Benfold and I. Reid, "Guiding Visual Surveillance by Tracking Human Attention," Proc. British Machine Vision Conf., 2009.
[5] A. Patron-Perez, M. Marszalek, A. Zisserman, and I. Reid, "High Five: Recognising Human Interactions in TV Shows," Proc. British Machine Vision Conf., 2010.
[6] I. Laptev and P. Perez, "Retrieving Actions in Movies," Proc. 11th IEEE Int'l Conf. Computer Vision, 2007.
[7] M. Marszalek, I. Laptev, and C. Schmid, "Actions in Context," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[8] M.D. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: A Spatio-Temporal Maximum Average Correlation Height Filter for Action Recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[9] J. Liu, J. Luo, and M. Shah, "Recognizing Realistic Actions from Videos 'in the Wild'," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[10] X. Wu, C.W. Ngo, J. Li, and Y. Zhang, "Localizing Volumetric Motion for Action Recognition in Realistic Videos," Proc. ACM Int'l Conf. Multimedia, 2009.
[11] N. Oliver, B. Rosario, and A. Pentland, "Graphical Models for Recognizing Human Interactions," Proc. Neural Information Processing Systems Conf., 1998.
[12] S. Park and J.K. Aggarwal, "A Hierarchical Bayesian Network for Event Recognition of Human Actions and Interactions," Multimedia Systems, vol. 10, no. 2, pp. 164-179, 2004.
[13] S. Park and J.K. Aggarwal, "Simultaneous Tracking of Multiple Body Parts of Interacting Persons," Computer Vision and Image Understanding, vol. 102, no. 1, pp. 1-21, 2006.
[14] M.S. Ryoo and J.K. Aggarwal, "Recognition of Composite Human Activities through Context-Free Grammar based Representation," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[15] M.S. Ryoo and J.K. Aggarwal, "Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities," Proc. 12th IEEE Int'l Conf. Computer Vision, 2009.
[16] F. Yuan, V. Prinet, and J. Yuan, "Middle-Level Representation for Human Activities Recognition: The Role of Spatio-Temporal Relationships," Proc. European Conf. Computer Vision Workshop Human Motion, 2010.
[17] J.K. Aggarwal and M.S. Ryoo, "Human Activity Analysis: A Review," ACM Computing Surveys, vol. 43, no. 3, 2011.
[18] B. Taskar, C. Guestrin, and D. Koller, "Max-Margin Markov Networks," Proc. Neural Information Processing Systems Conf., 2003.
[19] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, "Large Margin Methods for Structured and Interdependent Output Variables," J. Machine Learning Research, vol. 6, pp. 1453-1484, 2005.
[20] M. Blaschko and C. Lampert, "Learning to Localize Objects with Structured Output Regression," Proc. 10th European Conf. Computer Vision, 2008.
[21] M. Blaschko and C. Lampert, "Object Localization with Global and Local Context Kernels," Proc. British Machine Vision Conference, 2009.
[22] C. Desai, D. Ramanan, and C. Fowlkes, "Discriminative Models for Multi-Class Object Layout," Proc. 12th IEEE Int'l Conf. Computer Vision, 2009.
[23] Y. Wang and G. Mori, "A Discriminative Latent Model of Object Classes and Attributes," Proc. 11th European Conf. Computer Vision, 2010.
[24] T. Lan, Y. Wang, W. Yang, and G. Mori, "Beyond Actions: Discriminative Models for Contextual Group Activities," Proc. Neural Information Processing Systems Conf., 2010.
[25] J.C. Niebles, C.W. Chen, and L. Fei-Fei, "Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification," Proc. 11th European Conf. Computer Vision, 2010.
[26] V. Ferrari, M. Marin-Jimenez, and A. Zisserman, "Pose Search: Retrieving People Using Their Pose," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[27] B. Lucas and T. Kanade, "An Iterative Image Registration Technique with an Application to Stereo Vision," Proc. Int'l Joint Conf. Artificial Intelligence, pp. 674-679, 1981.
[28] C. Tomasi and T. Kanade, "Detection and Tracking of Point Features," Technical Report CMU-CS-91-132, Carnegie Mellon Univ., 1991.
[29] M. Everingham, J. Sivic, and A. Zisserman, "Taking the Bite Out of Automatic Naming of Characters in TV Video," Image and Vision Computing, vol. 27, no. 5, pp. 545-559, 2009.
[30] B. Benfold and I. Reid, "Colour Invariant Head Pose Classification in Low Resolution Video," Proc. British Machine Vision Conf., 2008.
[31] T. Joachims, T. Finley, and C. Yu, "Cutting Plane Training of Structural SVMs," Machine Learning, vol. 77, no. 1, pp. 27-59, 2009.
[32] A. Patron-Perez, M. Marszalek, I. Reid, and A. Zisserman, "TV Human Interaction Data Set," http://www.robots.ox.ac.uk/~vgg/data/tv_human_interactions/, 2010.
[33] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge," Int'l J. Computer Vision, vol. 88, no. 2, pp. 303-338, 2010.
[34] T. Joachims, "Multi-Class Support Vector Machine," http://svmlight.joachims.org/svm_multiclass.html, 2008.
[35] A. Gilbert, J. Illingworth, and R. Bowden, "Fast Realistic Multi-Action Recognition Using Mined Dense Spatio-Temporal Features," Proc. 12th IEEE Int'l Conf. Computer Vision, 2009.
[36] M.S. Ryoo, C.-C. Chen, J.K. Aggarwal, and A. Roy-Chowdhury, "An Overview of Contest on Semantic Description of Human Activities 2010," Proc. Int'l Conf. Pattern Recognition Contests, 2010.
[37] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part Based Models," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627-1645, Sept. 2010.
[38] P. Felzenszwalb, R. Girshick, and D. McAllester, "Discriminatively Trained Deformable Part Models, Release 4," http://people.cs.uchicago.edu/~pff/latent-release4/, 2010.
[39] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior Recognition via Sparse Spatio-Temporal Features," Proc. Second Joint IEEE Int'l Workshop Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65-72, 2005.
[40] B. Benfold and I.D. Reid, "Stable Multi-Target Tracking in Real-Time Surveillance Video," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011.
[41] D. Waltisberg, A. Yao, J. Gall, and L. Van Gool, "Variations of a Hough-Voting Action Recognition System," Proc. Int'l Conf. Pattern Recognition Contest on Semantic Description of Human Activities, 2010.