Issue No.11 - November (2008 vol.30)
pp: 1985-1997
Dong Xu, Nanyang Technological University, Singapore
Shih-Fu Chang, Columbia University, New York
ABSTRACT
In this work, we systematically study the problem of event recognition in unconstrained news video sequences. We adopt the discriminative kernel-based method, for which video clip similarity plays an important role. First, we represent a video clip as a bag of orderless descriptors extracted from all of the constituent frames and apply the Earth Mover's Distance (EMD) to integrate similarities among frames from two clips. Observing that a video clip usually comprises multiple subclips corresponding to event evolution over time, we further build a multi-level temporal pyramid. At each pyramid level, we integrate the information from different subclips with integer-value constrained EMD to explicitly align the subclips. By fusing the information from the different pyramid levels, we develop Temporally Aligned Pyramid Matching (TAPM) for measuring video similarity. We conduct comprehensive experiments on the TRECVID 2005 corpus, which contains more than 6,800 clips. Our experiments demonstrate that 1) the TAPM multi-level method clearly outperforms single-level EMD, and 2) single-level EMD outperforms keyframe-based and multi-frame-based detection methods by a large margin. In addition, we conduct an in-depth investigation of various aspects of the proposed techniques, such as weight selection in single-level EMD, sensitivity to temporal clustering, the effect of temporal alignment, and possible approaches for speedup.
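The subclip alignment step at one pyramid level can be illustrated with a minimal sketch. It assumes that both clips are split into the same number of subclips with equal weights, so that restricting the EMD flows to integer values reduces the problem to a one-to-one assignment. The function name `align_subclips` and the distance values are hypothetical, and the brute-force search over permutations is a stand-in chosen for brevity, not necessarily the solver used in the article.

```python
from itertools import permutations


def align_subclips(dist):
    """Align subclips of two clips one-to-one at a single pyramid level.

    dist[i][j] is the distance between subclip i of clip A and subclip j
    of clip B (square matrix). Assumption: equal subclip weights and 0/1
    flows, so the constrained EMD becomes an assignment problem. It is
    solved here by brute force, which is only practical for a handful of
    subclips.
    """
    n = len(dist)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Total distance when subclip i of A is matched to perm[i] of B.
        cost = sum(dist[i][perm[i]] for i in range(n))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    # Normalise by the total flow (one unit per matched pair).
    return best_cost / n, list(best_perm)


# Hypothetical distances between 2 subclips of clip A and 2 of clip B.
dist = [[0.9, 0.2],
        [0.1, 0.8]]
distance, assignment = align_subclips(dist)
# assignment[i] is the subclip of B matched to subclip i of A;
# here the two subclips are matched in swapped temporal order.
print(distance, assignment)
```

For larger numbers of subclips, a polynomial-time assignment solver would replace the permutation loop; the per-level distances would then be fused across pyramid levels as the abstract describes.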
INDEX TERMS
Event Recognition, News Video, Concept Ontology, Temporally Aligned Pyramid Matching, Concept-based Video Indexing, Earth Mover's Distance
CITATION
Dong Xu, Shih-Fu Chang, "Video Event Recognition Using Kernel Methods with Multilevel Temporal Alignment", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 30, no. 11, pp. 1985-1997, November 2008, doi:10.1109/TPAMI.2008.129