Issue No.09 - Sept. (2012 vol.34)
pp: 1667-1680
Lixin Duan, Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
Dong Xu, Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
Ivor Wai-Hung Tsang, Sch. of Comput. Eng., Nanyang Technol. Univ., Singapore, Singapore
Jiebo Luo, Dept. of Comput. Sci., Univ. of Rochester, Rochester, NY, USA
ABSTRACT
We propose a visual event recognition framework for consumer videos that leverages a large number of loosely labeled web videos (e.g., from YouTube). Observing that consumer videos generally contain large intraclass variations within the same type of event, we first propose a new method, called Aligned Space-Time Pyramid Matching (ASTPM), to measure the distance between any two video clips. Second, we propose a new transfer learning method, referred to as Adaptive Multiple Kernel Learning (A-MKL), in order to 1) fuse the information from multiple pyramid levels and features (i.e., space-time features and static SIFT features) and 2) cope with the considerable variation in feature distributions between videos from two domains (i.e., the web video domain and the consumer video domain). For each pyramid level and each type of local feature, we first train a set of SVM classifiers on the combined training set from the two domains by using multiple base kernels of different kernel types and parameters; these classifiers are then fused with equal weights to obtain a prelearned average classifier. In A-MKL, for each event class we learn an adapted target classifier based on multiple base kernels and the prelearned average classifiers from this event class or from all the event classes by minimizing both the structural risk functional and the mismatch between the data distributions of the two domains. Extensive experiments demonstrate the effectiveness of our proposed framework, which requires only a small number of labeled consumer videos by leveraging web data. We also conduct an in-depth investigation of various aspects of the proposed method A-MKL, such as the combination coefficients of the prelearned classifiers, the convergence of the learning algorithm, and the performance variation when using different proportions of labeled consumer videos.
Moreover, we show that A-MKL using the prelearned classifiers from all the event classes leads to better performance when compared with A-MKL using the prelearned classifiers only from each individual event class.
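The abstract refers to a "mismatch between data distributions of two domains" without stating on this page how it is measured. As a minimal sketch only, the snippet below assumes a kernel-based Maximum Mean Discrepancy (MMD) criterion with an RBF kernel, one common way to quantify such a mismatch. The function names, the `gamma` value, and the toy 2-D feature vectors are hypothetical illustrations, not the authors' implementation or data.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel between two feature vectors (gamma is an assumed value)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(source, target, gamma=0.5):
    """Biased empirical estimate of the squared MMD between two samples.

    Equals the squared distance between the two sample means in the
    kernel-induced feature space.
    """
    return (mean_kernel(source, source, gamma)
            - 2.0 * mean_kernel(source, target, gamma)
            + mean_kernel(target, target, gamma))

# Toy stand-ins for video-level features from the two domains (hypothetical).
web_videos = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3)]
consumer_similar = [(0.1, 0.1), (0.0, 0.2), (0.2, 0.2)]
consumer_shifted = [(2.0, 2.1), (2.2, 1.9), (1.9, 2.3)]

mismatch_similar = mmd_squared(web_videos, consumer_similar)
mismatch_shifted = mmd_squared(web_videos, consumer_shifted)
```

Under these assumptions, `mismatch_shifted` comes out larger than `mismatch_similar`, which is the kind of quantity a domain adaptation objective would try to keep small while also controlling the classifier's structural risk.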
INDEX TERMS
Videos, Kernel, YouTube, Learning systems, Feature extraction, Visualization, Support vector machines, Event recognition, transfer learning, domain adaptation, cross-domain learning, adaptive MKL, aligned space-time pyramid matching
CITATION
Lixin Duan, Dong Xu, Ivor Wai-Hung Tsang, Jiebo Luo, "Visual Event Recognition in Videos by Learning from Web Data", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 9, pp. 1667-1680, Sept. 2012, doi:10.1109/TPAMI.2011.265