Issue No.01 - January (2011 vol.33)
pp: 30-42
Antonio Criminisi, Microsoft Research Cambridge, Cambridge
John Winn, Microsoft Research Cambridge, Cambridge
Pei Yin, Microsoft Corp, Redmond
ABSTRACT
This paper presents an automatic segmentation algorithm for video frames captured by a (monocular) webcam that closely approximates depth segmentation from a stereo camera. The frames are segmented into foreground and background layers that comprise a subject (participant) and other objects and individuals. The algorithm produces correct segmentations even in the presence of large background motion with a nearly stationary foreground. This research makes three key contributions: First, we introduce a novel motion representation, referred to as “motons,” inspired by research in object recognition. Second, we propose estimating the segmentation likelihood from the spatial context of motion. The estimation is efficiently learned by random forests. Third, we introduce a general taxonomy of tree-based classifiers that facilitates both theoretical and experimental comparisons of several known classification algorithms and generates new ones. In our bilayer segmentation algorithm, diverse visual cues such as motion, motion context, color, contrast, and spatial priors are fused by means of a conditional random field (CRF) model. Segmentation is then achieved by binary min-cut. Experiments on many sequences of our videochat application demonstrate that our algorithm, which requires no initialization, is effective in a variety of scenes, and the segmentation results are comparable to those obtained by stereo systems.
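The final step described above, fusing per-pixel costs with a smoothness term and solving by binary min-cut, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a 1-D chain of pixels rather than a 2-D image, hand-picked integer costs standing in for the learned likelihoods, a constant (Potts) pairwise penalty rather than a contrast-sensitive one, and a generic Edmonds-Karp max-flow solver. The function name `min_cut_labels` and its parameters are hypothetical.

```python
from collections import deque


def min_cut_labels(unary_fg, unary_bg, pairwise_weight):
    """Label a 1-D chain of pixels as foreground (1) or background (0).

    unary_fg[i] / unary_bg[i]: cost of giving pixel i the fg / bg label.
    pairwise_weight: penalty paid when two neighbouring pixels differ.
    Illustrative only; the costs are assumed inputs, not learned here.
    """
    n = len(unary_fg)
    source, sink = n, n + 1
    cap = [dict() for _ in range(n + 2)]  # residual capacities

    def add_edge(u, v, c):
        cap[u][v] = cap[u].get(v, 0.0) + c
        cap[v].setdefault(u, 0.0)  # make sure the reverse edge exists

    for i in range(n):
        # A pixel left on the source side is foreground: the cut then
        # severs i->sink (pays unary_fg); otherwise it severs source->i.
        add_edge(source, i, unary_bg[i])
        add_edge(i, sink, unary_fg[i])
    for i in range(n - 1):
        add_edge(i, i + 1, pairwise_weight)
        add_edge(i + 1, i, pairwise_weight)

    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            break  # no path left: the flow is maximal
        # Find the bottleneck capacity along the path, then push flow.
        flow = float("inf")
        v = sink
        while parent[v] is not None:
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u

    # Nodes still reachable from the source form the foreground side.
    return [1 if i in parent else 0 for i in range(n)]


# Pixel 2 weakly prefers background, but its neighbours are foreground.
fg_cost = [1, 1, 6, 1, 9, 9]
bg_cost = [9, 9, 4, 9, 1, 1]
print(min_cut_labels(fg_cost, bg_cost, 0))  # unary terms only
print(min_cut_labels(fg_cost, bg_cost, 3))  # with smoothness term
```

With the pairwise weight set to zero each pixel simply takes its cheaper label; with a positive weight the isolated background pixel is absorbed into the surrounding foreground, which is the smoothing effect a pairwise term is meant to provide.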
INDEX TERMS
Computer vision, image understanding, machine learning, decision tree, random forests, boosting, motion analysis.
CITATION
Antonio Criminisi, John Winn, Pei Yin, "Bilayer Segmentation of Webcam Videos Using Tree-Based Classifiers," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 33, no. 1, pp. 30-42, January 2011, doi:10.1109/TPAMI.2010.65