Issue No.07 - July (2012 vol.34)
pp: 1394-1408
Congcong Li, Cornell University, Ithaca
Adarsh Kowdle, Cornell University, Ithaca
Ashutosh Saxena, Cornell University, Ithaca
Tsuhan Chen, Cornell University, Ithaca
ABSTRACT
Scene understanding includes many related subtasks, such as scene categorization, depth estimation, and object detection. Each of these subtasks is often notoriously hard, and state-of-the-art classifiers already exist for many of them. These classifiers operate on the same raw image and provide correlated outputs. It is desirable to have an algorithm that can capture such correlation without requiring any changes to the inner workings of any classifier. We propose Feedback Enabled Cascaded Classification Models (FE-CCM), which jointly optimize all the subtasks while requiring only a “black box” interface to the original classifier for each subtask. We use a two-layer cascade of classifiers, which are repeated instantiations of the original ones, with the output of the first layer fed into the second layer as input. Our training method involves a feedback step that allows later classifiers to tell earlier classifiers which error modes to focus on. We show that our method significantly improves performance in all the subtasks in the domain of scene understanding, where we consider depth estimation, scene categorization, event categorization, object detection, geometric labeling, and saliency detection. Our method also improves performance in two robotic applications: an object-grasping robot and an object-finding robot.
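The feed-forward part of the two-layer cascade described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each subtask classifier exposes only `fit` and `predict` (the "black box" interface) and that the second layer sees the original features concatenated with every first-layer output. The feedback training step is omitted. The names `CentroidScorer` and `TwoLayerCascade` are hypothetical and used only for illustration.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class CentroidScorer:
    """Hypothetical stand-in for a black-box subtask classifier.

    The cascade only calls fit() and predict(), so any model with the
    same two methods could be swapped in.
    """

    def fit(self, X: Sequence[Vector], y: Sequence[int]) -> "CentroidScorer":
        dim = len(X[0])
        self.centroids: Dict[int, Vector] = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            # Mean feature vector of all training rows with this label.
            self.centroids[label] = [
                sum(r[j] for r in rows) / len(rows) for j in range(dim)
            ]
        return self

    def predict(self, X: Sequence[Vector]) -> List[int]:
        def dist2(a: Vector, b: Vector) -> float:
            return sum((p - q) ** 2 for p, q in zip(a, b))

        # Assign each row to the label of the nearest centroid.
        return [
            min(self.centroids, key=lambda c: dist2(x, self.centroids[c]))
            for x in X
        ]


class TwoLayerCascade:
    """Two layers of repeated instantiations of the same black-box model."""

    def __init__(self, make_model: Callable[[], CentroidScorer], n_tasks: int):
        self.layer1 = [make_model() for _ in range(n_tasks)]
        self.layer2 = [make_model() for _ in range(n_tasks)]

    def _augment(self, X: Sequence[Vector]) -> List[Vector]:
        # Assumption: second-layer input = raw features + all layer-1 outputs.
        outputs = [m.predict(X) for m in self.layer1]
        return [
            list(x) + [float(out[i]) for out in outputs]
            for i, x in enumerate(X)
        ]

    def fit(self, X: Sequence[Vector], Y: Sequence[Sequence[int]]) -> "TwoLayerCascade":
        # Y[k] holds the labels for subtask k.
        for model, y in zip(self.layer1, Y):
            model.fit(X, y)
        X2 = self._augment(X)
        for model, y in zip(self.layer2, Y):
            model.fit(X2, y)
        return self

    def predict(self, X: Sequence[Vector]) -> List[List[int]]:
        X2 = self._augment(X)
        return [model.predict(X2) for model in self.layer2]
```

Calling `TwoLayerCascade(CentroidScorer, n_tasks=2).fit(X, Y).predict(X_new)` returns one list of predictions per subtask. Because each second-layer model receives the first-layer outputs of all subtasks, it can exploit correlations between them without any change to the underlying classifier.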
INDEX TERMS
Scene understanding, classification, machine learning, robotics.
CITATION
Congcong Li, Adarsh Kowdle, Ashutosh Saxena, Tsuhan Chen, "Toward Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 34, no. 7, pp. 1394-1408, July 2012, doi:10.1109/TPAMI.2011.232