Issue No.08 - Aug. (2013 vol.35)
pp: 1915-1929
Clement Farabet, New York University, New York and Universite Paris-Est, Paris
Camille Couprie, New York University, New York
Laurent Najman, Universite Paris-Est, Paris
Yann LeCun, New York University, New York
ABSTRACT
Scene labeling consists of labeling each pixel in an image with the category of the object it belongs to. We propose a method that uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. The method alleviates the need for engineered features and produces a powerful representation that captures texture, shape, and contextual information. We report results using multiple postprocessing methods to produce the final labeling. Among those, we propose a technique to automatically retrieve, from a pool of segmentation components, an optimal set of components that best explains the scene; these components are arbitrary: for example, they can be taken from a segmentation tree or from any family of oversegmentations. The system yields record accuracies on the SIFT Flow dataset (33 classes) and the Barcelona dataset (170 classes) and near-record accuracy on the Stanford Background dataset (eight classes), while being an order of magnitude faster than competing approaches, producing a $320\times 240$ image labeling in less than a second, including feature extraction.
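The idea of dense, multiscale per-pixel features can be illustrated with a minimal sketch. It is not the authors' network: the single random filter bank, the use of the same filters at every scale, the tanh nonlinearity, the 2x average-pooling pyramid, the nearest-neighbour upsampling, and all function names are assumptions made for illustration only. It shows how one feature vector per pixel can summarize regions of several sizes centered on that pixel.

```python
import numpy as np

def downsample2x(img):
    """Halve resolution by averaging non-overlapping 2x2 blocks of an (H, W, C) array."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def conv_same(img, filters):
    """Zero-padded correlation of an (H, W, C) image with (K, k, k, C) filters, then tanh."""
    K, k, _, C = filters.shape
    pad = k // 2  # assumes odd k so the output keeps the input size
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    H, W = img.shape[:2]
    out = np.zeros((H, W, K))
    for dy in range(k):
        for dx in range(k):
            patch = padded[dy:dy + H, dx:dx + W, :]      # (H, W, C)
            out += patch @ filters[:, dy, dx, :].T       # (H, W, K)
    return np.tanh(out)

def upsample_to(feat, shape):
    """Nearest-neighbour upsampling of (h, w, K) features to (H, W, K)."""
    H, W = shape
    rows = np.arange(H) * feat.shape[0] // H
    cols = np.arange(W) * feat.shape[1] // W
    return feat[rows][:, cols]

def multiscale_features(img, filters, n_scales=3):
    """Apply the same filter bank at each pyramid level and stack the results per pixel."""
    H, W = img.shape[:2]
    feats, level = [], img
    for _ in range(n_scales):
        feats.append(upsample_to(conv_same(level, filters), (H, W)))
        level = downsample2x(level)  # coarser level: each filter sees a larger region
    return np.concatenate(feats, axis=-1)

rng = np.random.default_rng(0)
image = rng.random((240, 320, 3))                    # stand-in for a 320x240 RGB image
filters = rng.standard_normal((16, 7, 7, 3)) * 0.1   # hypothetical, untrained filters
features = multiscale_features(image, filters)
print(features.shape)  # (240, 320, 48)
```

Each pixel ends up with 48 values here: 16 filter responses at each of 3 scales. In a trained system the filters would be learned from labeled pixels and the per-pixel vectors passed to a classifier and a postprocessing step, as the abstract describes.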
INDEX TERMS
Feature extraction, Image segmentation, Labeling, Vectors, Context, Image edge detection, Accuracy, scene parsing, Convolutional networks, deep learning, image segmentation, image classification
CITATION
Clement Farabet, Camille Couprie, Laurent Najman, Yann LeCun, "Learning Hierarchical Features for Scene Labeling", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.35, no. 8, pp. 1915-1929, Aug. 2013, doi:10.1109/TPAMI.2012.231