Depth Estimation from Image Structure
September 2002 (vol. 24 no. 9)
pp. 1226-1238

Abstract: In the absence of cues for absolute depth measurement, such as binocular disparity, motion, or defocus, the absolute distance between the observer and a scene cannot be measured. The interpretation of shading, edges, and junctions may provide a 3D model of the scene, but it will not provide information about the actual "scale" of the space. One possible source of information for absolute depth estimation is the image size of known objects. However, object recognition under unconstrained conditions remains difficult and unreliable for current computational approaches. Here, we propose a source of information for absolute depth estimation that is based on the whole scene structure and does not rely on specific objects. We demonstrate that, by recognizing the properties of the structures present in the image, we can infer the scale of the scene and, therefore, its absolute mean depth. We illustrate the value of computing the mean depth of the scene with applications to scene recognition and object detection.
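
As an illustration of the known-object cue mentioned in the abstract (not the scene-structure method the paper proposes), the standard pinhole-camera relation gives distance as focal length times real size divided by image size. The sketch below is a minimal example under that assumption; the function name, parameter names, and numbers are hypothetical and do not come from the paper.

```python
def depth_from_known_size(focal_length_px: float,
                          real_height_m: float,
                          image_height_px: float) -> float:
    """Pinhole-camera estimate of the distance (in meters) to an object
    whose real height is known.

    Assumption: an ideal pinhole camera with the object roughly
    fronto-parallel to the image plane.
    """
    if focal_length_px <= 0 or real_height_m <= 0 or image_height_px <= 0:
        raise ValueError("all inputs must be positive")
    # Similar triangles: image_height / focal_length = real_height / depth
    return focal_length_px * real_height_m / image_height_px


# Hypothetical numbers: a 1.7 m tall person imaged 85 px tall
# by a camera whose focal length is 800 px.
distance_m = depth_from_known_size(800.0, 1.7, 85.0)
print(distance_m)
```

The estimate depends entirely on recognizing the object and knowing its true size, which is the limitation the abstract points to when it motivates a cue based on whole-scene structure instead.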

Index Terms:
Depth, image statistics, scene structure, scene recognition, scale selection, monocular vision.
Antonio Torralba, Aude Oliva, "Depth Estimation from Image Structure," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 9, pp. 1226-1238, Sept. 2002, doi:10.1109/TPAMI.2002.1033214