Structure Inference for Bayesian Multisensory Scene Understanding
December 2008 (vol. 30 no. 12)
pp. 2140-2157
We investigate a solution to the problem of multisensor scene understanding by formulating it in the framework of Bayesian model selection and structure inference. Humans robustly associate multimodal data as appropriate, but previous modelling work has focused largely on optimal fusion, leaving segregation unaccounted for and unexploited by machine perception systems. We illustrate a unifying Bayesian solution to multisensor perception and tracking which accounts for both integration and segregation by explicit probabilistic reasoning about data association in a temporal context. Such explicit inference of multimodal data association is also of intrinsic interest for higher-level understanding of multisensory data. We illustrate this using a probabilistic implementation of data association in a multi-party audio-visual scenario, where unsupervised learning and structure inference are used to automatically segment, associate, and track individual subjects in audio-visual sequences. Indeed, the structure-inference-based framework introduced in this work provides the theoretical foundation needed to satisfactorily explain many confounding results in human psychophysics experiments involving multimodal cue integration and association.
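To make the distinction between integration and segregation concrete, the following is a minimal sketch of Bayesian model selection between two candidate structures for a single pair of audio and visual location measurements. It is not the model of the paper, which reasons about association over time and learns its parameters without supervision. The sketch assumes one-dimensional Gaussian observation noise, a zero-mean Gaussian prior over source location, and the hypothetical names `posterior_common_source` and `fused_estimate`. All numeric values are illustrative.

```python
import math


def log_gauss2(x, y, var_x, var_y, cov):
    """Log density of a zero-mean bivariate Gaussian evaluated at (x, y)."""
    det = var_x * var_y - cov * cov
    quad = (var_y * x * x - 2.0 * cov * x * y + var_x * y * y) / det
    return -0.5 * (quad + math.log(det)) - math.log(2.0 * math.pi)


def posterior_common_source(x_a, x_v, var_a, var_v, var_prior, prior_common=0.5):
    """Posterior probability that the audio and visual cues share one source.

    Assumed toy model: each latent source location is drawn from N(0, var_prior),
    and each cue equals its source location plus independent Gaussian noise.
    """
    # Structure 1 (integrate): one shared latent location, so the cues covary.
    log_m1 = log_gauss2(x_a, x_v, var_prior + var_a, var_prior + var_v, var_prior)
    # Structure 2 (segregate): two independent latent locations, zero covariance.
    log_m2 = log_gauss2(x_a, x_v, var_prior + var_a, var_prior + var_v, 0.0)
    log_odds = (log_m1 - log_m2
                + math.log(prior_common) - math.log(1.0 - prior_common))
    # Numerically stable logistic function of the log posterior odds.
    if log_odds >= 0.0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


def fused_estimate(x_a, x_v, var_a, var_v, var_prior):
    """Posterior mean location under the common-source structure.

    This is the precision-weighted average of the two cues and the prior mean (0).
    """
    w_a, w_v, w_p = 1.0 / var_a, 1.0 / var_v, 1.0 / var_prior
    return (w_a * x_a + w_v * x_v) / (w_a + w_v + w_p)


if __name__ == "__main__":
    # Nearby cues favour a common source; widely separated cues favour segregation.
    print(posterior_common_source(1.0, 0.5, var_a=4.0, var_v=1.0, var_prior=100.0))
    print(posterior_common_source(10.0, -10.0, var_a=4.0, var_v=1.0, var_prior=100.0))
    print(fused_estimate(1.0, 0.5, var_a=4.0, var_v=1.0, var_prior=100.0))
```

Under these assumptions, fusion is only appropriate when the posterior favours the common-source structure. When it does not, each cue is better explained by its own source, and averaging the two would produce a location that matches neither.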


Index Terms:
Pattern Recognition, Scene Analysis, Sensor fusion
Citation:
Timothy M. Hospedales, Sethu Vijayakumar, "Structure Inference for Bayesian Multisensory Scene Understanding," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 12, pp. 2140-2157, Dec. 2008, doi:10.1109/TPAMI.2008.25