Issue No. 1 - January 2011 (vol. 33)
pp. 101-116
Sileye O. Ba, LabSTICC, Ecole Nationale des Télécommunications de Bretagne, Technopole Brest-Iroise
Jean-Marc Odobez, Idiap Research Institute, Martigny
ABSTRACT
This paper introduces a novel contextual model for the recognition of people's visual focus of attention (VFOA) in meetings from audio-visual perceptual cues. More specifically, instead of independently recognizing the VFOA of each meeting participant from his own head pose, we propose to jointly recognize the participants' visual attention in order to introduce context-dependent interaction models that relate to group activity and the social dynamics of communication. Meeting contextual information is represented by the location of people, conversational events identifying floor holding patterns, and a presentation activity variable. By modeling the interactions between the different contexts and their combined and sometimes contradictory impact on the gazing behavior, our model allows us to handle VFOA recognition in difficult task-based meetings involving artifacts, presentations, and moving people. We validated our model through rigorous evaluation on a publicly available and challenging data set of 12 real meetings (5 hours of data). The results demonstrated that the integration of the presentation and conversation dynamical context using our model can lead to significant performance improvements.
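For intuition only, a minimal sketch of the kind of context-modulated inference the abstract describes is given below. It is not the authors' model: it filters a single participant's VFOA with a plain HMM forward pass in which the transition prior is biased toward the current speaker, a crude stand-in for the conversational-event context (the paper jointly models all participants and adds presentation context). All target names, angles, and parameters are illustrative assumptions.

import numpy as np

# Hypothetical focus targets for one seated participant: three other
# participants plus a slide screen and the table (as in the meeting
# setting described in the abstract).
TARGETS = ["person_B", "person_C", "person_D", "slide_screen", "table"]
# Assumed mean head-pan angle (degrees) when gazing at each target.
# person_C and the table deliberately share pan 0: head pose alone is
# ambiguous there, and the conversational context resolves it.
TARGET_PAN = np.array([-60.0, 0.0, 60.0, 30.0, 0.0])
PAN_STD = 15.0  # assumed observation noise (degrees)

def obs_likelihood(pan):
    """Gaussian likelihood of an observed head pan under each target."""
    return np.exp(-0.5 * ((pan - TARGET_PAN) / PAN_STD) ** 2)

def transition_matrix(speaker_idx, stay=0.8, speaker_boost=3.0):
    """Sticky transitions whose switch prior is biased toward the
    current speaker (a stand-in for conversational-event context)."""
    K = len(TARGETS)
    switch = np.ones(K)
    if speaker_idx is not None:
        switch[speaker_idx] *= speaker_boost
    switch /= switch.sum()
    return stay * np.eye(K) + (1.0 - stay) * np.tile(switch, (K, 1))

def forward_vfoa(pans, speakers):
    """HMM forward pass: filtered posterior over VFOA targets per frame."""
    K = len(TARGETS)
    alpha = np.full(K, 1.0 / K)
    posteriors = []
    for pan, spk in zip(pans, speakers):
        alpha = alpha @ transition_matrix(spk)  # context-biased prediction
        alpha *= obs_likelihood(pan)            # head-pose evidence
        alpha /= alpha.sum()
        posteriors.append(alpha.copy())
    return np.array(posteriors)

if __name__ == "__main__":
    pans = [2.0, 1.0, -55.0, -62.0, 28.0]  # toy head-pan track
    speakers = [1, 1, 0, 0, None]          # person_C speaks, then person_B
    for t, p in enumerate(forward_vfoa(pans, speakers)):
        print(t, TARGETS[int(p.argmax())], p.round(2))

In the first two frames the speaker prior pulls the ambiguous pan-0 observation toward person_C rather than the table, which is the basic effect the paper exploits: contextual cues reshape the gaze prior that head pose alone cannot disambiguate.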
INDEX TERMS
Visual focus of attention, conversational events, multimodal, contextual cues, dynamic Bayesian network, head pose, meeting analysis.
CITATION
Sileye O. Ba, Jean-Marc Odobez, "Multiperson Visual Focus of Attention from Head Pose and Meeting Contextual Cues", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 33, no. 1, pp. 101-116, January 2011, doi:10.1109/TPAMI.2010.69