Automatic Analysis of Multimodal Group Actions in Meetings
March 2005 (vol. 27 no. 3)
pp. 305-317
This paper investigates the recognition of group actions in meetings. A framework is employed in which group actions result from the interactions of the individual participants. The group actions are modeled using different HMM-based approaches, where the observations are provided by a set of audiovisual features monitoring the actions of individuals. Experiments demonstrate the importance of taking interactions into account in modeling the group actions. It is also shown that the visual modality contains useful information, even for predominantly audio-based events, motivating a multimodal approach to meeting analysis.
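As a rough illustration of the kind of HMM decoding the abstract refers to, the sketch below runs the Viterbi algorithm over a toy discrete-observation HMM. It is not the paper's model: the state names (`monologue`, `discussion`), the observation symbols (`one`, `many`, standing for how many participants appear active in a frame), and all probabilities are hypothetical values chosen only to make the example run. The abstract says only that the observations are audiovisual features.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state sequence for a discrete HMM."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t-1][s]: best predecessor of state s at time t
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

def to_log(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}

# Hypothetical group-action states and parameters (illustrative only).
STATES = ("monologue", "discussion")
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = to_log({
    "monologue": {"monologue": 0.9, "discussion": 0.1},
    "discussion": {"monologue": 0.1, "discussion": 0.9},
})
LOG_EMIT = to_log({
    "monologue": {"one": 0.9, "many": 0.1},
    "discussion": {"one": 0.3, "many": 0.7},
})

def decode(obs):
    """Decode a sequence of 'one'/'many' symbols into group-action labels."""
    return viterbi(obs, STATES, LOG_START, LOG_TRANS, LOG_EMIT)

if __name__ == "__main__":
    print(decode(["one", "one", "many", "many", "many", "one"]))
```

Because the group-level state is inferred from what all participants are doing in each frame, even this toy version reflects the idea that a group action is a property of the interaction rather than of any single person.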

Index Terms:
Statistical models, multimedia applications and numerical signal processing, computer conferencing, asynchronous interaction.
Iain McCowan, Daniel Gatica-Perez, Samy Bengio, Guillaume Lathoud, Mark Barnard, Dong Zhang, "Automatic Analysis of Multimodal Group Actions in Meetings," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 305-317, March 2005, doi:10.1109/TPAMI.2005.49