2001 IEEE International Conference on Multimedia and Expo (ICME'01)
SPEECH DETECTION BY FACIAL IMAGE FOR MULTIMODAL SPEECH RECOGNITION
Tokyo, Japan, August 22-August 25
ISBN: 0-7695-1198-8
In this paper, we propose a method to detect speech from facial images for multimodal speech recognition. It is widely acknowledged that the accuracy of speech detection contributes to overall speech recognition performance. While audio-based speech detection performs well under clean conditions, its performance degrades in the presence of audio noise. We have therefore conducted research on video-based speech detection, which is robust not only to audio noise but also to the speaker's motion and other disturbances in the video modality [1]. However, detection accuracy suffers because the facial motion that accompanies speech intrinsically lasts longer than the speech itself. The proposed method therefore first detects the section that includes the speech by means of robust video-based speech detection, and then applies audio-based speech detection within that section to improve accuracy. Our method locates the face area by skin color and estimates the region that includes the speech organs. The speech is then detected from the magnitude of the image change in that region, without explicitly detecting any organs. An experiment confirms that the proposed method improves the speech recognition rate in a noisy environment (SNR 10 dB) as well as on the audio track of a VCR (SNR 25.4 dB).
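The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed thresholds, the hand-specified rectangular region, and the assumption of one audio energy value per video frame are all hypothetical, and the skin-color face localization step is not shown.

```python
def frame_change(prev, curr, region):
    """Mean absolute pixel difference inside region = (top, left, bottom, right)."""
    top, left, bottom, right = region
    total = 0
    count = 0
    for y in range(top, bottom):
        for x in range(left, right):
            total += abs(curr[y][x] - prev[y][x])
            count += 1
    return total / count if count else 0.0


def active_span(values, threshold):
    """Return (first, last) indices whose value exceeds threshold, or None."""
    hits = [i for i, v in enumerate(values) if v > threshold]
    return (hits[0], hits[-1]) if hits else None


def detect_speech(frames, audio_energy, region, video_thr, audio_thr):
    """Video-based detection first, then audio-based detection inside it.

    frames: list of 2-D grayscale frames (lists of lists of numbers).
    audio_energy: one energy value per frame (assumed aligned with frames).
    Returns an inclusive (start, end) frame range, or None.
    """
    # change[i] measures the image change between frame i and frame i + 1.
    change = [frame_change(frames[i], frames[i + 1], region)
              for i in range(len(frames) - 1)]
    video = active_span(change, video_thr)
    if video is None:
        return None
    # Motion between frames i and i + 1 covers both frames.
    start, end = video[0], video[1] + 1
    # Audio energy outside the video-detected section is ignored.
    audio = active_span(audio_energy[start:end + 1], audio_thr)
    if audio is None:
        return None
    return (start + audio[0], start + audio[1])
```

In this sketch, loud audio that falls outside the video-detected section cannot trigger a detection, while the audio stage trims the longer motion interval down to the frames that actually carry sound.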
Citation:
K. Murai, K. Kumatani, S. Nakamura, "SPEECH DETECTION BY FACIAL IMAGE FOR MULTIMODAL SPEECH RECOGNITION," icme, pp.275, 2001 IEEE International Conference on Multimedia and Expo (ICME'01), 2001