Building or improving speech recognition systems with current technologies requires large amounts of hand-transcribed speech training data. Just as in the movie "E.T., the Extra-Terrestrial," intelligent systems should be able to learn automatically by watching examples. In this paper, we present a way to use television broadcasts with closed-captioned text as a source of large amounts of transcribed speech training data. The errorful closed-captioned text is aligned with the speech recognizer output, which is itself errorful. Matching segments of sufficient length are assumed to be reliable transcriptions of the corresponding speech, and these segments are used as training data for an improved speech recognizer. With this technique we automatically extracted 131 hours of transcribed speech data from a corpus of 709 hours of TV broadcasts. In conjunction with 66 hours of manually transcribed training data, this data improved the word error rate of our best current speech recognition system (Sphinx-III) during a single-pass, narrow-beam decoding from 32.82% to 31.19%. The 131 hours of automatically derived data alone resulted in a word error rate of 30.34%. A speech recognition system trained exclusively on just 70 hours of this automatically transcribed data produced a word error rate of 32.7%. Selecting a 66-hour subset of the automatic data by eliminating utterances with triphones above a threshold frequency yielded an error rate of 30.59%. The improved Sphinx-III speech recognition system is used in the Informedia Digital Video Library to create searchable transcripts of indexed video material.
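
The core idea of keeping only the stretches where captions and recognizer output agree can be illustrated with a minimal sketch. It is not the procedure used in the paper: the abstract does not specify the alignment algorithm or the minimum segment length, so Python's standard `difflib.SequenceMatcher` stands in for the aligner, and the function name `matching_segments`, the parameter `min_words`, and the example sentences are all hypothetical.

```python
from difflib import SequenceMatcher


def matching_segments(caption_words, hypothesis_words, min_words=3):
    """Return (hypothesis_start_index, words) for each run of words on which
    the caption text and the recognizer hypothesis agree, keeping only runs
    of at least `min_words` words (an assumed threshold)."""
    # difflib is a stand-in aligner; autojunk is disabled so that frequent
    # words such as "the" are never discarded from the comparison.
    matcher = SequenceMatcher(a=caption_words, b=hypothesis_words, autojunk=False)
    segments = []
    for block in matcher.get_matching_blocks():
        # The final block returned by difflib always has size 0 and is skipped here.
        if block.size >= min_words:
            words = caption_words[block.a:block.a + block.size]
            segments.append((block.b, words))
    return segments


# Invented example: both word sequences contain errors in different places.
caption = "good evening the president said today that taxes will fall".split()
hypothesis = "good even in the president said today the taxes will fall".split()

for start, words in matching_segments(caption, hypothesis):
    print(start, " ".join(words))
```

In a real pipeline the hypothesis word indices would map back to recognizer time stamps, so that each retained segment identifies the audio to cut out as training data. Short agreements such as the single word "good" are dropped because they are more likely to match by chance.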

A. G. Hauptmann and P. J. Jang, "Learning to Recognize Speech by Watching Television," in IEEE Intelligent Systems, vol. 14, no. , pp. 51-58, 1999.