Incorporating Information From Syllable-Length Time Scales into Automatic Speech Recognition

Title: Incorporating Information From Syllable-Length Time Scales into Automatic Speech Recognition
Publication Type: Technical Report
Year of Publication: 1998
Authors: Wu, S-L.
Other Numbers: 1135
Keywords: combination, human auditory perception, neural network, reverberation, speech recognition, syllabic onsets, syllable

Incorporating the concept of the syllable into speech recognition may improve recognition accuracy through the integration of information over syllable-length time spans. Evidence from psychoacoustics and phonology suggests that humans use the syllable as a basic perceptual unit. Nonetheless, the explicit use of such long-time-span units is comparatively unusual in automatic speech recognition systems for English. The work described in this thesis explored the utility of information collected over syllable-related time scales. The first approach involved integrating syllable segmentation information into the speech recognition process. The addition of acoustically based syllable onset estimates (Shire 1997) resulted in a 10% relative reduction in word-error rate. The second approach began with developing four speech recognition systems based on long-time-span features and units, including modulation spectrogram features (Greenberg & Kingsbury 1997). Error analysis suggested a combination strategy, which led to methods that merged the outputs of the syllable-based recognition systems with those of the phone-oriented baseline system at the frame level, the syllable level, and the whole-utterance level. These combined systems exhibited relative improvements of 20-40% compared to the baseline system for clean and reverberant speech test cases.
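
The abstract reports its results as relative changes, which are easy to confuse with absolute ones. The short sketch below shows how a relative reduction in word-error rate is usually computed. The function name `relative_reduction` and the numeric values are illustrative assumptions; they are not taken from the report, which does not give absolute error rates in this abstract.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the reduction in word-error rate as a fraction of the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical word-error rates in percent, chosen only for illustration.
baseline = 10.0      # assumed baseline system
with_onsets = 9.0    # assumed system with an added information source

# A drop from 10.0 to 9.0 is 1 point absolute but 10% relative.
print(f"{relative_reduction(baseline, with_onsets):.0%}")
```

Under this definition, the same relative figure corresponds to different absolute gains depending on the baseline error rate, so the two should not be compared directly.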

Bibliographic Notes

ICSI Technical Report TR-98-014

Abbreviated Authors

S.-L. Wu

ICSI Publication Type

Technical Report