Su-Lin Wu
ICSI/UC Berkeley
sulin@icsi.berkeley.edu
"Incorporating Information from Syllable-length Time Scales into Automatic Speech Recognition"
Incorporating the concept of the syllable into automatic speech recognition may improve recognition accuracy by helping to integrate information over syllable-length time spans. Evidence from psychoacoustics and phonology suggests that humans use the syllable as a basic perceptual unit in speech processing. Nonetheless, the explicit use of such long-time-span units is comparatively unusual in modern automatic speech recognition systems for English.
The work to be described in this talk explored the utility of
information collected over syllable-based time-scales. The first
approach involved integrating syllable segmentation information into
the speech recognition process. The addition of acoustically estimated
syllabic onsets resulted in a 10% relative reduction in word-error
rate. The second approach began with developing four speech
recognition systems based on long-time-span features and units,
including modulation spectrogram features. Analysis of these systems
suggested a combination strategy, which led to methods that
merged the outputs of the syllable-based recognition systems with the
phone-oriented baseline system at the frame level, the syllable level,
and the whole-utterance level. These combined systems exhibited
relative improvements of 20-40% compared to the baseline system for
clean and reverberant speech test cases.
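
For readers unfamiliar with the "relative reduction" figures quoted above, the following minimal sketch shows how such a figure is conventionally computed from two word-error rates. The function name and the numbers are hypothetical and chosen only for illustration; they are not results from the talk.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fractional reduction in word-error rate relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical word-error rates (in percent), not taken from the talk.
baseline = 10.0
with_onsets = 9.0
print(f"{relative_reduction(baseline, with_onsets):.0%}")  # prints "10%"
```

Under this convention, a drop from 10.0% to 9.0% word-error rate is a 10% relative reduction, even though the absolute change is only one percentage point.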