"What Are the Essential Cues for Speech Intelligibility?"
Steven Greenberg
University of California, Berkeley, California, USA

    Traditional models of speech assume that a detailed auditory analysis of the short-term acoustic spectrum is essential for understanding spoken language. Two perceptual experiments call the validity of this assumption into question.
    In the first study, the spectrum of spoken sentences was partitioned into quarter-octave channels, and the onset of each channel was shifted in time relative to the others so as to desynchronize spectral information across the frequency axis. Human listeners are remarkably tolerant of cross-channel spectral asynchrony induced in this fashion: speech intelligibility remains relatively unimpaired until the average asynchrony spans three or more phonetic segments. Such perceptual robustness is correlated with the magnitude of the low-frequency (3-6 Hz) modulation spectrum and thus highlights the importance of syllabic segmentation and analysis for robust processing of spoken language. High-frequency channels (>1.5 kHz) play a particularly important role when the spectral asynchrony is large enough to significantly reduce the power in the low-frequency modulation spectrum (analogous to acoustic reverberation), and may thereby account for the deterioration of speech intelligibility among the hearing impaired under conditions of acoustic interference (such as background noise and reverberation) characteristic of the real world.
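    The channel partitioning and onset shifting described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure used in the study: the function names, the frequency range passed in, the uniform distribution of onset shifts, and the sample rate are all hypothetical, and the band-pass filtering that would produce each channel's waveform is assumed to have been done elsewhere.

```python
import math
import random


def fractional_octave_edges(f_lo, f_hi, fraction=0.25):
    """Return (low, high) edges in Hz for contiguous bands, each `fraction` octave wide.

    The last band is truncated at f_hi if the range is not a whole number of bands.
    """
    n_bands = math.ceil(math.log2(f_hi / f_lo) / fraction - 1e-9)
    ratio = 2.0 ** fraction
    edges = []
    for k in range(n_bands):
        lo = f_lo * ratio ** k
        hi = min(f_lo * ratio ** (k + 1), f_hi)
        edges.append((lo, hi))
    return edges


def random_onset_shifts(n_channels, max_shift_ms, sample_rate, seed=0):
    """Draw one onset delay (in samples) per channel.

    Assumption: delays are uniform on [0, max_shift_ms]; the source does not
    state the distribution that was actually used.
    """
    rng = random.Random(seed)
    max_samples = int(round(max_shift_ms * sample_rate / 1000.0))
    return [rng.randint(0, max_samples) for _ in range(n_channels)]


def desynchronize(channels, shifts):
    """Delay each (already band-limited) channel by its shift and sum the channels."""
    length = max(len(c) + s for c, s in zip(channels, shifts))
    mixed = [0.0] * length
    for samples, shift in zip(channels, shifts):
        for i, value in enumerate(samples):
            mixed[i + shift] += value
    return mixed


# Hypothetical usage: quarter-octave bands between 1 and 4 kHz (two octaves).
bands = fractional_octave_edges(1000.0, 4000.0, fraction=0.25)
shifts = random_onset_shifts(len(bands), max_shift_ms=100.0, sample_rate=16000)
```

    With real signals, each entry of `channels` would be the output of one band-pass filter, and varying `max_shift_ms` would control the average asynchrony across channels.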
    The second experiment partitioned the spectrum of spoken sentences into 1/3-octave channels ("slits") and measured the intelligibility associated with each channel presented alone and in concert with the others. Four spectral channels, distributed over the speech-audio range (0.3-6 kHz), are sufficient for human listeners to decode sentential material with nearly 90% accuracy, although more than 70% of the spectrum is missing. Word recognition often remains relatively high (60-83%) when just two or three channels are presented concurrently, despite the fact that the intelligibility of these same slits, presented in isolation, is less than 9%. Such data suggest that the intelligibility of spoken language is derived from a compound "image" of the modulation spectrum distributed across the frequency spectrum. Because intelligibility degrades seriously when slits are desynchronized by more than 25 ms, this compound image is probably derived from both the amplitude and phase components of the modulation spectrum. This finding implies that listeners' sensitivity to the modulation phase is generally "masked" by the redundancy contained in full-spectrum speech.
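    As a back-of-the-envelope illustration of why a 25-ms offset could matter for modulation phase, the short sketch below converts a time delay into the phase shift it imposes on a sinusoidal modulation. Pairing the 25-ms figure with the 3-6 Hz modulation range from the first experiment is an assumption made here for illustration only, and the function name is hypothetical.

```python
def modulation_phase_shift_deg(delay_ms, modulation_hz):
    """Phase shift, in degrees, that a time delay imposes on a sinusoidal modulation."""
    return 360.0 * modulation_hz * delay_ms / 1000.0


# Phase shift produced by a 25-ms delay at several low modulation rates.
for rate_hz in (3.0, 4.5, 6.0):
    print(rate_hz, "Hz:", modulation_phase_shift_deg(25.0, rate_hz), "degrees")
```

    Under these assumptions, a 25-ms delay corresponds to a phase shift of 27 degrees at 3 Hz and 54 degrees at 6 Hz.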