Traditional models of speech assume that a detailed
auditory analysis of the short-term acoustic spectrum is essential for
understanding spoken language. Two perceptual experiments call the validity
of this assumption into question.
In the first experiment, the spectrum of spoken sentences
was partitioned into quarter-octave channels and the onset of each channel
shifted in time relative to the others so as to desynchronize spectral
information across the frequency axis. Human listeners are remarkably
tolerant of cross-channel spectral asynchrony induced in this fashion.
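
The desynchronization manipulation described above can be sketched as
follows. This is a minimal illustration under stated assumptions, not the
procedure used in the experiment: it assumes the sentence has already been
band-pass filtered into channels, and it draws each channel's onset shift
uniformly at random, a distribution the text does not specify. The function
names, the 100-6400 Hz range, and the sample-based shifts are hypothetical.

```python
import math
import random


def fractional_octave_edges(f_lo, f_hi, fraction=0.25):
    """Band edges (Hz) spaced `fraction` octaves apart, starting at f_lo.

    Only whole bands that fit below f_hi are kept.
    """
    n_bands = int(math.floor(math.log2(f_hi / f_lo) / fraction + 1e-9))
    return [f_lo * 2 ** (k * fraction) for k in range(n_bands + 1)]


def desynchronize(channels, max_shift, rng):
    """Delay each channel by a random number of samples, then sum the channels.

    `channels` is a list of equal-rate waveforms (lists of floats), one per
    band. Shifts are drawn uniformly from 0..max_shift (an assumption).
    Returns the summed signal and the shift applied to each channel.
    """
    shifts = [rng.randint(0, max_shift) for _ in channels]
    length = max(len(c) for c in channels) + max_shift
    mixed = [0.0] * length
    for chan, shift in zip(channels, shifts):
        for i, x in enumerate(chan):
            mixed[i + shift] += x
    return mixed, shifts


# Hypothetical usage: quarter-octave edges over six octaves, and two toy
# "channels" (single impulses) standing in for band-filtered speech.
edges = fractional_octave_edges(100.0, 6400.0)
channels = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
mixed, shifts = desynchronize(channels, max_shift=5, rng=random.Random(0))
```

Shifting whole channels leaves the content of each channel intact, so only
the timing relation between channels changes.
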
Speech intelligibility remains relatively unimpaired until the average
asynchrony spans three or more phonetic segments. Such perceptual robustness
is correlated with the magnitude of the low-frequency (3-6 Hz) modulation
spectrum and thus highlights the importance of syllabic segmentation and
analysis for robust processing of spoken language. High-frequency channels
(>1.5 kHz) play a particularly important role when the spectral asynchrony
is large enough to reduce significantly the power in the low-frequency
modulation spectrum (an effect analogous to acoustic reverberation), and may
thereby account for the deterioration of speech intelligibility among the
hearing impaired under the conditions of acoustic interference (such as
background noise and reverberation) characteristic of the real world.
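
One way to quantify the low-frequency modulation spectrum referred to above
is to measure how much of the power of a channel's amplitude envelope falls
between 3 and 6 Hz. The sketch below is an illustration only: it assumes the
envelope has already been extracted and downsampled, uses a direct discrete
Fourier transform for brevity, and reports a power fraction, a normalization
chosen here rather than taken from the study. The name `modulation_fraction`
and the test envelopes are hypothetical.

```python
import cmath
import math


def modulation_fraction(envelope, fs, f_lo=3.0, f_hi=6.0):
    """Fraction of non-DC envelope power at modulation rates f_lo..f_hi (Hz).

    `envelope` is a list of amplitude-envelope samples taken at `fs` Hz.
    """
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [x - mean for x in envelope]  # remove the DC component
    in_band = 0.0
    total = 0.0
    for k in range(1, n // 2 + 1):
        freq = k * fs / n  # modulation frequency of DFT bin k
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        total += power
        if f_lo <= freq <= f_hi:
            in_band += power
    return in_band / total if total > 0 else 0.0


# Hypothetical usage: two seconds of envelope sampled at 50 Hz, modulated
# at a syllable-like 4 Hz rate and at a much faster 12 Hz rate.
fs = 50
t = [i / fs for i in range(2 * fs)]
slow = [1.0 + 0.5 * math.sin(2 * math.pi * 4 * x) for x in t]
fast = [1.0 + 0.5 * math.sin(2 * math.pi * 12 * x) for x in t]
slow_fraction = modulation_fraction(slow, fs)
fast_fraction = modulation_fraction(fast, fs)
```

The 4 Hz envelope places nearly all of its modulation power inside the
3-6 Hz band, whereas the 12 Hz envelope places almost none there.
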
The second experiment partitioned the spectrum of
spoken sentences into 1/3-octave channels ("slits") and measured the intelligibility
associated with each channel presented alone and in concert with the others.
Four spectral channels, distributed over the speech-audio range (0.3-6
kHz), are sufficient for human listeners to decode sentential material with
nearly 90% accuracy, although more than 70% of the spectrum is missing.
Word recognition often remains relatively high (60-83%) when just two
or three channels are presented concurrently, despite the fact that the
intelligibility of these same slits, presented in isolation, is less than
9%. Such data suggest that the intelligibility of spoken language
is derived from a compound "image" of the modulation spectrum distributed
across the frequency spectrum. Because intelligibility degrades seriously
when slits are desynchronized by more than 25 ms, this compound image is
probably derived from both the amplitude and phase components of the modulation
spectrum. This in turn implies that listeners' sensitivity to modulation phase
is generally "masked" by the redundancy contained in full-spectrum speech.