Multiple Stream Speech Recognition
This project has two components.
(1) Cortically-inspired speech recognition: Acoustic events such as speech exhibit distinctive spectro-temporal amplitude modulations. Such modulations are not well captured by conventional feature extraction methods, which apply either spectral or temporal processing, but not both jointly.
Recent findings from mammalian auditory cortical receptive field measurements suggest that biological systems are highly sensitive to spectro-temporal modulations. The spectro-temporal receptive fields (STRFs) of cortical cells have been found to resemble 2-D spectro-temporal Gabor filters. In prior work, researchers have used 2-D Gabor filters to extract spectro-temporal features for speech recognition and speech discrimination. However, these studies have employed representations ranging from single streams of task-optimized features to very large multi-dimensional representations of spectro-temporal responses. There is therefore a need to explore, for speech recognition, multiple streams of spectro-temporal features, which may preserve the organizational map of STRFs while avoiding the cumbersome computation that such sizable representations require.
This research aims to develop, evaluate, and incorporate multi-stream spectro-temporal features for robust speech recognition.
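As an illustration of the kind of processing involved, the sketch below builds a small bank of 2-D spectro-temporal Gabor filters and treats each filter's magnitude response to a log-mel spectrogram as a separate feature stream. All filter parameters, sizes, and modulation rates here are illustrative assumptions, not values specified in this proposal.

```python
# Minimal sketch: multiple feature streams from a bank of 2-D
# spectro-temporal Gabor filters. Assumes a precomputed log-mel
# spectrogram `spec` of shape (n_bands, n_frames); all parameter
# values below are illustrative.
import numpy as np
from scipy.signal import fftconvolve

def gabor_2d(omega_s, omega_t, sigma_s=2.0, sigma_t=6.0,
             half_s=8, half_t=24):
    """2-D Gabor kernel: complex exponential under a Gaussian envelope.

    omega_s: spectral modulation frequency (cycles per channel)
    omega_t: temporal modulation frequency (cycles per frame)
    """
    s = np.arange(-half_s, half_s + 1)[:, None]   # spectral axis
    t = np.arange(-half_t, half_t + 1)[None, :]   # temporal axis
    envelope = np.exp(-0.5 * ((s / sigma_s) ** 2 + (t / sigma_t) ** 2))
    carrier = np.exp(2j * np.pi * (omega_s * s + omega_t * t))
    return envelope * carrier

def gabor_streams(spec, filters):
    """One feature stream per filter: magnitude of the 2-D convolution."""
    return [np.abs(fftconvolve(spec, g, mode='same')) for g in filters]

# A small bank spanning a few spectral and temporal modulation rates;
# negative omega_t gives downward-sweeping (vs. upward) patterns.
bank = [gabor_2d(ws, wt) for ws in (0.05, 0.1, 0.2)
                         for wt in (-0.08, 0.0, 0.08)]
spec = np.random.rand(40, 300)        # stand-in for a log-mel spectrogram
streams = gabor_streams(spec, bank)   # nine parallel feature streams
```

Each stream remains a spectrogram-sized map tuned to one region of the spectro-temporal modulation space, so a modest bank of filters can tile that space without expanding into the very large single representations used in prior work.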
(2) Parallel processing for speech recognition: In noisy or reverberant environments, speech recognition requires substantially more processing. When a mobile device is used, the microphone is often far from the user's mouth, which further degrades ASR performance. For instance, in the most recent NIST evaluations, the best word error rate for multi-microphone speech recognition in a conference room was about 40%. That system used beamforming but not the techniques we propose below, which have the potential to reduce this error rate significantly, at the expense of much greater computational power.
A parallel processing approach that could help further is the multi-stream methodology, in which multiple signal representations are used to generate posterior probabilities of speech sound classes, which are then combined and further transformed (Gaussianized and orthogonalized) to produce input features for a statistical speech recognition engine. Multi-layer perceptrons generate the individual posterior probabilities. These methods have been used successfully for 2-15 streams, but we would ultimately like to work with much larger ensembles of feature generators. We will start our work with the Quicknet libraries developed at ICSI, parallelizing them for the target approaches discussed in this proposal. We will then develop code that incorporates these libraries in a system that permits experimentation and ultimately exhibits much greater robustness for speech recognition in moderate noise and reverberation with microphones that are not head-mounted. This work is closely connected with the Berkeley ParLab.
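The sketch below illustrates the combine-then-transform step described above, assuming each stream's MLP has already produced frame-level posteriors of shape (n_frames, n_classes). The averaging rule, the epsilon floor, and the output dimensionality are illustrative assumptions; they are not prescriptions from this proposal or from Quicknet.

```python
# Minimal sketch: combine per-stream MLP posteriors, then Gaussianize
# (log) and orthogonalize (PCA/KLT) to produce features for a
# statistical recognizer. All specific choices here are illustrative.
import numpy as np

def combine_streams(posterior_streams):
    """Average the per-stream posteriors and renormalize each frame."""
    p = np.mean(posterior_streams, axis=0)
    return p / p.sum(axis=1, keepdims=True)

def tandem_features(posteriors, n_components=25, eps=1e-10):
    """Log flattens the skewed posterior distribution; PCA decorrelates."""
    x = np.log(posteriors + eps)
    x -= x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return x @ eigvecs[:, order]                    # decorrelated features

# Example: three hypothetical streams, 100 frames, 40 phone classes.
rng = np.random.default_rng(0)
streams = [rng.dirichlet(np.ones(40), size=100) for _ in range(3)]
feats = tandem_features(combine_streams(streams))   # (100, 25) per frame
```

Because each stream's posteriors are computed independently before the combination step, the per-stream MLP forward passes are naturally parallelizable, which is what makes much larger ensembles of feature generators attractive targets for the parallel hardware considered in this proposal.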
