"Mean Subtraction for Automatic Speech Recognition (ASR) in Reverberation"
www.icsi.berkeley.edu/~gelbart
Speech signals are often subject to convolutional distortions due to the properties of communications and recording equipment and the acoustics of the physical environment. It is common in ASR systems to perform some kind of blind deconvolution using cepstral mean subtraction (CMS) or the related RASTA technique. Working with reverberant test data, I experimentally compare the performance of CMS to two other approaches that share the principle used in CMS of blind deconvolution by mean subtraction in a logarithmic spectral or cepstral domain: Neumeyer et al.'s log-DFT mean normalization (LDMN) and Avendano et al.'s long-term log spectral subtraction (LTLSS). LDMN avoids the assumption that the frequency response of the distortion is constant within each mel spectral band by performing the mean subtraction before the mel integration. LTLSS performs the mean subtraction on a spectral representation calculated using windowed DFTs in which the window is much longer than is usual for ASR feature extraction; the longer window is motivated by the long temporal extent of reverberation.
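The principle shared by these methods can be illustrated with a minimal sketch. It assumes the distortion acts as a fixed gain per frequency bin in every frame, which holds only when the channel response is short relative to the analysis window. The function name log_mean_subtract and the toy values are hypothetical and are not taken from CMS, LDMN, or LTLSS as published.

```python
import math

def log_mean_subtract(mag_frames):
    """Subtract the per-bin time average of the log magnitude from every frame."""
    n_frames = len(mag_frames)
    n_bins = len(mag_frames[0])
    log_frames = [[math.log(v) for v in frame] for frame in mag_frames]
    # Mean over time, computed separately for each frequency bin.
    means = [sum(f[k] for f in log_frames) / n_frames for k in range(n_bins)]
    return [[f[k] - means[k] for k in range(n_bins)] for f in log_frames]

# Toy magnitude spectra: 3 frames x 2 bins (made-up values).
clean = [[1.0, 2.0], [4.0, 0.5], [2.0, 8.0]]
# Hypothetical fixed channel gain per bin: multiplicative in magnitude,
# hence an additive constant in the log domain.
channel = [0.5, 3.0]
distorted = [[v * h for v, h in zip(frame, channel)] for frame in clean]

norm_clean = log_mean_subtract(clean)
norm_distorted = log_mean_subtract(distorted)

# Largest difference between the two normalized representations.
max_diff = max(
    abs(a - b)
    for fa, fb in zip(norm_clean, norm_distorted)
    for a, b in zip(fa, fb)
)
print(max_diff)
```

Under this assumption the normalized clean and distorted spectra coincide up to floating-point error. Reverberation whose impulse response extends well beyond the analysis window violates the assumption, which is the motivation given above for the much longer windows in LTLSS.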
Attempting to settle, through ASR accuracy tests, the question of whether LTLSS gives improved deconvolution performance in reverberant environments is greatly complicated by two confounding factors. One is that the logarithm gives spectral values near 0 an exaggerated effect on the mean, an effect to which shorter window lengths are more sensitive. The other is that LTLSS introduces noise artifacts, which ironically might increase performance by degrading non-reverberant, low-noise training data so that it better prepares the recognizer to face the distortion of reverberation. I attempt to address these confounding factors by removing low-energy values from the mean calculation when using shorter window lengths, trying noisy training data, and removing noise artifacts from LTLSS output using a method based on time-frequency masking (http://www.icsi.berkeley.edu/Speech/papers/gelbart-ms/mask).
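The first confounding factor can be seen in a small numerical sketch. The thresholding rule used here (dropping values below a fraction of the largest value in the bin) is an assumption chosen for illustration, not necessarily the rule used in the experiments; the name log_mean and the parameter rel_floor are hypothetical.

```python
import math

def log_mean(values, rel_floor=None):
    """Mean of log(values); optionally skip values below rel_floor * max(values)."""
    if rel_floor is not None:
        threshold = rel_floor * max(values)
        values = [v for v in values if v >= threshold]
    return sum(math.log(v) for v in values) / len(values)

# Magnitudes of one frequency bin over five frames (made-up values);
# the fourth frame is almost silent.
bin_track = [1.0, 1.2, 0.9, 1e-8, 1.1]

plain = log_mean(bin_track)                    # dominated by the near-zero frame
trimmed = log_mean(bin_track, rel_floor=1e-3)  # near-zero frame excluded
print(plain, trimmed)
```

A single near-zero frame pulls the plain log mean strongly negative, while the trimmed mean stays close to the level of the remaining frames.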