Project Ouch - Outing Unfortunate Characteristics of HMMs (Used for Speech Recognition)

Principal Investigator(s): 
Nelson Morgan, Steven Wegmann, Jordan Cohen

Project OUCH has been completed, and the final report is available here.

The central idea behind this project is that if we want to improve recognition performance through acoustic modeling, then we should first quantify how the current best model, the hidden Markov model (HMM), fails to adequately model speech data and how these failures impact recognition accuracy. We are undertaking a diagnostic analysis that is an essential component of statistical modeling but, for various reasons, has been largely ignored in the field of speech recognition. In particular, we believe that previous attempts to improve upon the HMM have largely failed because this diagnostic information was not readily available. In our initial research, we are using simulation and a novel sampling process to generate pseudo test data that deviate from the HMM in a controlled fashion. These processes allow us to generate pseudo data that, at one extreme, agree with all of the model's assumptions, and at the other extreme, deviate from the model in exactly the way real data do. In between, we precisely control the degree of data/model mismatch. By measuring recognition performance on this pseudo test data, we are able to quantify the effect of this controlled data/model residual on recognition accuracy.

When applying HMMs to the problem of automatic speech recognition, there are two main assumptions that we make. The first assumption is the choice of the parametric models that we use for the HMM's output distributions. This is almost always assumed to be multivariate normal with diagonal covariance. The second assumption is the statistical independence of frames. More precisely, we assume that successive frames "generated" by a given state are independent and, moreover, that frames generated in one state are independent of those generated in a different state. While both of these assumptions are understood to be false for speech data, it is reasonable to ask what impact each of these two assumptions has on recognition accuracy and, in particular, whether one assumption dominates recognition errors. This type of diagnostic information should be a critical first step towards improving upon or replacing the HMM for speech recognition.
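
The two assumptions can be written compactly as follows. The notation is introduced here purely for illustration and does not come from the papers cited below: x_t is the D-dimensional frame at time t, q_t is the state that generated it, and each state s is assumed to have a single Gaussian with mean mu_s and per-dimension variances sigma_s^2 (systems in practice often use mixtures of such Gaussians).

```latex
% Independence assumption: given the state sequence, frames factor over time.
p(x_1, \dots, x_T \mid q_1, \dots, q_T) = \prod_{t=1}^{T} p(x_t \mid q_t)

% Parametric assumption: each state's output distribution is multivariate
% normal with diagonal covariance, so it factors over feature dimensions.
p(x_t \mid q_t = s) = \prod_{d=1}^{D} \mathcal{N}\!\left(x_{t,d};\, \mu_{s,d},\, \sigma_{s,d}^{2}\right)
```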

The diagnostic research described in Gillick 2011 uses simulation and a novel sampling process to generate pseudo test data that deviate from the HMM in a controlled fashion. These processes allow us to generate pseudo data that, at one extreme, agree with all of the model's assumptions, and at the other extreme, deviate from the model in exactly the way real data do. In between, we can precisely control the degree of data/model mismatch. By measuring recognition performance on this pseudo test data, we are able to quantify the effect of this controlled data/model residual on recognition accuracy.

The novel sampling process, called resampling, was adapted from Bradley Efron's work on the bootstrap. In essence, resampling is a non-parametric analog of simulating data from a known parametric distribution. We have at our disposal an i.i.d. sample from an unknown population distribution. Instead of fitting a parametric model to this sample, which would result in a parametric approximation to the unknown population distribution, we take the sample itself as the approximation (via the empirical distribution derived from the sample). To simulate using this empirical distribution, we simply do random draws (with replacement) from the sample, hence the terminology resampling.
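
The basic operation can be sketched in a few lines of Python. This is only an illustration of drawing with replacement from an empirical distribution; the function name `resample` and the toy numbers are hypothetical and are not taken from our software.

```python
import random


def resample(sample, size, rng):
    """Draw `size` values uniformly, with replacement, from `sample`.

    This treats the observed sample as the empirical distribution and
    simulates from it, instead of fitting a parametric model first.
    """
    return rng.choices(sample, k=size)


# Toy one-dimensional "observations" (hypothetical values).
observed = [2.3, 1.7, 3.1, 2.8, 1.9]

rng = random.Random(0)  # seeded so that the draw is reproducible
pseudo_sample = resample(observed, size=8, rng=rng)
print(pseudo_sample)
```

Because the draws are made with replacement, the pseudo sample can be larger than the original sample and can repeat individual observations.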

In earlier work (Wegmann 2010) we used resampling on the WSJ corpus to fabricate test data that had the same marginal distribution as real data but agreed with the HMM's independence assumptions. (By marginal distribution we mean the distribution of the collection of frames assigned to each HMM state, i.e. we are marginalizing with respect to time.) We also fabricated test data that agreed with all of the HMM's assumptions by simulating from the HMM. We found that both sets of fabricated test data had nearly zero WERs - especially when compared to the WER of real test data, which was about 18%. This suggested that real data's disagreement with the HMM's parametric output distributions was a relatively minor issue compared to the data's disagreement with the HMM's independence assumptions, and in Wegmann 2010 we presented further evidence to support this hypothesis based on somewhat ad hoc methodology.
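
The two ways of fabricating test data described above can be contrasted in a minimal Python sketch. It assumes that a frame-to-state alignment is already available and that each state has a single diagonal Gaussian; the function names, the toy frames, and the model parameters are all hypothetical and do not reproduce the actual experimental pipeline.

```python
import random
from collections import defaultdict


def pool_frames_by_state(frames, alignment):
    """Collect all real frames aligned to each state (the marginal sample)."""
    pools = defaultdict(list)
    for frame, state in zip(frames, alignment):
        pools[state].append(frame)
    return dict(pools)


def resample_frame_level(alignment, pools, rng):
    """Frame level resampling: for each position, draw one real frame, with
    replacement, from the pool of the state at that position. The marginal
    distribution is kept; dependence between frames is destroyed."""
    return [rng.choice(pools[state]) for state in alignment]


def simulate_from_hmm(alignment, means, stddevs, rng):
    """Simulation: draw each frame from the state's diagonal Gaussian, one
    independent normal draw per feature dimension."""
    return [
        [rng.gauss(m, s) for m, s in zip(means[state], stddevs[state])]
        for state in alignment
    ]


# Toy two-dimensional frames and their state alignment (hypothetical).
frames = [[0.0, 1.0], [0.2, 0.9], [2.0, -1.0], [2.1, -1.2]]
alignment = ["s1", "s1", "s2", "s2"]

# Hypothetical per-state Gaussian parameters (mean and standard deviation).
means = {"s1": [0.1, 0.95], "s2": [2.05, -1.1]}
stddevs = {"s1": [0.1, 0.05], "s2": [0.05, 0.1]}

rng = random.Random(0)
pools = pool_frames_by_state(frames, alignment)
resampled = resample_frame_level(alignment, pools, rng)
simulated = simulate_from_hmm(alignment, means, stddevs, rng)
```

Both fabricated utterances satisfy the independence assumptions by construction; only the simulated one also satisfies the parametric assumption about the output distributions.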

In Gillick 2011, we extended resampling to operate not just at the frame level, as in Wegmann 2010, but at the segment level, where by segment we mean a sequence of frames aligned to a particular state, a triphone, or a word. Fabricated test data created by using segment level resampling has the following properties:

1. The marginal distribution of the data is the same as real data. 

2. Between segments, the data is statistically independent.

3. Within segments, the data inherits the statistical dependence present in real data.

In Gillick 2011 we compare recognition accuracy on test data fabricated using simulation from the HMM, resampling at the frame, state, triphone and word level, and real test data on the WSJ and SWB corpora. On both corpora the overall take-away from these comparisons is identical. First, compare the WER on real test data to the WER on test data fabricated by frame level resampling: here we preserve the marginal distribution of the data, but resampling forces the fabricated data to satisfy the independence assumptions, and the WER drops by 90% (relative). As we gradually re-introduce statistical dependence, which is inherited from real data, by fabricating test data using state, triphone, and word level resampling, the WER steadily increases. The largest jump in WER occurs when we re-introduce within state dependence (using state level resampling - in this case, the WER jumps by a factor of 6), but between state and between triphone dependence also induce surprisingly large WER increases (the factor is on the order of 1.5 to 2). Also, the WER on test data fabricated using word level resampling is nearly identical to the WER on real test data. Thus we conclude that long range statistical dependence that is present in speech data and at variance with the HMM's assumptions is the single largest source of recognition errors, dwarfing the errors due to the data violating our choice of output distributions.
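
Segment level resampling can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the code used for the experiments: frames are toy scalars, a segment is taken to be a run of consecutive frames carrying the same label (so two adjacent instances of the same unit would be merged), and all names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import groupby


def split_into_segments(frames, labels):
    """Cut an utterance into (label, frames) runs of consecutive equal labels.

    The labels may be state, triphone, or word identities, which selects the
    level at which dependence is preserved.
    """
    segments = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, frames[position:position + length]))
        position += length
    return segments


def pool_segments(segments):
    """Collect all real segments observed for each label."""
    pools = defaultdict(list)
    for label, segment in segments:
        pools[label].append(segment)
    return dict(pools)


def resample_segment_level(label_sequence, pools, rng):
    """For each label, draw one whole real segment with replacement.

    Frames inside a drawn segment keep their real ordering (dependence is
    inherited); different positions are drawn independently.
    """
    return [rng.choice(pools[label]) for label in label_sequence]


# Toy data: scalar frames with a per-frame label (hypothetical).
frames = [1, 2, 3, 10, 11, 4, 5, 12]
labels = ["a", "a", "a", "b", "b", "a", "a", "b"]

segments = split_into_segments(frames, labels)
pools = pool_segments(segments)
label_sequence = [label for label, _ in segments]

rng = random.Random(0)
drawn = resample_segment_level(label_sequence, pools, rng)
fabricated = [frame for segment in drawn for frame in segment]
```

Note that in this sketch a drawn segment may be longer or shorter than the one it replaces, so the fabricated utterance need not have the same number of frames as the original.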

In Gillick 2012 we present another application of diagnostic analysis using resampling. Maximum likelihood estimation for HMMs has the following desirable property: if the data satisfies the assumptions of the model, then as the amount of training data goes to infinity, the global parameter estimate will be optimal in that it is asymptotically unbiased with minimum variance. In practice, however, training sets are limited in size, EM only guarantees convergence to a local, rather than global optimum, and actual speech data clearly violates the model assumptions. As a result, a variety of other estimation procedures can yield parameters that give better performance. In particular, discriminative training schemes like MMI and more recently MPE have shown significant improvement over maximum likelihood estimation. Is there a meaningful qualitative description of how the discriminatively trained parameters differ from the maximum likelihood parameters? To the best of our knowledge, there has been no empirical investigation of this matter for speech recognition. In Gillick 2012 we present a series of experiments to demonstrate that the standard discriminative training procedures do not improve the models of the states' output distributions; somewhat surprisingly, they appear to compensate for the incorrect assumptions of independence, even beyond the state level. These experiments compare the performance of models trained using MPE on real test data and test data fabricated using frame, state, triphone, and word level resampling.

We demonstrate that:

1. MPE does not improve on maximum likelihood in modeling state output distributions, and

2. MPE appears to compensate for statistical dependence, in particular within states.

While discriminative training may also improve on ML in other ways, this is a fairly indirect method for handling temporal dynamics. If we can understand a little more about how MPE adjusts for dependence - perhaps by normalizing phoneme or state level scores - we might be able to benefit by modeling dependence directly. Such insight would be especially valuable given the halting progress of segmental modeling techniques that aim to model such dependence.

We believe that the papers Wegmann 2010, Gillick 2011, and Gillick 2012 are interesting not just because of the surprising experimental results that they contain, but also because they should help to popularize simulation as a powerful and flexible methodology for gaining a deeper understanding of the properties of speech recognition algorithms. Furthermore, we believe that understanding how real data are at variance with the models that we use for speech recognition is a largely unexplored but fertile area for future research.

Finally, we are releasing our software libraries under the BSD open source software license (see PLASTR). This will allow other researchers to duplicate and extend our results. These libraries also include our Python wrapper for the HTK training and testing tools. This wrapper should significantly lower the barrier that currently prevents all but the largest and most experienced speech recognition laboratories from creating state of the art large vocabulary speech recognition systems.

Wegmann 2010 S. Wegmann and L. Gillick (2010). "Why has (reasonably accurate) Automatic Speech Recognition been so hard to achieve?," arXiv.org, arXiv:1003.0206 [cs.CL].

Gillick 2011 D. Gillick, L. Gillick, and S. Wegmann (2011). "Don’t Multiply Lightly: Quantifying Problems with the Acoustic Model Assumptions in Speech Recognition," in Proc. of ASRU 2011, 71-76.

Gillick 2012 D. Gillick, L. Gillick, and S. Wegmann (2012). "Discriminative Training for Speech Recognition is Compensating for Statistical Dependence in the HMM Framework," to appear in Proc. of ICASSP 2012.

PLASTR D. Gillick and S. Wegmann (2011). "PLASTR: a Python Library for Automatic Speech recognition Training and Recognition," http://code.google.com/p/pyhtk/.

Funding provided by IARPA.