"Confusability"
printz@us.ibm.com
A language model is a function that returns the probability that any given sequence of words will appear in a very large corpus of naturally generated text. Such models lie at the heart of statistical speech recognition and machine translation systems.
One common way of constructing a language model is to define it in terms of a collection of parameters, and then adjust those parameters to maximize the probability that the model assigns to a training corpus. This is an instance of maximum likelihood modeling; it is equivalent to minimizing the model's perplexity on the given corpus. But it is well known that the performance of speech recognition systems is not well correlated with language model perplexity, hereafter called lexical perplexity. In particular, it can easily happen that some new insight or technique lowers the lexical perplexity but raises the word error rate.
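The equivalence between maximum likelihood and minimum perplexity can be seen in a few lines. The following is a minimal sketch, assuming a toy unigram model; the corpus, the model, and the function names are illustrative and are not taken from the talk.

```python
import math
from collections import Counter

def perplexity(model, corpus):
    """Perplexity of a unigram model on a corpus: exp(-(1/N) * sum of log p(w))."""
    log_prob = sum(math.log(model[w]) for w in corpus)
    return math.exp(-log_prob / len(corpus))

def max_likelihood_unigram(corpus):
    """Relative-frequency estimate, which is the maximum likelihood unigram model."""
    counts = Counter(corpus)
    return {w: c / len(corpus) for w, c in counts.items()}

# Toy training corpus (hypothetical data).
corpus = "austin boston austin dallas".split()

ml_model = max_likelihood_unigram(corpus)
uniform_model = {w: 1 / 3 for w in set(corpus)}

# The model with the higher corpus likelihood has the lower perplexity.
ml_ppl = perplexity(ml_model, corpus)
uniform_ppl = perplexity(uniform_model, corpus)
```

Because perplexity is a decreasing function of the corpus likelihood, any parameter setting that raises the likelihood lowers the perplexity, and the two criteria pick out the same model.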
In this talk, we argue that this conundrum arises from designing the model in isolation from the channel with which it will be used. Essentially, we propose that language models should be built to help with the hard parts of speech recognition, or of whatever source-channel decoding task is at hand. After all, it's not that hard to tell "nostril" from "rutabaga." But when you can separate "Austin" from "Boston," you know you're doing well.
In the first part, we analyze the operation of a language model in a source-channel decoding scheme. We define acoustic perplexity, a statistic that incorporates the characteristics of both the source (language model) and the channel (acoustic model). We show how this notion depends upon a still more fundamental expression, the acoustic confusability of word pairs.
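As a rough illustration of how such a statistic might combine source and channel, the sketch below assumes one plausible formulation: the language model probability of each word is reweighted, via Bayes' rule, by a confusability score between that word and every competing word, and the perplexity is computed from the resulting posterior. The abstract does not state the formula, so this form, the context-independent language model, and all numbers and names are assumptions made for illustration.

```python
import math

def acoustic_posterior(word, lm_probs, confusability):
    """P(word | context, acoustics of word) under an assumed Bayes-rule form.

    lm_probs[x]           : language model probability of x in this context
    confusability[(w, x)] : assumed score for how well the acoustic model of x
                            explains a pronunciation of w (hypothetical values)
    """
    numerator = confusability[(word, word)] * lm_probs[word]
    denominator = sum(confusability[(word, x)] * p for x, p in lm_probs.items())
    return numerator / denominator

def acoustic_perplexity(corpus, lm_probs, confusability):
    """exp of the negative mean log posterior over the corpus."""
    log_post = sum(
        math.log(acoustic_posterior(w, lm_probs, confusability)) for w in corpus
    )
    return math.exp(-log_post / len(corpus))

vocab = ["austin", "boston", "rutabaga"]
lm_probs = {w: 1 / 3 for w in vocab}  # toy, context-independent language model

# Hypothetical scores: "austin" and "boston" are easily confused, "rutabaga" is not.
confusability = {(w, x): 0.001 for w in vocab for x in vocab}
for w in vocab:
    confusability[(w, w)] = 1.0
confusability[("austin", "boston")] = 0.6
confusability[("boston", "austin")] = 0.6

p_austin = acoustic_posterior("austin", lm_probs, confusability)
p_rutabaga = acoustic_posterior("rutabaga", lm_probs, confusability)

corpus = ["austin", "boston", "rutabaga"]
acoustic_ppl = acoustic_perplexity(corpus, lm_probs, confusability)
lexical_ppl = math.exp(-sum(math.log(lm_probs[w]) for w in corpus) / len(corpus))
```

In this toy setting the posterior for "rutabaga" is close to 1 while the posterior for "austin" is noticeably lower, which is the sense in which the statistic concentrates on the confusable words.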
In the second part, we present an algorithm for computing acoustic confusability, which can be applied directly to the well-known hidden Markov model paradigm, and which encompasses ALL possible paths through such models, of ALL possible lengths. From these confusability numbers, we show how to obtain both the acoustic perplexity and another new measure of goodness for language models: the synthetic acoustic word error rate. We present experimental evidence that these measures correlate better with word error rate than lexical perplexity does.
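To make the idea of a synthetic error rate concrete, here is a minimal sketch under stated assumptions: the confusability values are taken as given (the hidden Markov model computation is not reproduced), each position is decoded by picking the word that maximizes confusability times language model probability, and the history is always the true preceding word. The decoding rule, the toy bigram tables, and all names are hypothetical.

```python
def synthetic_word_error_rate(corpus, lm, confusability):
    """Fraction of positions where the best-scoring word is not the true word.

    lm[history][x]        : language model probability of x after `history`
    confusability[(w, x)] : assumed score for mistaking w for x (hypothetical)
    """
    errors = 0
    history = "<s>"
    for word in corpus:
        probs = lm[history]
        # Decode using synthetic acoustics of the true word, not real audio.
        decoded = max(probs, key=lambda x: confusability[(word, x)] * probs[x])
        errors += decoded != word
        history = word  # condition on the true history (an assumption)
    return errors / len(corpus)

confusability = {
    ("austin", "austin"): 1.0, ("austin", "boston"): 0.6,
    ("boston", "boston"): 1.0, ("boston", "austin"): 0.6,
}

# Two toy bigram models that differ only in the sentence-initial distribution.
even = {"austin": 0.5, "boston": 0.5}
lm_skewed = {"<s>": {"austin": 0.3, "boston": 0.7}, "austin": even, "boston": even}
lm_even = {"<s>": even, "austin": even, "boston": even}

corpus = ["austin", "boston"]
wer_skewed = synthetic_word_error_rate(corpus, lm_skewed, confusability)
wer_even = synthetic_word_error_rate(corpus, lm_even, confusability)
```

With the skewed model, the prior in favor of "boston" outweighs the acoustic evidence for "austin" at the first position, so one of the two words is decoded incorrectly; the even model decodes both correctly.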
In the third part, we show how a language model may be trained directly to minimize the synthetic acoustic word error rate, and report recognizer performance for a simple language model trained in this way. We finish by discussing other applications of these ideas, notably to such varied areas as vocabulary definition, feature selection for maximum entropy models, and statistical machine translation.
This is joint work with Peder Olsen.