The algorithm described here has been implemented, and is being used
to generate bigrams for a speech recognizer that is part of the
BeRP spoken-language system [Jurafsky et al. 1994].
An early prototype of BeRP was used in an experiment to
assess the benefit of using bigram probabilities obtained through
SCFGs versus estimating them directly from the available training corpus.
The system's domain consists of inquiries about restaurants in the city
of Berkeley.
The training corpus used had only 2500 sentences, with an average
length of about 4.8 words/sentence.
Our experiments made use of a
context-free grammar hand-written for the BeRP domain.
With 1200 rules and a vocabulary of 1100 words, this grammar was able
to parse 60% of the training corpus.
Computing the bigram probabilities from this SCFG
takes about 24 hours on a SPARCstation 2-class machine.
In experiment 1, the recognizer used bigrams that were estimated directly from the training corpus, without any smoothing, resulting in a word error rate of 35.1%. In experiment 2, a different set of bigram probabilities was used, computed from the context-free grammar, whose rule probabilities had previously been estimated from the same training corpus using standard EM techniques. This resulted in a word error rate of 35.3%. This may seem surprisingly good given the low coverage of the underlying CFG, but note that the conversion into bigrams is bound to result in a less constraining language model, effectively increasing coverage.
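The word error rates quoted here are not accompanied by a description of the scoring procedure. As a rough illustration only, the following sketch computes word error rate under the conventional definition (word-level substitutions, deletions, and insertions divided by the number of reference words). The function name and the whitespace tokenization are assumptions made for this example, not details of the BeRP evaluation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Hypothetical helper: tokenization by whitespace is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # reference word deleted
                curr[j - 1] + 1,                  # hypothesis word inserted
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A corpus-level rate would normally sum the edit counts over all test sentences and divide by the total number of reference words, rather than average the per-sentence rates.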
Finally, in experiment 3, the bigrams generated from the SCFG
were augmented by those from the raw training data, in a proportion of
200,000 : 2500.
We have not attempted to optimize this mixture proportion,
e.g., by deleted interpolation [Jelinek and Mercer 1980].
With the bigram estimates thus obtained, the word error rate dropped to
33.5%. (All error rates were measured on a separate test corpus.)
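The text does not spell out how the 200,000 : 2500 proportion was applied. One plausible reading, sketched below, is to treat the SCFG-derived expected bigram counts per sentence as if they had been observed in 200,000 sentences, add the raw counts from the 2500 training sentences, and renormalize. All function names, the `<s>`/`</s>` boundary markers, and the count-based mixing scheme itself are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_bigrams(sentences):
    """Raw bigram counts from tokenized sentences, with boundary markers."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>"] + list(words) + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


def mix_counts(expected_per_sentence, corpus_counts, virtual_sentences):
    """Add corpus counts to model expectations scaled to a virtual corpus.

    expected_per_sentence maps a bigram to its expected number of
    occurrences in one sentence under the grammar (assumed to be given).
    """
    mixed = Counter()
    for bigram, expectation in expected_per_sentence.items():
        mixed[bigram] += virtual_sentences * expectation
    for bigram, count in corpus_counts.items():
        mixed[bigram] += count
    return mixed


def normalize(counts):
    """Turn bigram counts into conditional probabilities P(w2 | w1)."""
    totals = defaultdict(float)
    for (w1, _), count in counts.items():
        totals[w1] += count
    return {(w1, w2): count / totals[w1] for (w1, w2), count in counts.items()}
```

With the figures from the text, the call would be `mix_counts(expected, count_bigrams(corpus), 200_000)` for a 2500-sentence `corpus`. A bigram that never occurs in the training data but has a nonzero expectation under the grammar then receives a nonzero probability, which is the effect the mixture is meant to achieve.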
The experiment therefore supports our earlier argument that more sophisticated language models, even if far from perfect, can improve n-gram estimates obtained directly from sample data.