Frank Seide
Microsoft Research Asia
Tuesday, October 4, 2011
12:30pm - 1:30pm
We apply the recently proposed Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to speech-to-text transcription. For single-pass speaker-independent recognition on the RT03S Fisher portion of the phone-call transcription benchmark (Switchboard), the word-error rate is reduced from 27.4 percent, obtained by discriminatively trained Gaussian-mixture HMMs, to 18.5 percent, a 33 percent relative improvement.