"A Computational Hidden Dynamic Model of Speech Coarticulation"
John S Bridle & Hywel B Richards
Dragon Systems UK, Cheltenham, United Kingdom

    We are all familiar with the idea that behind the directly measurable acoustic patterns of speech there are moving parts (whether these parts are formants, area functions, articulators or muscle groups) and that these parts are somehow under the influence of linguistically relevant symbols that occur in sequence. Our goal is a model of speech patterns that uses a small set of slowly moving parameters, and in so doing deals naturally with at least simple transitions and coarticulation. Potential application areas include speech synthesis and automatic speech recognition.
    We present a computational model of the relationship between phone sequences and spectrograms that involves an intermediate representation roughly equivalent to formant frequencies or articulator positions. Each phone type is characterized by one or more target vectors in this "latent phonetic space", plus time-constants that control simple low-pass linear dynamics applied to the target sequence to produce a "hidden dynamic state". The relationship between this state and the log power spectrum is an instantaneous nonlinear function modeled by an MLP ("multi-layer perceptron"). All the parameters of the system (MLP weights, targets and time-constants) can be learnt by optimizing the fit of synthetic patterns (the model's output given an aligned phonetic transcription) to real speech spectrograms. The gradient of the squared error with respect to all the parameters can be computed quite easily, and gradient descent methods can then be applied.
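    The forward path described above (targets, low-pass dynamics, MLP, squared error) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the first-order filter form, the use of one target and one time-constant per phone, the tanh hidden layer, and all names (smooth_targets, mlp, synthesize, phone_table) are hypothetical choices made for the example.

```python
import math
import random

def smooth_targets(targets, time_constants, initial_state):
    """Assumed first-order low-pass filter: x[t] = a*x[t-1] + (1-a)*target[t]."""
    state = list(initial_state)
    trajectory = []
    for target, tau in zip(targets, time_constants):
        a = math.exp(-1.0 / tau)  # per-frame decay; tau is measured in frames
        state = [a * x + (1.0 - a) * g for x, g in zip(state, target)]
        trajectory.append(state)
    return trajectory

def mlp(state, w_hidden, b_hidden, w_out, b_out):
    """One hidden tanh layer mapping a hidden state to a log power spectrum."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, state)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w_out, b_out)]

def synthesize(phone_labels, phone_table, net, initial_state):
    """Map a frame-aligned phone sequence to (hidden trajectory, spectrogram)."""
    targets = [phone_table[p][0] for p in phone_labels]
    taus = [phone_table[p][1] for p in phone_labels]
    trajectory = smooth_targets(targets, taus, initial_state)
    return trajectory, [mlp(s, *net) for s in trajectory]

def squared_error(synthetic, observed):
    """Sum of squared differences over all frames and spectral channels."""
    return sum((s - o) ** 2
               for fs, fo in zip(synthetic, observed)
               for s, o in zip(fs, fo))

# Toy setup (all values invented): 2-D hidden space, 4 hidden units, 3 channels.
random.seed(0)
HIDDEN_DIM, N_UNITS, N_CHANNELS = 2, 4, 3

def small_random(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

phone_table = {"a": ([1.0, 0.0], 3.0), "i": ([0.0, 1.0], 3.0)}  # (target, tau)
net = (small_random(N_UNITS, HIDDEN_DIM), [0.0] * N_UNITS,
       small_random(N_CHANNELS, N_UNITS), [0.0] * N_CHANNELS)
labels = ["a"] * 5 + ["i"] * 5  # aligned transcription, one label per frame
trajectory, spectrogram = synthesize(labels, phone_table, net, [0.0, 0.0])
```

    In this toy run the hidden state moves gradually toward each target rather than jumping at the phone boundary, which is the sense in which such a model handles simple transitions. Training would additionally require the gradient of squared_error with respect to the targets, time-constants and weights, which is omitted here.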
    We find that the system is able to learn interesting representations quite easily. For some purposes a few minutes of speech are sufficient. There is no need for fancy initialization methods (zeros and small random numbers work), but it is certainly possible to initialize or fix some parts of the system to make it conform to particular ideas. For example, a system with a six-dimensional hidden space can have three of its dimensions fixed to be formant frequencies according to a favorite synthesis-by-rule system, and the other dimensions will "take up the slack". Or the MLP can be fixed to be an approximation to a known relationship between articulator positions and spectrum shapes, and the system will learn phonetic targets and dynamics.
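    One simple way to fix part of the system during gradient descent is to mask the update so that frozen parameters are never changed. The sketch below is an assumption about how this could be done, not a description of the method used here; the function name gradient_step, the target values and the gradient values are all invented for illustration.

```python
def gradient_step(params, grads, trainable, learning_rate):
    """Plain gradient descent that leaves parameters marked non-trainable untouched."""
    return [p - learning_rate * g if free else p
            for p, g, free in zip(params, grads, trainable)]

# Hypothetical six-dimensional target for one phone: the first three dimensions
# stand for fixed formant frequencies (made-up values), the last three start at
# zero and are free to learn.
target = [0.5, 1.5, 2.5, 0.0, 0.0, 0.0]
grads = [0.1] * 6                      # placeholder gradient values
trainable = [False] * 3 + [True] * 3   # freeze the formant dimensions
updated = gradient_step(target, grads, trainable, learning_rate=0.5)
```

    The same mask could be applied to the MLP weights instead of the targets to keep a known articulatory-to-spectrum mapping fixed while targets and time-constants are learnt.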
    We currently use only a very simple idea of phonetic segments, but an extension could deal in features, perhaps with some asynchrony.
    This work can be seen as a special case of an approach first described by Raimo Bakis in 1991. The work of Li Deng is also relevant. The most recent development took place in a WS98 project at CLSP, Johns Hopkins University, in the summer of 1998.