We are all familiar with the idea that behind the
directly measurable acoustic patterns of speech there are moving parts
(whether these parts are formants, area functions, articulators or muscle groups)
and that these parts are somehow under the influence of linguistically relevant
symbols that occur in sequence. Our goal is a model of speech patterns
that uses a small set of slowly moving parameters, and in so doing deals
naturally with at least simple transitions and coarticulation. Potential
application areas include speech synthesis and automatic speech recognition.
We present a computational model of the relationship
between phone sequences and spectrograms that involves an intermediate
representation roughly equivalent to formant frequencies or articulator
positions. Each phone type is characterized by one or more target vectors
in this "latent phonetic space", plus time-constants that control simple
low-pass linear dynamics applied to the target sequence to produce a "hidden
dynamic state". The relationship between this state and the log power spectrum
is an instantaneous nonlinear function modeled by an MLP ("multi-layer
perceptron"). All the parameters of the system (MLP weights, targets
and time-constants) can be learnt by optimizing the fit of synthetic patterns
(the model's output given an aligned phonetic transcription) to real speech
spectrograms. The first derivative of the squared error with respect to
all the parameters can be computed quite easily, and gradient descent methods
can then be applied.
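
As an illustration only, the forward pass described above (targets, low-pass dynamics, MLP, squared error) can be sketched in NumPy. The first-order recursion, the mapping from time-constants to smoothing coefficients, the single tanh hidden layer, the array sizes and all names (`smooth_targets`, `mlp_spectrum`, `squared_error`, `phone_tau`) are assumptions made for this sketch and are not taken from the system itself.

```python
import numpy as np

def smooth_targets(targets, alpha):
    """First-order low-pass filter over a frame-level target sequence.

    targets: (T, D) array, one target vector per frame
    alpha:   (T, D) or (T, 1) smoothing coefficients in [0, 1)
    Returns the (T, D) "hidden dynamic state".
    """
    state = np.zeros_like(targets, dtype=float)
    prev = targets[0].astype(float)  # assumption: start at the first target
    for t in range(len(targets)):
        prev = alpha[t] * prev + (1.0 - alpha[t]) * targets[t]
        state[t] = prev
    return state

def mlp_spectrum(state, W1, b1, W2, b2):
    """Instantaneous nonlinear map from hidden state to log power spectrum."""
    hidden = np.tanh(state @ W1 + b1)
    return hidden @ W2 + b2

def squared_error(predicted, observed):
    """Sum of squared differences between synthetic and real spectrogram."""
    return 0.5 * np.sum((predicted - observed) ** 2)

# Hypothetical toy setup: 2 phone types, 3 latent dimensions, 8 spectral channels.
rng = np.random.default_rng(0)
phone_targets = rng.normal(size=(2, 3))      # one target vector per phone type
phone_tau = np.array([3.0, 5.0])             # time-constants, in frames (assumed units)
labels = np.array([0, 0, 0, 1, 1, 1, 1])     # aligned phonetic transcription, per frame

frame_targets = phone_targets[labels]                # (7, 3)
alpha = np.exp(-1.0 / phone_tau[labels])[:, None]    # (7, 1), assumed tau-to-alpha mapping
state = smooth_targets(frame_targets, alpha)         # hidden dynamic state

W1, b1 = rng.normal(scale=0.1, size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(scale=0.1, size=(5, 8)), np.zeros(8)
predicted = mlp_spectrum(state, W1, b1, W2, b2)      # synthetic log spectrogram, (7, 8)

observed = np.zeros((7, 8))   # stand-in for real log power spectrum frames
loss = squared_error(predicted, observed)
```

Every step here is differentiable with respect to the targets, the smoothing coefficients and the MLP weights, which is what makes gradient descent on the squared error applicable.
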
We find that the system is able to learn interesting
representations quite easily. For some purposes a few minutes of
speech are sufficient. There is no need for fancy initialization
methods (zeros and small random numbers work), but it is certainly possible
to initialize or fix some parts of the system to make it conform to particular
ideas. For example, a system with a six-dimensional hidden space can
have three of its dimensions fixed to be formant frequencies according to a favorite
synthesis-by-rule system, and the other dimensions will "take up the slack".
Or the MLP can be fixed to be an approximation to a known relationship
between articulator positions and spectrum shapes, and the system will
learn phonetic targets and dynamics.
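
One simple way to fix part of the system during training, as in the examples above, is to mask the gradient so that the fixed parameters are never updated. The sketch below assumes plain gradient descent and a hypothetical helper `masked_gradient_step`; the gradient array is a placeholder for whatever the error derivative computation produces.

```python
import numpy as np

def masked_gradient_step(params, grads, trainable_mask, learning_rate=0.01):
    """One gradient-descent update that leaves masked-out entries untouched.

    trainable_mask holds 1.0 for parameters to be learnt and 0.0 for fixed ones.
    """
    return params - learning_rate * grads * trainable_mask

# Hypothetical example: 4 phone types in a six-dimensional hidden space,
# with the first three dimensions held at externally supplied values.
n_phones, n_dims = 4, 6
targets = np.zeros((n_phones, n_dims))
mask = np.ones((n_phones, n_dims))
mask[:, :3] = 0.0                      # first three dimensions are fixed

grads = np.ones((n_phones, n_dims))    # placeholder gradient of the squared error
targets = masked_gradient_step(targets, grads, mask, learning_rate=0.1)
```

The same masking idea would apply to holding the MLP weights fixed while the targets and time-constants are learnt.
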
We currently use only a very simple idea of phonetic
segments, but an extension could deal in features, perhaps with some asynchrony.
This work can be seen as a special case of an approach
first described by Raimo Bakis in 1991. The work of Li Deng is also
relevant. The most recent development took place in a WS98 project at CLSP,
Johns Hopkins University, in summer 1998.