Figure 1. Sample FETEX and PHOM networks. In the graph above each FETEX
and PHOM network, the red piecewise linear curves represent the training
target response, and the green curves are the possible outputs by trained
networks.
Target Specification: The target responses for the FETEX and PHOM networks are assumed to be piecewise linear functions spanning the entire duration of the phone, shown in Figure 1 as red "trapezoids" in the graphs above each network. Figure 2 shows a more detailed sample target response and possible outputs. Each "trapezoid" in the target responses can be determined by five parameters:
Note that the desired objective is to have the network respond in a positive and consistent fashion to certain features of the input signal; the piecewise target function provides only a hint of the "ideal" network response. We have previously applied this strategy of specifying an under-determined target to other applications of Temporal Flow Model (TFM) networks. To further explain the idea of target specification, we present one sample from each of two tasks: syllabic segmentation and single wordspotting. In both cases, the networks were successfully trained to generalize beyond the simple training target responses.
Sample 1 - Syllabic Segmentation: In this task, a TFM neural network was trained to respond to the presence of each syllable in an input speech signal stream. Figure 3 shows the target response for one training sentence, which contains five syllables. For each syllable, we obtained the onset and ending times from the syllabic transcription, and specified a Gaussian curve centered at the middle of the syllable, with a standard deviation of one quarter of the syllable duration. The Gaussians were scaled to have similar peak heights. Again, the target response provides only a hint of the "ideal" network response, and may be replaced by different curves of similar shape. Figure 4 shows the actual output for the same sentence from a trained network, plotted over the target response.
Figure 3. Sample network target response for a syllabic segmentation
task. The red curves are the Gaussian target response curves.
The blue dotted lines are the true syllable boundaries obtained from the
syllabic transcription.
Figure 4. Sample network outputs and target response for
the same sentence as in Figure 3. The green curves are the actual
outputs obtained from a trained TFM network.
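The Gaussian target construction described in Sample 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the experiments: the function name `gaussian_target`, the 10 ms frame step, the unit peak height, and the use of a pointwise maximum to combine neighboring curves are all assumptions. Only the center (middle of the syllable) and the standard deviation (one quarter of the syllable duration) come from the description above.

```python
import math


def gaussian_target(syllables, n_frames, frame_step=0.01, peak=1.0):
    """Return a per-frame target response for a list of syllables.

    syllables  -- list of (onset, end) times in seconds, from a transcription
    n_frames   -- number of output frames to generate
    frame_step -- frame spacing in seconds (assumed value, not from the text)
    peak       -- common peak height for every Gaussian (assumed value)
    """
    target = [0.0] * n_frames
    for onset, end in syllables:
        if end <= onset:
            continue  # skip degenerate segments to avoid a zero sigma
        center = 0.5 * (onset + end)   # middle of the syllable
        sigma = 0.25 * (end - onset)   # one quarter of the duration
        for i in range(n_frames):
            t = i * frame_step
            value = peak * math.exp(-0.5 * ((t - center) / sigma) ** 2)
            # Assumption: where neighboring curves overlap, keep the larger.
            target[i] = max(target[i], value)
    return target


if __name__ == "__main__":
    # Two hypothetical syllables, 0.10-0.30 s and 0.40-0.60 s.
    curve = gaussian_target([(0.10, 0.30), (0.40, 0.60)], n_frames=80)
    print(" ".join(f"{v:.2f}" for v in curve[::5]))
```

Because every curve is written with the same unnormalized peak, all syllables receive the same peak height regardless of duration, which is one simple way to obtain "similar peak heights".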
Sample 2 - Single Wordspotting:
Each TFM network in the single wordspotting task was trained to respond
only to a specific word in the input speech stream. For example,
a network trained for the word "three" should only have significant output
activation during the presence of a "three" in the input speech stream.
Note that the target response for a TFM is by no means limited to piecewise
linear or Gaussian curves. In the wordspotting task, we found it
more appropriate to use a target response whose shape was composed of two
back-to-back sigmoid curves, as shown in the sample in Figure
5. The first sigmoid for each target word starts at the onset
of the word, and rises smoothly until near the ending of the word.
The second (inverted) sigmoid starts from the ending of the word and drops
sharply to the minimum. Figure 6 shows the actual
output for the same sentence produced by a trained network, plotted over
the target response.
Figure 5. Sample network target response for a wordspotting task,
for recognizing the word "three". The two red curves are the target
responses, and each is composed of two back-to-back sigmoid curves.
The blue lines indicate the true boundaries of each occurrence of the
word "three", obtained from transcription.
Figure 6. Sample network outputs and target response to the same
sentence as in Figure 5. The green curves are the actual outputs of
the trained network.
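The back-to-back sigmoid target described in Sample 2 can be approximated as the product of a slowly rising sigmoid spanning the word and a steep inverted sigmoid placed just after the word ending. The sketch below is a hypothetical construction, not the one used in the experiments: the function names, the 10 ms frame step, the 30 ms fall time, the factor of 12 that sets the sigmoid slopes, and the use of a product to join the two halves are all assumptions.

```python
import math


def sigmoid(x):
    """Logistic function, written with tanh so large |x| cannot overflow."""
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def wordspotting_target(words, n_frames, frame_step=0.01, fall_time=0.03):
    """Return a per-frame target response for occurrences of one word.

    words      -- list of (onset, end) times in seconds for the target word
    n_frames   -- number of output frames to generate
    frame_step -- frame spacing in seconds (assumed value)
    fall_time  -- time after the word ending over which the target drops
                  (assumed value)
    """
    target = [0.0] * n_frames
    for onset, end in words:
        if end <= onset:
            continue  # skip degenerate segments
        # Rising sigmoid: near 0 at the onset, near 1 at the word ending.
        rise_mid = 0.5 * (onset + end)
        rise_scale = (end - onset) / 12.0
        # Inverted sigmoid: near 1 at the ending, near 0 fall_time later.
        fall_mid = end + 0.5 * fall_time
        fall_scale = fall_time / 12.0
        for i in range(n_frames):
            t = i * frame_step
            rise = sigmoid((t - rise_mid) / rise_scale)
            fall = sigmoid((fall_mid - t) / fall_scale)
            # Assumption: multiple occurrences are combined by maximum.
            target[i] = max(target[i], rise * fall)
    return target


if __name__ == "__main__":
    # One hypothetical occurrence of the word, from 0.20 s to 0.50 s.
    curve = wordspotting_target([(0.20, 0.50)], n_frames=100)
    print(" ".join(f"{v:.2f}" for v in curve[::5]))
```

Making `fall_time` much shorter than the word duration gives the asymmetric shape described above: a smooth rise across the word followed by a sharp drop after its ending.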