Information on FETEX Network Target Response Specification

The target outputs for training a FETEX network will be derived from phone-level transcription.  Each phonetic feature associated with a phone segment will initially be assumed to be coextensive with that phone (for the TIMIT corpus).  For example, the features "labial", "voiceless" and "obstruent" will be assumed to be "active" during the entire duration of the phone.  Figure 1 is an illustrative diagram of selected FETEX networks and PHOM networks, with their corresponding training target responses and possible outputs.

Figure 1. Sample FETEX and PHOM networks. In the graph above each FETEX and PHOM network, the red piecewise linear curves represent the training target responses, and the green curves are possible outputs of trained networks.
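
The coextensive assumption described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the phone-to-feature table, the function name `coextensive_targets`, and the frame hop of 160 samples are hypothetical and not part of the FETEX specification. Segment boundaries are taken as integer sample indices, as in TIMIT phone transcriptions.

```python
# Hypothetical phone-to-feature table for illustration only; the actual
# FETEX feature inventory is not specified in this document.
PHONE_FEATURES = {
    "p": {"labial", "voiceless", "obstruent"},
    "aa": {"voiced", "vocalic"},
}


def coextensive_targets(segments, feature, hop=160):
    """Return one 0/1 target value per frame for a single feature.

    segments: list of (start_sample, end_sample, phone) tuples in time order.
    feature:  name of the phonetic feature this network should detect.
    hop:      frame step in samples (assumed value, not from the source).
    """
    last_sample = segments[-1][1]
    n_frames = -(-last_sample // hop)  # ceiling division on integers
    targets = [0.0] * n_frames
    for start, end, phone in segments:
        if feature not in PHONE_FEATURES.get(phone, set()):
            continue
        for i in range(n_frames):
            # The feature is "active" for the whole duration of the phone.
            if start <= i * hop < end:
                targets[i] = 1.0
    return targets


if __name__ == "__main__":
    segs = [(0, 800, "p"), (800, 2400, "aa")]
    print(coextensive_targets(segs, "labial"))
```

A rectangular target like this is only the starting point; the trapezoidal targets described next soften the onset and offset.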
 
 

Target Specification:  The target responses for the FETEX and PHOM networks are assumed to be piecewise linear functions spanning the entire duration of the phone, shown in Figure 1 as red "trapezoids" in the graphs above each network.  Figure 2 shows a more detailed sample target response and possible outputs.  Each "trapezoid" in the target responses is determined by five parameters:

  1. s1, the starting time of the ramp-up (onset);
  2. t1, the ending time of the ramp-up;
  3. s2, the starting time of the ramp-down (offset);
  4. t2, the ending time of the ramp-down;
  5. h, the height of the plateau.
The parameters s1 and t2 can be readily obtained from the phonetic transcriptions, while t1, s2, and h can be selected experimentally.  In each graph, the green curves represent possible outputs that a trained network might produce.
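
As a minimal sketch, the trapezoid defined by the five parameters above can be evaluated at any time t as follows. The function name `trapezoid_target` and the ordering check are our own illustrative choices; only the parameters s1, t1, s2, t2, and h come from the specification.

```python
def trapezoid_target(t, s1, t1, s2, t2, h):
    """Piecewise linear "trapezoid" target value at time t.

    s1: start of the ramp-up (onset)    t1: end of the ramp-up
    s2: start of the ramp-down (offset) t2: end of the ramp-down
    h:  height of the plateau
    """
    if not (s1 <= t1 <= s2 <= t2):
        raise ValueError("expected s1 <= t1 <= s2 <= t2")
    if t <= s1 or t >= t2:
        return 0.0                         # outside the phone
    if t < t1:
        return h * (t - s1) / (t1 - s1)    # linear ramp-up
    if t <= s2:
        return h                           # plateau
    return h * (t2 - t) / (t2 - s2)        # linear ramp-down


if __name__ == "__main__":
    # Sample the target every 0.5 time units for an example phone.
    for step in range(9):
        t = 0.5 * step
        print(t, trapezoid_target(t, s1=0, t1=1, s2=3, t2=4, h=1.0))
```

Setting t1 equal to s1 and s2 equal to t2 reduces the trapezoid to a rectangle that is coextensive with the phone.
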
Figure 2.  Detailed sample target response and possible output.  The labels s1, t1, s2, t2, and h are defined as above.

Note that the objective is to have the network respond in a positive and consistent fashion to certain features of the input signal; the piecewise linear target function provides only a hint of the "ideal" network response.  We have previously applied this strategy of specifying an under-determined target to other applications of Temporal Flow Model (TFM) networks.  To further explain the idea of target specification, we present one sample from each of two tasks: syllabic segmentation and single wordspotting.  In both cases, the networks were successfully trained to generalize beyond the simple training target responses.

Sample 1 - Syllabic Segmentation:  In this task, a TFM neural network was trained to respond to the presence of each syllable in an input speech stream.  Figure 3 shows the target response for one training sentence, which contains five syllables.  For each syllable, we obtained the onset and ending times from the syllabic transcription, and specified a Gaussian curve centered at the middle of the syllable, with a standard deviation of one quarter of the syllable duration.  The Gaussians were scaled to have similar peak heights.  Again, the target response provided only a hint of the "ideal" network response, and could be replaced with different curves of similar shape.  Figure 4 shows the actual output for the same sentence from a trained network, plotted over the target response.
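
The Gaussian target described above can be sketched as follows. The center and standard deviation follow the text; the common peak height of 1.0 and the use of a pointwise maximum to combine neighboring syllables are assumptions made for this illustration, as are the function names.

```python
import math


def syllable_gaussian(t, onset, offset, peak=1.0):
    """Gaussian target for one syllable, evaluated at time t.

    Centered at the middle of the syllable, with a standard deviation
    of one quarter of the syllable duration. Scaling by a fixed `peak`
    (assumed value) gives every syllable the same peak height.
    """
    center = 0.5 * (onset + offset)
    sigma = 0.25 * (offset - onset)
    z = (t - center) / sigma
    return peak * math.exp(-0.5 * z * z)


def sentence_target(t, syllables, peak=1.0):
    """Target for a whole sentence given (onset, offset) pairs.

    Combining syllables with max() is an assumption of this sketch.
    """
    return max(syllable_gaussian(t, on, off, peak) for on, off in syllables)


if __name__ == "__main__":
    syllables = [(0.0, 0.4), (0.4, 1.0)]  # times in seconds
    for step in range(11):
        t = 0.1 * step
        print(round(t, 1), round(sentence_target(t, syllables), 3))
```

With this choice of standard deviation, each syllable boundary lies two standard deviations from the center, so the target falls to about 13.5% of its peak at the boundary.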

Figure 3.  Sample network target response for a syllable segmentation task.  The red curves are the target response Gaussian curves.  The blue dotted lines are the true syllable boundaries obtained from the syllabic transcription.
Figure 4.  Sample network outputs and target response for the same sentence as in Figure 3.  The green curves are the actual outputs obtained from a trained TFM network.

Sample 2 - Single Wordspotting:  Each TFM network in the single wordspotting task was trained to respond only to a specific word in the input speech stream.  For example, a network trained for the word "three" should have significant output activation only during the presence of a "three" in the input speech stream.  Note that the target response for a TFM network is by no means limited to piecewise linear or Gaussian curves.  In the wordspotting task, we found it more appropriate to use a target response whose shape was composed of two back-to-back sigmoid curves, as shown in the sample in Figure 5.  The first sigmoid for each target word starts at the onset of the word and rises smoothly until near the end of the word.  The second (inverted) sigmoid starts at the end of the word and drops sharply to the minimum.  Figure 6 shows the actual output for the same sentence produced by a trained network, plotted over the target response.
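
One possible way to build such a back-to-back sigmoid target is sketched below. The text fixes only the qualitative shape, so the logistic form, the `steepness` value, the `fall_width` of the sharp drop, and the function names are all assumptions chosen for illustration, not the parameters used in our experiments.

```python
import math


def logistic(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def word_target(t, onset, offset, fall_width=0.02, steepness=6.0):
    """Back-to-back sigmoid target for one occurrence of the target word.

    Before `offset`, a slow logistic rises across the whole word.
    From `offset` on, an inverted logistic drops within `fall_width`.
    Both pieces equal logistic(steepness) at `offset`, so the curve
    is continuous there.
    """
    if t < offset:
        mid = 0.5 * (onset + offset)
        k = 2.0 * steepness / (offset - onset)
        return logistic(k * (t - mid))
    mid = offset + 0.5 * fall_width
    k = 2.0 * steepness / fall_width
    return logistic(-k * (t - mid))


def wordspotting_target(t, occurrences, **kwargs):
    """Target for a sentence with several (onset, offset) occurrences."""
    return max(word_target(t, on, off, **kwargs) for on, off in occurrences)


if __name__ == "__main__":
    occurrences = [(1.0, 1.4), (2.5, 2.8)]  # times in seconds
    for step in range(0, 31, 2):
        t = 0.1 * step
        print(round(t, 1), round(wordspotting_target(t, occurrences), 3))
```

The asymmetry is deliberate in this sketch: the slow rise lets the activation build up over the word, while the sharp drop marks its end.
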
Figure 5.  Sample network target response for a wordspotting task, for recognizing the word "three".  The two red curves are the target responses, and each is composed of two back-to-back sigmoid curves.  The blue lines indicate the true boundaries of each occurrence of the word "three", obtained from the transcription.
Figure 6.  Sample network outputs and target response for the same sentence as in Figure 5.  The green curves are the actual outputs of the trained network.


