ICSI SRE 2004 System Description
-------------------------------------

The submitted system is an implementation of that described in [1].
Please refer to the paper for greater detail. The system uses
background and target keyword models generated by Hidden Markov Models
(HMMs) for 19 select keywords (where a "keyword" is a word or common
word-pair) drawn primarily from the discourse marker, backchannel and
filled pause categories. The system employs speaker-independent
keyword-specific HMMs which are then adapted to the target training
data to create target models, and computes test scores using the usual
likelihood ratio of target to background. Only intervals of speech
corresponding to the 19 keywords are scored.

This system is intended to serve as a contrast to a submission by SRI
which combines scores from this system with those of a more
conventional frame-based GMM system.

Feature Extraction
====================

The HMM feature vectors consist of 19 mel cepstra, the zeroth cepstrum,
and their first differences, for a total of 40 features per vector.
Cepstral Mean Subtraction was performed over the union of speech-rich
segments for each conversation side (segmentation provided by SRI).

Background Model
====================

Keyword UBMs were obtained by training on data from the SRE 2003
Extended Data set (Switchboard II, phases 2 and 3), using splits 6
through 10. We anticipate mismatch with the evaluation data, e.g. due
to the absence of cellular data and the limited demographic
distribution in this set, and intend to examine the effects of using
background models from a better matched training set after the
evaluation.

The keyword HMMs were simple left-to-right state sequences with
self-loops and no skips.
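
As an illustration of the topology just described, the following is a
minimal sketch of a left-to-right transition matrix with self-loops and
no skips. It is not taken from the submitted system: the function name,
the extra non-emitting exit column, and the initial self-loop
probability of 0.5 are assumptions made for the example, and in
practice the probabilities would be re-estimated during training.

```python
def left_to_right_transitions(n_states, self_loop=0.5):
    """Build a row-stochastic transition matrix for a left-to-right HMM.

    Each emitting state i may stay in place (self-loop) or advance to
    state i + 1; no state is skipped. The final column is a hypothetical
    non-emitting exit state reached only from the last emitting state.
    The value of self_loop is an assumed initialization, not a figure
    from the system description.
    """
    if n_states < 1:
        raise ValueError("n_states must be at least 1")
    if not 0.0 < self_loop < 1.0:
        raise ValueError("self_loop must lie strictly between 0 and 1")

    matrix = []
    for i in range(n_states):
        row = [0.0] * (n_states + 1)
        row[i] = self_loop            # remain in the current state
        row[i + 1] = 1.0 - self_loop  # advance to the next state (or exit)
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for row in left_to_right_transitions(3):
        print(row)
```

Every entry above the first superdiagonal is zero, which is what "no
skips" means in matrix form.
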
Each state model consisted of a mixture of four Gaussians. The number
of states for each keyword was defined to be the smaller of two
quantities: the number of phones in the standard pronunciation of the
word multiplied by three, and the median duration in frames divided by
four. All modeling and scoring was performed using the HMM Toolkit,
HTK.

Training
====================

Speaker-specific keyword models were obtained by MAP adaptation of the
background models, adapting only the means of the Gaussians. If there
is no training data for a particular keyword, the UBM is simply copied
as the speaker-specific model. This removes the influence of that
keyword: its contribution to the overall score is zero because the
target and background log probabilities cancel.

Keyword locations within the audio file were determined by word-level
alignment information made available from SRI's automatic speech
recognition (ASR) system. (See their submission for more detail on the
speech recognizer.)

Testing
====================

Each keyword appearing in the test segment is scored by taking the
difference between the log probabilities obtained from scoring the
speaker-specific and UBM models against the test tokens. The final
score is obtained by adding these keyword scores and normalizing by the
total number of frames. Again, keyword locations were obtained from
SRI's ASR output.

*Note: For trials involving non-English test segments or having no
English adaptation data, a score was not computed since no ASR output
was provided by SRI for these files. Instead, a dummy value of '0.0'
was assigned to the trial.

Score Normalization
====================

Standard normalizations such as ZNORM and TNORM have not been included
in this system owing to time constraints, although they are known to
help mitigate the effects of acoustic mismatch. We intend to examine
this following the evaluation.

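The state-count rule from the Background Model section and the scoring
rule from the Testing section can be sketched as follows. This is an
illustrative sketch only, not the HTK-based implementation used in the
submission. The function names and the tuple layout are hypothetical,
and three details are assumptions not stated above: the division by
four is floored, at least one state is always allocated, and a trial
with no keyword tokens returns 0.0.

```python
def num_states(n_phones, median_duration_frames):
    """Number of HMM states for a keyword.

    Takes the smaller of (3 * number of phones) and (median duration in
    frames / 4). Flooring the division and enforcing a minimum of one
    state are assumptions of this sketch.
    """
    return max(1, min(3 * n_phones, median_duration_frames // 4))


def trial_score(tokens):
    """Frame-normalized log likelihood ratio over keyword tokens.

    tokens is an iterable of (target_loglik, ubm_loglik, n_frames)
    tuples, one per keyword token found in the test segment. Returning
    0.0 when there are no frames is an assumption that mirrors the
    dummy value used for trials without ASR output.
    """
    total_llr = 0.0
    total_frames = 0
    for target_loglik, ubm_loglik, n_frames in tokens:
        total_llr += target_loglik - ubm_loglik
        total_frames += n_frames
    if total_frames == 0:
        return 0.0
    return total_llr / total_frames


if __name__ == "__main__":
    # Second token: target model is a copy of the UBM, so its log
    # likelihood ratio is zero and it adds nothing to the numerator.
    example = [(-100.0, -110.0, 20), (-50.0, -50.0, 10)]
    print(num_states(2, 40), trial_score(example))
```

In this sketch a keyword whose target model is a copy of the UBM still
adds its frames to the denominator; whether the submitted system counts
those frames is not specified in the description.
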
Computation
====================

All computation was performed on a fleet of Intel 2.8GHz Xeon
processors with 2GB of memory. The processing times are as follows:

Feature Extraction
----------------------
    total elapsed: 302068.00s
    total cputime:  33296.20s

1-side/1-side Scoring
----------------------
Training time
    total elapsed:   6123.93s
    total cputime:    204.64s
Testing time
    total elapsed: 130014.00s
    total cputime:  25489.00s

8-sides/1-side Scoring
----------------------
Training time
    total elapsed:   8436.58s
    total cputime:    440.54s
Testing time
    total elapsed:  53045.90s
    total cputime:  16865.00s

*Note: These times do not include the processing required to generate
the ASR output used to locate the keywords. This information can be
found in the SRI system description.

Acknowledgments
====================

We would like to thank SRI for their assistance in this evaluation,
especially for providing the ASR output used in this submission.

References
====================

[1] K. Boakye and B. Peskin, "Text-Constrained Speaker Recognition on a
    Text-Independent Task," Odyssey 2004: The Speaker and Language
    Recognition Workshop, Toledo, Spain, May/June 2004.