"Using Relative Frequencies of Phone Bigrams as Features for Speaker ID"
I will discuss a research framework in which counts of acoustic state occupancies are used as features for speaker ID. In my current system, these "acoustic state occupancies" are simply relative frequencies of phone bigrams, obtained by running open-loop phone decoding on each input conversation side. The relative frequencies serve as features for training various types of speaker models, including models based on support vector machines (SVMs). The talk will focus on recent experiments that have yielded large gains over the previous state of the art in phone-based modeling. I will also discuss new work on optimizing SVMs for training phone-based speaker models.
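As a rough illustration of the feature computation described above, the following Python sketch turns one decoded phone sequence into relative frequencies of phone bigrams. It is not the system presented in the talk: the function name `bigram_relative_frequencies`, the phone labels, and the example sequence are all hypothetical, and a real system would take its input from a phone decoder rather than a hand-written list.

```python
from collections import Counter


def bigram_relative_frequencies(phones):
    """Return {(phone_a, phone_b): relative frequency} for one phone sequence.

    Each adjacent pair of phones is one bigram; its relative frequency is its
    count divided by the total number of bigrams in the sequence.
    """
    bigrams = list(zip(phones, phones[1:]))
    if not bigrams:
        # Fewer than two phones: no bigrams, so no features.
        return {}
    counts = Counter(bigrams)
    total = len(bigrams)
    return {bigram: count / total for bigram, count in counts.items()}


# Hypothetical decoder output for one conversation side (assumed labels).
decoded = ["sil", "hh", "ax", "l", "ow", "sil", "hh", "ax"]
features = bigram_relative_frequencies(decoded)
```

The resulting dictionary is a sparse feature vector whose values sum to one. To train an SVM or another speaker model, it would typically be mapped onto a fixed bigram inventory so that every conversation side yields a vector of the same dimension.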