Speaker Recognition

Introduction

How do we recognize the speech of those familiar to us? We might call on any of a huge web of interrelated features, including the basic "sound" of a speaker's voice in terms of timbre, pitch, and acoustic spectrum, but we might also be aided by a characteristic accent, a distinctive laugh, or a turn of phrase. Depending on the context and the acoustic conditions, certain features may be particularly valuable: the melody of a speaker's voice when the words can't be distinguished, the choice and articulation of an opening greeting when picking up the telephone, or even the language in a written transcription of a meeting when the audio is not available. Clearly, we as humans draw on many different types of information at many different levels, which gives us a singularly robust and adaptive mechanism for identifying the speakers we know.

Yet most automatic speaker recognition systems today rely entirely on low-level acoustic features. These are extracted from the speech signal every 10-20 milliseconds and encoded as a series of frames, which are generally modeled as independent events with no account of temporal evolution (beyond, potentially, simple local difference parameters). The goal of this project is to explore higher-level features (prosodic patterns, pronunciation preferences, word usage, speaker idiosyncrasies, etc.) to aid in recognizing and distinguishing between speakers.
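
As a rough illustration of the frame-based processing described above, the following Python sketch cuts a signal into short overlapping frames, computes a small low-level spectral feature vector for each frame, and appends simple local difference ("delta") parameters. It is not the feature set of any particular system: the 25 ms window, 10 ms hop, coarse log band energies, and all function names are assumptions made for illustration.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Cut a 1-D signal into overlapping frames (window and hop sizes are assumed)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds samples [i * hop_len, i * hop_len + frame_len).
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

def log_band_energies(frames, n_bands=8):
    """Coarse log spectral energies per frame, a stand-in for real acoustic features."""
    window = np.hamming(frames.shape[1])
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    bands = np.array_split(power, n_bands, axis=1)
    energy = np.stack([band.sum(axis=1) for band in bands], axis=1)
    return np.log(energy + 1e-10)  # small floor avoids log(0)

def delta(features):
    """Simple local difference: half the difference between the next and previous frame."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

# One second of synthetic noise at 16 kHz stands in for speech.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)
static = log_band_energies(frame_signal(signal, 16000))
features = np.hstack([static, delta(static)])
print(features.shape)  # (98, 16): 98 frames, 8 static + 8 delta values each
```

Each row of the result describes one frame in isolation, apart from the delta columns. This is the limitation the project aims to move beyond: nothing in such a representation captures prosody, word choice, or other longer-term speaker behavior.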

The research plan consists of two main "feature discovery" tracks: one focused on features motivated by existing linguistic constructs and expert-guided feature extraction, the other on the purely data-driven discovery of characteristic speaker "performances" as sequences in spectro-temporal space, independent of such linguistic constructs. We believe that pairing low-risk mining of expert-guided features with highly exploratory, higher-risk data-driven feature discovery provides the best framework for this research. It offers a natural contrast for assessing progress and an opportunity to combine two very different families of features to improve overall performance.
