Speaker Diarization
Speaker diarization consists of segmenting and clustering a speech recording into speaker-homogeneous regions, so that, given an audio track of a meeting, the system automatically discriminates and labels the different speakers ("who spoke when?"). This entails speech/non-speech detection ("when is there speech?") as well as overlap detection and resolution ("who is overlapping with whom?"). ICSI has a long history of research in this area and has contributed repeatedly to the state of the art. Current research aims to improve the robustness and efficiency of existing approaches, to develop multimodal approaches for video analysis, and to create online algorithms ("who is speaking now?"). In addition, ICSI is exploring applications built on top of diarization, such as inferring behavioral categories of a person from speaking length and/or interruptions, or enabling semantic navigation in TV shows.
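The basic pipeline described above (speech/non-speech detection, segmentation, and speaker clustering) can be illustrated with a minimal sketch. The example below is not ICSI's system; it assumes an energy-based speech detector, fixed one-second segments, mean-MFCC segment features, and a known number of speakers (two), all of which are simplifying assumptions, and the file name "meeting.wav" is hypothetical.

import numpy as np
import librosa
from sklearn.cluster import AgglomerativeClustering

# Load a hypothetical meeting recording at 16 kHz.
audio, sr = librosa.load("meeting.wav", sr=16000)
seg_len = sr  # fixed 1-second segments (a simplification)

features, starts = [], []
for start in range(0, len(audio) - seg_len + 1, seg_len):
    chunk = audio[start:start + seg_len]
    # Crude speech/non-speech detection: skip low-energy (likely silent) segments.
    if np.sqrt(np.mean(chunk ** 2)) < 0.01:
        continue
    # Represent each speech segment by its mean MFCC vector.
    mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=19)
    features.append(mfcc.mean(axis=1))
    starts.append(start / sr)

# Cluster segments into speaker-homogeneous groups. Two speakers are assumed
# here; real diarization systems estimate the number of speakers automatically.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(np.array(features))

for t, spk in zip(starts, labels):
    print(f"{t:6.1f}s  speaker {spk}")

A production system would replace each of these steps with stronger components (a trained speech activity detector, speaker embeddings, and a clustering criterion that also estimates the number of speakers), but the overall structure is the same.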
More details can be found at: http://diarization.icsi.berkeley.edu
