The task of speaker diarization is to segment and cluster a speech recording into speaker-homogeneous regions. Given an audio track of a meeting, for example, the system has to discriminate and label the different speakers automatically ("who spoke when?"). This includes subtasks such as "when is there speech?" (speech/non-speech detection) and "who is overlapping with whom?" (overlap detection and resolution).
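To make the "who spoke when?" output concrete, here is a minimal sketch of how a diarization result might be represented and how overlapping speech could be located in it. The `Segment` type, the `find_overlaps` helper, and the example times are hypothetical illustrations, not part of any particular diarization system.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """One speaker-homogeneous region (hypothetical representation)."""
    start: float   # seconds from the beginning of the recording
    end: float     # seconds from the beginning of the recording
    speaker: str   # label assigned by the clustering step


def find_overlaps(segments: List[Segment]) -> List[Tuple[float, float, str, str]]:
    """Return (start, end, speaker_a, speaker_b) for every pair of
    segments from different speakers that overlap in time."""
    ordered = sorted(segments)  # sorts by start time first
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                # Later segments start even later, so none can overlap a.
                break
            if a.speaker != b.speaker:
                overlaps.append((b.start, min(a.end, b.end), a.speaker, b.speaker))
    return overlaps


# Made-up example: speaker B starts talking before speaker A has finished.
example = [
    Segment(0.0, 5.0, "A"),
    Segment(4.0, 8.0, "B"),
    Segment(9.0, 12.0, "A"),
]
overlap_regions = find_overlaps(example)
```

In this representation, speech/non-speech detection corresponds to deciding where segments exist at all, and clustering corresponds to assigning the `speaker` label.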
Currently, we are trying to improve the efficiency of existing approaches and to create online algorithms ("who is speaking now?"). In addition, we are exploring applications built on top of diarization, such as inferring behavioral categories of a person from speaking length and/or interruptions, or semantic navigation in TV shows (see above).
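As a rough illustration of the kind of features such an application could start from, the sketch below computes total speaking time and a simple interruption count per speaker from diarization output. The `speaking_stats` function, the tuple layout `(start, end, speaker)`, and the definition of an interruption (a turn that begins while another speaker's turn is still running) are assumptions made for this example, not the method used in the project.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Turn = Tuple[float, float, str]  # (start, end, speaker), times in seconds


def speaking_stats(turns: List[Turn]) -> Tuple[Dict[str, float], Dict[str, int]]:
    """Return (total speaking time, interruption count) per speaker.

    Assumption: a turn counts as an interruption if it starts strictly
    inside an earlier-starting turn of a different speaker.
    """
    total_time: Dict[str, float] = defaultdict(float)
    interruptions: Dict[str, int] = defaultdict(int)
    ordered = sorted(turns)  # sorts by start time first
    for i, (start, end, speaker) in enumerate(ordered):
        total_time[speaker] += end - start
        for prev_start, prev_end, prev_speaker in ordered[:i]:
            if prev_speaker != speaker and prev_start < start < prev_end:
                interruptions[speaker] += 1
                break  # count each turn as at most one interruption
    return dict(total_time), dict(interruptions)


# Made-up example turns.
turns = [(0.0, 5.0, "A"), (4.0, 8.0, "B"), (9.0, 12.0, "A")]
time_per_speaker, interruptions_per_speaker = speaking_stats(turns)
```

Mapping such counts to behavioral categories would require a further model or set of rules, which this sketch does not attempt.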
