Multimodal Speaker Diarization and Localization
Gerald Friedland
ICSI
Tuesday, November 25, 2008
12:30
Research in cognitive psychology suggests that the human brain is able to
integrate different sensory modalities, such as sight, sound, and touch,
into a perceptual experience that is coherent and unified. Experiments
show that by considering input from multiple sensors, perceptual problems
can be solved more robustly and even more efficiently. In computer science,
however, synergistic use of data encoded for different sensory modalities
has not always lived up to its promise.
This talk presents speaker diarization as an example of a multimedia content
analysis task where the integrated use of video and audio information
is beneficial. Traditionally, speaker diarization tries to automatically
identify speakers from a single-source audio track, with the goal of
answering the question "who spoke when". Incorporating information from a
low-resolution video camera significantly improves the accuracy of the ICSI
speaker diarization engine. The talk also shows how the same engine can
localize the speakers as a side effect, extending the question answered by
the approach to "who spoke when and from where".