On the Applicability of Speaker Diarization to Audio Concept Detection for Multimedia Retrieval

Title: On the Applicability of Speaker Diarization to Audio Concept Detection for Multimedia Retrieval
Publication Type: Conference Paper
Year of Publication: 2011
Authors: Mertens R, Huang P-S, Gottlieb L, Friedland G, Divakaran A
Page(s): 446-451
Other Numbers: 3201
Keywords: Audio Clustering, Audio Indexing, Speaker Diarization, Video Indexing
Abstract

Recently, audio concepts emerged as a useful building block in multimodal video retrieval systems. Information like “this file contains laughter”, “this file contains engine sounds” or “this file contains slow music” can significantly improve purely visual-based retrieval. The weak point of current approaches to audio concept detection is that they rely heavily on human annotators. In most approaches, audio material is manually inspected to identify relevant concepts. Then instances that contain examples of relevant concepts are selected – again manually – and used to train concept detectors. This approach comes with two major disadvantages: (1) it leads to rather abstract audio concepts that hardly cover the audio domain at hand and (2) the way human annotators identify audio concepts likely differs from the way a computer algorithm clusters audio data – introducing additional noise in training data. This paper explores whether unsupervised audio segmentation systems can be used to identify useful audio concepts by analyzing training data automatically and whether these audio concepts can be used for multimedia document classification and retrieval. A modified version of the ICSI (International Computer Science Institute) speaker diarization system finds segments in an audio track that have similar perceptual properties and groups these segments. This article provides an in-depth analysis of the statistical properties of similar acoustic segments identified by the diarization system in a predefined document set and the theoretical fitness of this approach to discern one document class from another.

Acknowledgment

This work was partially supported by funding provided to ICSI by the Intelligence Advanced Research Projects Agency (IARPA). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of IARPA or of the U.S. Government.

URL: http://www.icsi.berkeley.edu/pubs/speech/applicabilityof12.pdf
Bibliographic Notes

Proceedings of the IEEE International Symposium on Multimedia (ISM 2011), Dana Point, California, pp. 446-451

Abbreviated Authors

R. Mertens, P.-S. Huang, L. Gottlieb, G. Friedland, and A. Divakaran

ICSI Research Group

Audio and Multimedia

ICSI Publication Type

Article in conference proceedings