Capturing the Acoustic Scene Characteristics for Audio Scene Detection

Title: Capturing the Acoustic Scene Characteristics for Audio Scene Detection
Publication Type: Conference Paper
Year of Publication: 2013
Authors: Elizalde, B. Martinez, Lei H., Friedland G., & Peters N.
Other Numbers: 3600

Scene detection on user-generated content (UGC) aims to classify an audio recording as belonging to a specific scene, such as a busy street, office, or supermarket, rather than to identify an individual sound such as car noise, a computer keyboard, or a cash machine. The difficulty of scene content analysis on UGC lies in the lack of structure and the acoustic variability of the audio. The i-vector system is state-of-the-art in Speaker Verification and Scene Detection and outperforms conventional Gaussian Mixture Model (GMM)-based approaches. The system compensates for undesired acoustic variability and extracts information from the acoustic environment, making it a meaningful choice for detection on UGC. This paper reports our results in the IEEE-AASP Scene Classification Challenge, obtained with a hand-tuned i-vector system and MFCC features on the challenge dataset. Compared to the MFCC+GMM baseline system, our approach increased the classification accuracy by 26.4% relative, to 65.8%. We discuss our approach and highlight the parameters in our system that significantly improved our classification accuracy.
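
The abstract gives the improvement in relative terms, which is easy to misread as an absolute gain. The short sketch below works out what the two reported numbers imply, assuming "relative" means (new - baseline) / baseline. The function name `implied_baseline` is hypothetical and used only for illustration, and the baseline value it prints is derived from the abstract's figures under that assumption, not quoted from the paper.

```python
def implied_baseline(new_accuracy: float, relative_gain: float) -> float:
    """Return the baseline accuracy implied by a relative improvement.

    Assumption: relative_gain = (new_accuracy - baseline) / baseline,
    with both quantities expressed as fractions (0.658 for 65.8%).
    """
    return new_accuracy / (1.0 + relative_gain)


new_acc = 0.658   # classification accuracy reported in the abstract
rel_gain = 0.264  # relative improvement reported in the abstract

baseline = implied_baseline(new_acc, rel_gain)
absolute_gain = new_acc - baseline

print(f"implied baseline accuracy: {baseline:.1%}")
print(f"implied absolute gain:     {absolute_gain:.1%}")
```

Under this reading, the MFCC+GMM baseline would sit at roughly 52% accuracy, so the 26.4% relative gain corresponds to an absolute gain of about 14 percentage points.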


This work was partially supported by funding provided to ICSI by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/NBC, or the U.S. Government.

Bibliographic Notes

Proceedings of the IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events (D-CASE) at the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2013), New Paltz, New York

Abbreviated Authors

B. Elizalde, H. Lei, G. Friedland, and N. Peters

ICSI Research Group

Audio and Multimedia

ICSI Publication Type

Article in conference proceedings