Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling

Title: Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling
Publication Type: Conference Paper
Year of Publication: 2015
Authors: Ashraf, K., Elizalde B. Martinez, Iandola F., Moskewicz M., Bernd J., Friedland G., & Keutzer K.
Other Numbers: 3807

This paper presents advances in analyzing audio content information to detect events in videos, such as a parade or a birthday party. We developed a set of tools for audio processing within the predominantly vision-focused deep neural network (DNN) framework Caffe. Using these tools, we show, for the first time, the potential of using only a DNN for audio-based multimedia event detection. Training DNNs for event detection using the entire audio track from each video causes a computational bottleneck. Here, we address this problem by developing a sparse audio frame-sampling method that improves event-detection speed and accuracy. We achieved a 10 percentage-point improvement in event classification accuracy, with a 200x reduction in the number of training input examples as compared to using the entire track. This reduction in input feature volume led to a 16x reduction in the size of the DNN architecture and a 300x reduction in training time. We applied our method using the recently released YLI-MED dataset and compared our results with a state-of-the-art system and with results reported in the literature for TRECVID MED. Our results show much higher MAP scores compared to a baseline i-vector system, at a significantly reduced computational cost. The speed improvement is relevant for processing videos on a large scale, and could enable more effective deployment in mobile systems.


This work was partially supported by funding provided to ICSI through National Science Foundation grant IIS:1251276 (“SMASH -- Scalable Multimedia content AnalysiS in a High-level language”). It was also supported by Lawrence Livermore National Laboratory, operated by Lawrence Livermore National Security, LLC, for the U.S. Department of Energy, National Nuclear Security Administration, under Contract DE-AC52-07NA27344. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of the NSF, LLNL, or the U.S. Department of Energy.

Bibliographic Notes

Proceedings of the 5th ACM on International Conference on Multimedia Retrieval (ICMR '15), Shanghai, China, pp. 611-614

Abbreviated Authors

K. Ashraf, B. Elizalde, F. Iandola, M. Moskewicz, J. Bernd, G. Friedland, and K. Keutzer

ICSI Research Group

Audio and Multimedia

ICSI Publication Type

Article in conference proceedings