Insights into Audio-Based Multimedia Event Classification with Neural Networks
Title | Insights into Audio-Based Multimedia Event Classification with Neural Networks |
Publication Type | Conference Paper |
Year of Publication | 2015 |
Authors | Ravanelli, M., Elizalde, B., Bernd, J., & Friedland, G. |
Page(s) | 19-23 |
Other Numbers | 3822 |
Abstract | Multimedia Event Detection (MED) aims to identify events (also called scenes) in videos, such as a flash mob or a wedding ceremony. Audio content information complements cues such as visual content and text. In this paper, we explore the optimization of neural networks (NNs) for audio-based multimedia event classification, and discuss some insights towards more effectively using this paradigm for MED. We explore different architectures, in terms of number of layers and number of neurons. We also assess the performance impact of pre-training with Restricted Boltzmann Machines (RBMs) in contrast with random initialization, and explore the effect of varying the context window for the input to the NNs. Lastly, we compare the performance of Hidden Markov Models (HMMs) with a discriminative classifier for the event classification. We used the publicly available event-annotated YLI-MED dataset. Our results showed a performance improvement of more than 6% absolute accuracy compared to the latest results reported in the literature. Interestingly, these results were obtained with a single-layer neural network with random initialization, suggesting that standard approaches with deep learning and RBM pre-training are not fully adequate to address the high-level video event-classification task. |
Acknowledgment | This work was supported in part by the National Science Foundation under Award IIS-1251276 (SMASH: Scalable Multimedia content AnalysiS in a High-level language), and by Lawrence Livermore National Laboratory, operated by Lawrence Livermore National Security, LLC, for the U.S. Department of Energy, National Nuclear Security Administration, under Contract DE-AC52-07NA27344. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU used for this research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of the funders. |
URL | https://www.icsi.berkeley.edu/pubs/multimedia/insightsaudio15.pdf |
Bibliographic Notes | Proceedings of the 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions (MMCommons '15), Brisbane, Australia, pp. 19-23 |
Abbreviated Authors | M. Ravanelli, B. Elizalde, J. Bernd, and G. Friedland |
ICSI Research Group | Audio and Multimedia |
ICSI Publication Type | Article in conference proceedings |