Audio-Concept Features and Hidden Markov Models for Multimedia Event Detection

Title: Audio-Concept Features and Hidden Markov Models for Multimedia Event Detection
Publication Type: Conference Paper
Year of Publication: 2014
Authors: Elizalde, B. Martinez, Ravanelli M., Ni K., Borth D., & Friedland G.
Other Numbers: 3732

Multimedia event detection (MED) on user-generated content is the task of finding an event, e.g., a Flash mob or Attempting a bike trick, using its content characteristics. Recent research has focused on approaches that use semantically defined “concepts” trained with annotated audio clips. Using audio concepts allows us to show semantic evidence of their relationship to events, by looking at the probability distribution of the audio concepts per event. However, while the concept-based approach has been useful in image detection, audio concepts have generally not surpassed the performance of low-level audio features like Mel Frequency Cepstral Coefficients (MFCCs) in addressing the unstructured acoustic composition of video events. Such audio-concept based systems could benefit from temporal information, due to one of the intrinsic characteristics of audio: it occurs across a time interval. This paper presents a multimedia event detection system that uses audio concepts; it exploits the temporal correlation of audio characteristics for each particular event at two levels. The first level involves analyzing the short- and long-term surrounding context information for the audio concepts, through an implementation of a Hierarchical Deep Neural Network (H-DNN), to determine engineered audio-concept features. At the second level, we use Hidden Markov Models (HMMs) to describe the continuous and non-stationary characteristics of the audio signal throughout the video. Experiments using the TRECVID MED 2013 corpus show that an HMM system based on audio-concept features can perform competitively when compared with an MFCC-based system.
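The second level described in the abstract, scoring a video against one HMM per event, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes, for simplicity, that each audio frame has already been reduced to the index of its most likely audio concept, so each HMM has discrete emissions; the paper's features are engineered H-DNN outputs, which would more plausibly call for continuous emission densities. The concept labels, event names, and all probabilities below are hypothetical toy values.

```python
import math


def safe_log(p):
    """Return log(p), mapping zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else -math.inf


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def hmm_log_likelihood(obs, start, trans, emit):
    """Forward algorithm in log space: log P(obs | HMM).

    obs   -- list of observed symbol indices, one per frame
    start -- start[i]: probability of beginning in state i
    trans -- trans[i][j]: probability of moving from state i to state j
    emit  -- emit[i][k]: probability that state i emits symbol k
    """
    n_states = len(start)
    # Initialise with the first observation.
    alpha = [safe_log(start[i]) + safe_log(emit[i][obs[0]])
             for i in range(n_states)]
    # Propagate through the remaining observations.
    for symbol in obs[1:]:
        alpha = [
            log_sum_exp([alpha[i] + safe_log(trans[i][j])
                         for i in range(n_states)])
            + safe_log(emit[j][symbol])
            for j in range(n_states)
        ]
    return log_sum_exp(alpha)


def detect_event(obs, event_models):
    """Score obs under every event HMM; return (best event, all scores)."""
    scores = {name: hmm_log_likelihood(obs, *params)
              for name, params in event_models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical concept indices: 0 = music, 1 = cheering, 2 = speech.
# Each model is (start, trans, emit) with two hidden states; toy values only.
EVENT_MODELS = {
    "flash_mob": (
        [0.6, 0.4],
        [[0.7, 0.3], [0.4, 0.6]],
        [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]],
    ),
    "bike_trick": (
        [0.5, 0.5],
        [[0.9, 0.1], [0.1, 0.9]],
        [[0.1, 0.1, 0.8], [0.2, 0.7, 0.1]],
    ),
}

if __name__ == "__main__":
    frames = [0, 0, 1, 0, 0]  # a mostly "music" sequence
    best, scores = detect_event(frames, EVENT_MODELS)
    print(best, scores)
```

Working in log space avoids numerical underflow on long frame sequences, which matters when a video contributes thousands of frames. A real system would also normalise for sequence length or compare each event score against a background model before thresholding.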


This work was partially supported by funding provided to ICSI by the Intelligence Advanced Research Projects Activity (IARPA) via Department of the Interior National Business Center contract number D11PC20066. The U.S. government is authorized to reproduce and distribute reprints for governmental purposes, notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsement, either expressed or implied, of IARPA, DOI/NBC, or the U.S. government. Additional support was provided by Lawrence Livermore National Laboratory, which is operated by Lawrence Livermore National Security, LLC, for the U.S. Department of Energy, National Nuclear Security Administration, under Contract DE-AC52-07NA27344.

Bibliographic Notes

Proceedings of the Interspeech Workshop on Speech, Language and Audio in Multimedia (SLAM 2014), Penang, Malaysia

Abbreviated Authors

B. Elizalde, M. Ravanelli, K. Ni, D. Borth, and G. Friedland

ICSI Research Group

Audio and Multimedia

ICSI Publication Type

Article in conference proceedings