DCAR: A Discriminative and Compact Audio Representation for Audio Processing

Title: DCAR: A Discriminative and Compact Audio Representation for Audio Processing
Publication Type: Journal Article
Year of Publication: 2017
Authors: Jing, L., Liu, B., Choi, J., Janin, A., Bernd, J., Mahoney, M. W., & Friedland, G.
Published in: IEEE Transactions on Multimedia

This paper presents a novel two-phase method for audio representation, Discriminative and Compact Audio Representation (DCAR), and evaluates its performance at detecting events and scenes in consumer-produced videos. In the first phase of DCAR, each audio track is modeled using a Gaussian mixture model (GMM) whose components capture the variability within that track. The second phase takes into account both global structure and local structure: the components are made more discriminative and compact by formulating an optimization problem on a Grassmannian manifold. The learned components can effectively represent the structure of the audio. Our experiments used the YLI-MED and DCASE Acoustic Scenes datasets. The results show that variants of the proposed DCAR representation consistently outperform four popular audio representations (mv-vector, i-vector, GMM, and HEM-GMM). The advantage is significant for both easier and harder discrimination tasks; we discuss how these performance differences across tasks follow from how each type of model leverages (or doesn’t leverage) the intrinsic structure of the data.
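The two ingredients named in the abstract can be illustrated in a minimal sketch. This is our illustration, not the paper's actual objective function or algorithm: phase 1 fits a per-track GMM (here via scikit-learn on synthetic stand-in features), and phase 2 is represented by a single generic gradient step on a Grassmannian manifold, where a point is a matrix with orthonormal columns and a QR-based retraction maps updates back onto the manifold.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# --- Phase 1 (sketch): model one track's frame-level features with a GMM ---
# The 500 x 20 random matrix stands in for a real track's features (e.g., MFCCs).
rng = np.random.default_rng(0)
frames = rng.standard_normal((500, 20))
gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm.fit(frames)
# The track is represented by its component parameters (weights, means, covariances).
print(gmm.means_.shape)  # (4, 20)

# --- Phase 2 (sketch): one gradient step on a Grassmannian manifold ---
# A point on Gr(n, p) is an n x p matrix with orthonormal columns. After a
# Euclidean gradient step, a QR-based retraction returns to the manifold.
def grassmann_step(X, euclid_grad, step=0.1):
    tangent = euclid_grad - X @ (X.T @ euclid_grad)  # project onto tangent space
    Q, R = np.linalg.qr(X - step * tangent)          # retract via QR
    return Q * np.sign(np.diag(R))                   # fix column signs

X0, _ = np.linalg.qr(rng.standard_normal((6, 2)))
X1 = grassmann_step(X0, rng.standard_normal((6, 2)))
print(np.allclose(X1.T @ X1, np.eye(2)))  # True: columns stay orthonormal
```

The QR retraction is one standard choice for Grassmannian optimization; the paper's actual discriminative objective and update rule are not reproduced here.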


This work was partially supported by the NSFC (61370129, 61375062, 61632004), the PCSIRT (Grant IRT201206), a grant from the Science and Technology Bureau of Baoding City (No. 16ZG026), and a collaborative Laboratory Directed Research & Development grant led by Lawrence Livermore National Laboratory (U.S. Dept. of Energy contract DE-AC52-07NA27344). (Any findings and conclusions are those of the authors, and do not necessarily reflect the views of the funders.) We are grateful to the anonymous TMM and ACMMM reviewers who made many helpful suggestions for improving this paper.

ICSI Research Group: Audio and Multimedia