City-Identification on Flickr Videos Using Acoustic Features

Title: City-Identification on Flickr Videos Using Acoustic Features
Publication Type: Technical Report
Year of Publication: 2011
Authors: Lei, H., Choi, J., & Friedland, G.
Other Numbers: 3077

This article presents an approach that utilizes audio to discriminate the city of origin of consumer-produced videos – a task that is hard to imagine even for humans. Using a sub-set of the MediaEval Placing Task's Flickr video set, we conducted an experiment with a setup similar to a typical NIST speaker recognition evaluation run. Our assumption is that the audio within the same city might be matched in various ways, e.g., language, typical environmental acoustics, etc., without a single outstanding feature being absolutely indicative. Using the NIST speaker recognition framework, a set of 18 cities across the world are used as targets, and Gaussian Mixture Models are trained on all targets. Audio from videos of a test set is scored against each of the targets, and a set of scores is obtained for pairs of test set files and target city models. The Equal Error Rate (EER), which is obtained at a scoring threshold where the number of false alarms equals the misses, is used as the performance measure of
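
The Equal Error Rate mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or the NIST scoring tooling. The function name `equal_error_rate` and the sample scores are hypothetical, and the sketch assumes that trials are split into target trials (test file and city model match) and non-target trials, that a higher score means a better match, and that misses and false alarms are compared as rates rather than raw counts.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) for two lists of trial scores.

    Assumption: a trial is accepted when score >= threshold.
    A miss is a rejected target trial; a false alarm is an
    accepted non-target trial.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    best = None
    # Every observed score is a candidate threshold.
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        miss_rate = sum(s < threshold for s in target_scores) / len(target_scores)
        fa_rate = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss_rate - fa_rate)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (miss_rate + fa_rate) / 2.0, threshold)
    return best[1], best[2]


# Hypothetical scores, for illustration only.
eer, threshold = equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.5, 3.5])
print(eer, threshold)
```

With a finite set of trials the two error rates may never be exactly equal, so the sketch reports the average of the two rates at the threshold where they are closest.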


This work was partially supported by funding provided through the National Geospatial-Intelligence Agency University Research Initiatives program (NGA NURI). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of NGA.

Bibliographic Notes

ICSI Technical Report TR-11-001

Abbreviated Authors

H. Lei, J. Choi, and G. Friedland

ICSI Research Group

Audio and Multimedia

ICSI Publication Type

Technical Report