City-Identification on Flickr Videos Using Acoustic Features

This article presents an approach that uses audio to discriminate the city of origin of consumer-produced videos, a task that is hard to imagine even for humans. Using a subset of the MediaEval Placing Task's Flickr video set, we conducted an experiment with a setup similar to a typical NIST speaker recognition evaluation run. Our assumption is that audio recorded within the same city might be matched in various ways, e.g., language, typical environmental acoustics, etc., without a single outstanding feature being absolutely indicative. Using the NIST speaker recognition framework, a set of 18 cities across the world is used as targets, and Gaussian Mixture Models are trained for all targets. Audio from the videos of a test set is scored against each of the targets, yielding a set of scores for pairs of test set files and target city models. The Equal Error Rate (EER), which is obtained at the scoring threshold where the number of false alarms equals the number of misses, is used as the performance measure of


