Publication Details
Title: City-Identification on Flickr Videos Using Acoustic Features
Author: H. Lei, J. Choi, and G. Friedland
Group: ICSI Technical Reports
Date: April 2011
PDF: http://www.icsi.berkeley.edu/pubs/techreports/TR-11-001.pdf
Overview:
This article presents an approach that uses audio to identify the city of origin of consumer-produced videos, a task that is difficult even for humans. Using a subset of the MediaEval Placing Task's Flickr video set, we conducted an experiment with a setup similar to a typical NIST speaker recognition evaluation run. Our assumption is that audio from the same city may match in various ways, e.g., language, typical environmental acoustics, etc., without any single feature being absolutely indicative. Using the NIST speaker recognition framework, a set of 18 cities across the world serves as targets, and Gaussian Mixture Models are trained for all targets. Audio from each video in a test set is scored against each of the targets, yielding a set of scores over pairs of test files and target city models. The Equal Error Rate (EER), obtained at the scoring threshold where the number of false alarms equals the number of misses, is used as the performance measure of our system. We obtain an EER of 32.3% on a test set with no users in common with the training set, and a minimum EER of 22.1% on a test set with users in common with the training set. The experiments show the feasibility of using implicit audio cues (as opposed to building explicit detectors for individual cues) for location estimation of consumer-produced "from-the-wild" videos. Since audio is likely complementary to other modalities useful for the task, such as video or metadata, the presented results can be combined with results from other modalities.
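The EER metric described above can be sketched in a few lines of code. This is a minimal illustration of the general definition (the threshold at which the false-alarm rate equals the miss rate), not the authors' evaluation code; the score values below are hypothetical.

```python
def compute_eer(target_scores, nontarget_scores):
    """Equal Error Rate: sweep a decision threshold over the observed
    scores and return the operating point where the miss rate and the
    false-alarm rate are (approximately) equal."""
    best_gap = float("inf")
    best = (1.0, 1.0)
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # Miss: a true-city trial scoring below the threshold.
        miss = sum(s < thr for s in target_scores) / len(target_scores)
        # False alarm: a wrong-city trial scoring at or above it.
        fa = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        if abs(miss - fa) < best_gap:
            best_gap = abs(miss - fa)
            best = (miss, fa)
    return (best[0] + best[1]) / 2

# Hypothetical scores: higher means "more likely from the target city".
print(compute_eer([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))  # → 0.25
```

In the paper's setup, each (test file, city model) pair contributes one score; pairs whose file truly originates from the modeled city form the target trials, and all other pairs form the non-target trials.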
Acknowledgements:
This work was partially supported by funding provided through the National Geospatial-Intelligence Agency University Research Initiatives program (NGA NURI). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of NGA.
Bibliographic Information:
ICSI Technical Report TR-11-001
Bibliographic Reference:
H. Lei, J. Choi, and G. Friedland. City-Identification on Flickr Videos Using Acoustic Features. ICSI Technical Report TR-11-001, April 2011
