Multimodal City-Identification on Flickr Videos Using Acoustic and Textual Features

Title: Multimodal City-Identification on Flickr Videos Using Acoustic and Textual Features
Publication Type: Technical Report
Year of Publication: 2012
Authors: Lei, H., Choi, J., & Friedland, G.
Other Numbers: 3301

We have performed city verification of videos based on their audio and metadata, using videos from the MediaEval Placing Task's video set, which contains consumer-produced videos “from the wild.” Eighteen cities were used as targets, for which acoustic and language models were trained, and against which test videos were scored. We have obtained the first known results for the city verification task, with a minimum equal error rate (EER) of 21.8 percent. This result is well above chance, even though the videos contain very few city-specific audio and metadata features. We have also demonstrated the complementarity of audio and metadata for this task.
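For readers unfamiliar with the EER metric, the following is a minimal sketch of how an equal error rate can be estimated from verification scores. It is not the report's implementation: the function name `equal_error_rate`, the threshold sweep, and the example scores are all assumptions made for illustration. A trial is accepted when its score is at or above the threshold, and the EER is approximated at the threshold where the miss rate and false-alarm rate are closest.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    target_scores: scores of trials where the video is from the claimed city.
    nontarget_scores: scores of trials where it is not.
    (Hypothetical helper for illustration; not taken from the report.)
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Miss: a target trial scored below the threshold (wrongly rejected).
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold.
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            # Report the midpoint of the two rates at the closest crossing.
            best_gap, eer = gap, (miss + false_alarm) / 2
    return eer


# Made-up scores, purely to exercise the function.
example_eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])
```

With a finite set of trials the two error rates rarely cross exactly, which is why the sketch reports the midpoint at the closest threshold rather than interpolating.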


This research is supported by NGA NURI grant number HM11582-10-1-0008, NSF EAGER grant IIS-1138599, and NSF Award CNS-1065240. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

Bibliographic Notes

ICSI Technical Report TR-12-007

Abbreviated Authors

H. Lei, J. Choi, and G. Friedland

ICSI Research Group

Audio and Multimedia

ICSI Publication Type

Technical Report