Generating Natural-Language Video Descriptions Using Text-Mined Knowledge

Title: Generating Natural-Language Video Descriptions Using Text-Mined Knowledge
Publication Type: Conference Paper
Year of Publication: 2013
Authors: Krishnamoorthy, N., Malkarnenkar, G., Mooney, R., Saenko, K., & Guadarrama, S.
Other Numbers: 3445

We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text corpora, enhances the triplet selection algorithm by providing it contextual information and leads to a four-fold increase in activity identification. Unlike previous methods, our approach can annotate arbitrary videos without requiring the expensive collection and annotation of a similar training video corpus. We evaluate our technique against a baseline that does not use text-mined knowledge and show that humans prefer our descriptions 61 percent of the time.
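The triplet selection step described in the abstract can be illustrated with a minimal sketch. It is not the paper's actual formulation: it assumes that detector confidences and a text-mined plausibility score are combined as a weighted sum of log scores, and the function name `select_triplet`, the `weight` parameter, and all numeric values are hypothetical, chosen only for illustration.

```python
from itertools import product
from math import log


def select_triplet(subjects, verbs, objects, text_score, weight=0.5):
    """Return the (subject, verb, object) triplet with the highest combined score.

    subjects, verbs, objects: dicts mapping a label to a detector confidence in (0, 1].
    text_score: callable giving a text-mined plausibility in (0, 1] for a triplet.
    weight: hypothetical mixing weight between vision and text evidence.
    """
    best, best_score = None, float("-inf")
    for s, v, o in product(subjects, verbs, objects):
        # Vision evidence: sum of log detector confidences.
        vision = log(subjects[s]) + log(verbs[v]) + log(objects[o])
        # Language evidence: log plausibility of the triplet in text.
        language = log(text_score(s, v, o))
        score = (1 - weight) * vision + weight * language
        if score > best_score:
            best, best_score = (s, v, o), score
    return best


# Toy detector confidences (made up for illustration, not results from the paper).
subjects = {"person": 0.9, "dog": 0.4}
verbs = {"ride": 0.5, "walk": 0.55}
objects = {"motorbike": 0.8, "car": 0.3}

# Toy text-mined plausibilities; unseen triplets get a small default value.
plausibility = {
    ("person", "ride", "motorbike"): 0.6,
    ("person", "walk", "motorbike"): 0.05,
}


def text_score(s, v, o):
    return plausibility.get((s, v, o), 0.01)


# Baseline that ignores text knowledge versus selection that uses it.
vision_only = select_triplet(subjects, verbs, objects, lambda s, v, o: 1.0, weight=0.0)
with_text = select_triplet(subjects, verbs, objects, text_score)
print(vision_only)
print(with_text)
```

In this toy setup the vision-only baseline picks the verb with the slightly higher detector confidence, while the text-informed score prefers the triplet that is more plausible in text. This mirrors, in a very simplified way, the role the abstract attributes to contextual knowledge.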


This work was partially supported by funding provided to ICSI by the U.S. Defense Advanced Research Projects Agency (DARPA). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors or originators and do not necessarily reflect the views of DARPA or of the U.S. Government.

Bibliographic Notes

Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI-13), Bellevue, Washington

Abbreviated Authors

N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko, and S. Guadarrama

ICSI Research Group


ICSI Publication Type

Article in conference proceedings