Coherent Multi-Sentence Video Description with Variable Level of Detail

Title: Coherent Multi-Sentence Video Description with Variable Level of Detail
Publication Type: Conference Paper
Year of Publication: 2014
Authors: Rohrbach, A., Rohrbach, M., Qiu, W., Friedrich, A., Pinkal, M., & Schiele, B.
Other Numbers: 3748

Humans can easily describe what they see in a coherent way and at varying levels of detail. However, existing approaches for automatic video description focus on generating only single sentences and are not able to vary the descriptions' level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. To understand the difference between detailed and short descriptions, we collect and analyze a video description corpus with three levels of detail. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from it. For our multi-sentence descriptions we model across-sentence consistency at the level of the SR by enforcing a consistent topic. Human judges rate our descriptions as more readable, correct, and relevant than those of related work.


This work was partially supported by a postdoctoral fellowship funded by the Federal Ministry of Education and Research (BMBF) through the FITweltweit program, administered by the German Academic Exchange Service (DAAD).

Bibliographic Notes

Proceedings of the 36th German Conference on Pattern Recognition (GCPR 2014), Muenster, Germany. Reprinted in Pattern Recognition, Lecture Notes in Computer Science, Vol. 8753, pp. 184-195

Abbreviated Authors

A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele

ICSI Research Group


ICSI Publication Type

Article in conference proceedings