Coherent Multi-Sentence Video Description with Variable Level of Detail
Title | Coherent Multi-Sentence Video Description with Variable Level of Detail |
Publication Type | Conference Paper |
Year of Publication | 2014 |
Authors | Rohrbach, A., Rohrbach, M., Qiu, W., Friedrich, A., Pinkal, M., & Schiele, B. |
Volume | 8753 |
Page(s) | 184-195 |
Other Numbers | 3748 |
Abstract | Humans can easily describe what they see in a coherent way and at varying level of detail. However, existing approaches for automatic video description focus on generating only single sentences and are not able to vary the descriptions' level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. To understand the difference between detailed and short descriptions, we collect and analyze a video description corpus of three levels of detail. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from it. For our multi-sentence descriptions we model across-sentence consistency at the level of the SR by enforcing a consistent topic. Human judges rate our descriptions as more readable, correct, and relevant than related work. |
Acknowledgment | This work was partially supported by a postdoctoral fellowship funded by the Federal Ministry of Education and Research (BMBF) through the FITweltweit program, administered by the German Academic Exchange Service (DAAD). |
URL | http://www.icsi.berkeley.edu/pubs/vision/coherentmulti14.pdf |
Bibliographic Notes | Proceedings of the 36th German Conference on Pattern Recognition (GCPR 2014), Muenster, Germany. Reprinted in Pattern Recognition, Lecture Notes in Computer Science, Vol. 8753, pp. 184-195 |
Abbreviated Authors | A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele |
ICSI Research Group | Vision |
ICSI Publication Type | Article in conference proceedings |