Gedas Bertasius - Video Understanding with Modern Language Models
March 30, 2021, MIT CSAIL
Abstract:
Humans understand the world by processing signals from both vision and language. Similarly, we believe that language can be useful for developing better video understanding systems. In this talk, I will present several video understanding frameworks that incorporate models from the language domain.
First, I will introduce TimeSformer, the first convolution-free architecture for video modeling built exclusively with self-attention. It achieves the best reported numbers on major action recognition benchmarks, and it is also more efficient than state-of-the-art 3D CNNs. Afterwards, I will present COBE, a new large-scale framework for learning contextualized object representations in settings involving human-object interactions. Our approach exploits automatically-transcribed speech narrations from instructional YouTube videos and does not require manual annotations. Lastly, I will introduce Vx2Text, a multi-modal framework for video-based text generation, which outperforms the state of the art on three video-based text-generation tasks: captioning, question answering, and dialog.
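The abstract does not spell out TimeSformer's attention scheme; the published paper's best-performing variant factorizes each block into a temporal attention step followed by a spatial one ("divided space-time attention"). Below is a minimal PyTorch sketch of that idea, not the authors' reference implementation: the class name, module layout, and shapes are illustrative assumptions.

```python
# Illustrative sketch of divided space-time attention (assumed layout,
# not the official TimeSformer code).
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """One transformer block that attends over time, then over space."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.attn_s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (batch, frames, patches, dim) -- patch embeddings of a clip.
        b, t, p, d = x.shape
        # Temporal attention: each spatial patch attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt_n = self.norm_t(xt)
        xt = xt + self.attn_t(xt_n, xt_n, xt_n)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        # Spatial attention: patches within each frame attend to one another.
        xs = x.reshape(b * t, p, d)
        xs_n = self.norm_s(xs)
        xs = xs + self.attn_s(xs_n, xs_n, xs_n)[0]
        x = xs.reshape(b, t, p, d)
        # Standard transformer MLP with residual connection.
        return x + self.mlp(self.norm_mlp(x))

# Example: 2 clips of 8 frames, each split into 196 patches (14x14).
block = DividedSpaceTimeBlock(dim=768, heads=12)
out = block(torch.randn(2, 8, 196, 768))
print(out.shape)  # torch.Size([2, 8, 196, 768])
```

Factorizing attention this way is what makes the model cheaper than joint space-time attention: cost scales with frames plus patches per step rather than their product.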
Bio:
Gedas Bertasius is a postdoctoral researcher at Facebook AI working on computer vision and machine learning problems. His current research focuses on video understanding, first-person vision, and multi-modal deep learning. He received his bachelor's degree in Computer Science from Dartmouth College and a Ph.D. in Computer Science from the University of Pennsylvania. His recent work was nominated for the CVPR 2020 Best Paper Award.