Large Vision-Language Models (LVLMs) tackle question answering over images and videos, but their attention over image tokens is inefficient: in deeper layers most image tokens receive very little attention, yet they still dominate the computational cost of inference. The FastV paper exploits this to prune up to 50% of image tokens after an early layer while maintaining performance comparable to the original model.
Paper link: https://arxiv.org/pdf/2403.06764.pdf
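As a rough illustration of the idea (a sketch, not the authors' code), the snippet below ranks image tokens by the attention they receive at a chosen decoder layer and keeps only the top fraction for the remaining layers. The function name prune_image_tokens, the tensor shapes, and the 50% keep ratio are assumptions made for the example.

import torch

def prune_image_tokens(hidden, attn, img_start, img_end, keep_ratio=0.5):
    """Rank image tokens by the attention they receive and keep the top fraction.

    hidden: (seq_len, dim) hidden states at the chosen layer
    attn:   (num_heads, seq_len, seq_len) attention weights at that layer
    image tokens occupy positions [img_start, img_end)
    """
    # Average attention each position receives, over heads and query positions.
    received = attn.mean(dim=0).mean(dim=0)          # (seq_len,)
    img_scores = received[img_start:img_end]         # scores for image tokens only

    num_keep = int((img_end - img_start) * keep_ratio)
    top_idx = img_scores.topk(num_keep).indices + img_start

    # Keep all non-image tokens (system prompt, text, etc.) plus the top image tokens,
    # in their original order.
    seq_len = hidden.shape[0]
    other_idx = torch.cat([torch.arange(0, img_start),
                           torch.arange(img_end, seq_len)])
    keep_idx, _ = torch.cat([other_idx, top_idx]).sort()

    # The pruned sequence is what the layers after the pruning point would process.
    return hidden[keep_idx], keep_idx

In the paper the pruning happens after an early decoder layer, so a call like this would sit between that layer and the next, and all subsequent layers would run on the shorter sequence.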
Table of Contents:
00:00 Intro
00:43 LVLMs network architecture
01:46 Attention analysis
06:00 FastV
08:48 Results
Icon made by Freepik from flaticon.com