FastV: An Image is Worth 1/2 Tokens After Layer 2

Large Vision-Language Models (LVLMs) answer questions about images and videos, but they use attention inefficiently, which leads to high computational cost during inference. The FastV paper introduces a method that prunes up to 50% of image tokens while keeping performance comparable to the original model.

Paper link: https://arxiv.org/pdf/2403.06764.pdf

Table of Contents:
00:00 Intro
00:43 LVLMs network architecture
01:46 Attention analysis
06:00 FastV
08:48 Results

Icon made by Freepik from flaticon.com
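
To make the pruning idea concrete, here is a minimal sketch in plain Python. It assumes that image tokens are ranked by how much attention they receive at an early layer and that the lowest-ranked half is dropped. The function name `prune_image_tokens`, the `keep_ratio` parameter, and the toy attention scores are hypothetical and are not taken from the paper's code.

```python
def prune_image_tokens(attn_received, image_positions, keep_ratio=0.5):
    """Return the sorted positions of the image tokens to keep.

    attn_received[i] is an assumed per-token score: the average attention
    that token i receives at the chosen layer.
    image_positions lists the indices of the image tokens in the sequence.
    """
    n_keep = max(1, int(len(image_positions) * keep_ratio))
    # Rank image tokens from most to least attended.
    ranked = sorted(image_positions, key=lambda i: attn_received[i], reverse=True)
    # Keep the top-ranked tokens, restored to their original order.
    return sorted(ranked[:n_keep])


# Toy sequence: 2 system tokens, 4 image tokens, 2 text tokens.
attn = [0.30, 0.20, 0.02, 0.15, 0.01, 0.12, 0.10, 0.10]
image_positions = [2, 3, 4, 5]

kept = prune_image_tokens(attn, image_positions)
# Non-image tokens are never pruned; only unselected image tokens are dropped.
kept_sequence = [i for i in range(len(attn)) if i not in image_positions or i in kept]

print(kept)           # [3, 5]
print(kept_sequence)  # [0, 1, 3, 5, 6, 7]
```

With `keep_ratio=0.5`, half of the image tokens are removed, so every later layer processes a shorter sequence. In a real model the scores would come from the attention weights of the layer after which pruning is applied.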