Vision-Language Models Do Not Understand Negation

This paper examines a significant limitation of current Vision-Language Models (VLMs) such as CLIP: they often fail to understand negation in natural-language queries. This is a problem for applications where users need to search for images or videos based on what is *not* present.

To evaluate this, the researchers introduce **NegBench**, a benchmark of over 79,000 examples spanning image, video, and medical datasets, with tasks such as retrieving images that match negated descriptions and answering multiple-choice questions with negated captions.

Their evaluation shows that **modern VLMs struggle significantly with negation**, frequently performing poorly and exhibiting an **"affirmation bias"**: they fail to distinguish affirmative statements from their negated counterparts. Even larger or newer models did not overcome this limitation.

As a potential solution, the study proposes a data-centric approach: **fine-tuning models on large synthetic datasets containing millions of negated captions** designed to teach models how to handle negation. This yields substantial improvements for CLIP-based models, including a 10% increase in recall on negated queries and a 28% boost in accuracy on multiple-choice questions with negated captions.

Paper: https://arxiv.org/pdf/2501.09425
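To illustrate the kind of affirmation-bias probe the paper describes, here is a minimal sketch (not the authors' evaluation code) that scores a single image against an affirmative caption and negated counterparts using an off-the-shelf CLIP model from Hugging Face. The model name, image file, and captions are illustrative assumptions.

```python
# Minimal sketch: probe a CLIP model for "affirmation bias" by comparing
# how it scores affirmative vs. negated captions for one image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image assumed to contain a dog and no cat.
image = Image.open("example.jpg")
captions = [
    "a photo of a dog",       # affirmative, true for this image
    "a photo without a dog",  # negated, false for this image
    "a photo with no cat",    # negated, true for this image
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores. A negation-blind model
# tends to rank "a photo without a dog" nearly as high as "a photo of a dog",
# because it keys on the mentioned object rather than the negation word.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```

Running a probe like this over many image-caption pairs (as NegBench does at scale, across retrieval and multiple-choice formats) is what exposes the systematic bias toward affirmative readings.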