https://huggingface.co/papers/2505.04410
Here's a quick rundown:
• The Problem (0:25-3:27): Standard CLIP models have limitations on dense prediction tasks: they are tied to a fixed set of categories and struggle with fine-grained details.
• DeCLIP's Solution (3:56): DeCLIP decouples CLIP's features into content (local discriminability) and context (spatial consistency) features, refining each with its own guidance signal.
• How DeCLIP Works (12:09): It modifies CLIP's self-attention mechanism, creating separate streams for content and context features. The context stream is guided by a pre-trained vision foundation model (e.g., DINOv2) for spatial consistency, while the content stream is refined via self-distillation for local discriminability (see the sketch after this list).
• Results (19:20): DeCLIP achieves state-of-the-art results on open-vocabulary detection and segmentation benchmarks, with especially large gains on novel objects.
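For a concrete sense of the decoupling idea, here is a minimal PyTorch sketch. This is not the authors' implementation: the module names (`DecoupledAttention`, `decoupled_losses`), the q-q/k-k attention pairing, and the specific loss terms are assumptions chosen only to illustrate the two-stream structure and its two teachers (CLIP itself for the content stream, a frozen foundation model such as DINOv2 for the context stream).

```python
# Illustrative sketch of a "decoupled" CLIP attention block: one shared
# q/k/v projection feeds two streams, each distilled from a different teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoupledAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)       # shared q/k/v projection
        self.proj_content = nn.Linear(dim, dim)  # hypothetical content head
        self.proj_context = nn.Linear(dim, dim)  # hypothetical context head

    def _attend(self, q, k, v):
        # Standard multi-head attention over patch tokens of shape (B, N, dim).
        B, N, _ = q.shape
        shape = (B, N, self.num_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return out.transpose(1, 2).reshape(B, N, -1)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Content stream: q-q attention, aimed at local discriminability
        # (which stream uses which pairing is an assumption here).
        content = self.proj_content(self._attend(q, q, v))
        # Context stream: k-k attention, aimed at spatial consistency.
        context = self.proj_context(self._attend(k, k, v))
        return content, context


def decoupled_losses(content, context, clip_teacher_feats, vfm_feats):
    # Sketched training objective: the content stream is self-distilled from
    # CLIP's own features; the context stream is aligned to a frozen VFM
    # (e.g., DINOv2) feature map. Loss choices here are illustrative.
    loss_content = F.mse_loss(content, clip_teacher_feats)
    b, n, _ = context.shape
    loss_context = F.cosine_embedding_loss(
        context.flatten(0, 1), vfm_feats.flatten(0, 1), torch.ones(b * n))
    return loss_content + loss_context


# Shape check: 196 patch tokens from a ViT-B/16 at 224x224 input.
attn = DecoupledAttention(dim=768)
x = torch.randn(2, 196, 768)
content, context = attn(x)  # each (2, 196, 768)
```

The point of the two streams is that one set of tokens can stay close to CLIP's class-discriminative features (so open-vocabulary classification still works), while the other inherits the spatially coherent correlations of the foundation model that dense prediction needs.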