Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (paper illustrated)

Paper Abstract: This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.

Paper Link: https://arxiv.org/pdf/2103.14030.pdf
Official Code: https://github.com/microsoft/Swin-Transformer

Video Outline:
0:00 - Introduction
0:50 - Backbone for vision tasks
2:21 - Swin Transformer - Architecture
5:31 - Multi-Headed Self Attention (MSA)
6:51 - Swin Transformer Block
7:15 - Shifted Windows
8:40 - Swin Architecture and variants
9:45 - Results

**AI Bites**
YouTube: https://www.youtube.com/c/AIBites
Twitter: https://twitter.com/ai_bites
Patreon: https://www.patreon.com/ai_bites
Github: https://github.com/ai-bites

Related Videos:
Vision Transformers (ViT): https://youtu.be/3B6q4xnuFUE
Data Efficient Image Transformer (DeiT): https://youtu.be/HobIo2oT0xY

📚 📚 📚 BOOKS I HAVE READ, REFER AND RECOMMEND 📚 📚 📚
📖 Deep Learning by Ian Goodfellow - https://amzn.to/3Wnyixv
📙 Pattern Recognition and Machine Learning by Christopher M. Bishop - https://amzn.to/3ZVnQQA
📗 Machine Learning: A Probabilistic Perspective by Kevin Murphy - https://amzn.to/3kAqThb
📘 Multiple View Geometry in Computer Vision by R Hartley and A Zisserman - https://amzn.to/3XKVOWi

Music: https://www.bensound.com
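To make the shifted-window scheme from the abstract concrete, here is a minimal sketch in PyTorch. It is not the official implementation from the repository linked above: helper names such as window_partition, window_reverse, and swin_block are illustrative assumptions, and the attention mask that keeps tokens from different regions apart after the cyclic shift is omitted for brevity. The idea it shows is the one from the paper: partition the feature map into non-overlapping windows, run multi-head self-attention independently inside each window, and use a cyclic shift (torch.roll) to realize the shifted-window step.

```python
# Minimal sketch (assumed names, mask omitted) of window-based self-attention
# with a cyclic shift, in the spirit of Swin Transformer's W-MSA / SW-MSA.
import torch
import torch.nn as nn


def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into (num_windows*B, window_size**2, C)."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)


def window_reverse(windows, window_size, H, W):
    """Inverse of window_partition: back to (B, H, W, C)."""
    B = windows.shape[0] // ((H // window_size) * (W // window_size))
    x = windows.reshape(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class WindowAttention(nn.Module):
    """Standard multi-head self-attention applied independently inside each window."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x_windows):  # (num_windows*B, window_size**2, C)
        out, _ = self.attn(x_windows, x_windows, x_windows)
        return out


def swin_block(x, attn, window_size, shift):
    """One attention step: optionally cyclic-shift the map, attend within
    windows, then shift back. shift=0 gives regular windows (W-MSA),
    shift>0 gives shifted windows (SW-MSA)."""
    if shift > 0:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    B, H, W, C = x.shape
    windows = window_partition(x, window_size)
    windows = attn(windows)
    x = window_reverse(windows, window_size, H, W)
    if shift > 0:
        x = torch.roll(x, shifts=(shift, shift), dims=(1, 2))
    return x


if __name__ == "__main__":
    x = torch.randn(1, 56, 56, 96)                   # stage-1 sized feature map
    attn = WindowAttention(dim=96, num_heads=3)
    y = swin_block(x, attn, window_size=7, shift=0)  # regular windows
    y = swin_block(y, attn, window_size=7, shift=3)  # shifted windows
    print(y.shape)                                   # torch.Size([1, 56, 56, 96])
```

Because attention never crosses window boundaries within a single block, the cost grows with the number of windows rather than quadratically with the number of tokens, which is where the linear complexity in image size comes from; alternating regular and shifted blocks is what restores the cross-window connections.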