This video covers the Swin Transformer, a model trained for image classification that is also widely used as a backbone for other vision tasks, replacing ResNet or ViT. It currently serves as the backbone of SOTA object detection models like DINO.
This is another video in my "Modern Object Detection" series: https://www.youtube.com/playlist?list=PL1HdfW5-F8AQlPZCJBq2gNjERTDEAl8v3
Important links:
- Original paper: https://arxiv.org/pdf/2103.14030.pdf
- My previous video about ViT: https://youtu.be/NcbbPuRjMeE
00:00 - Intro
00:50 - Motivation, "Image Tokenization" Problem
08:14 - Hierarchical Patches Architecture
10:40 - Shifted Windows Attention
17:26 - Relative Positional Bias
21:58 - Results
26:00 - Next Up