We will implement ViT (Vision Transformer) in PyTorch and train our implementation on the MNIST dataset to classify images! The video where I explain the ViT paper and the GitHub links are below ↓
Want to support the channel? Hit that like button and subscribe!
ViT (Vision Transformer) - An Image Is Worth 16x16 Words (Paper Explained)
https://www.youtube.com/watch?v=8phM16htKbU
GitHub Link of the Code
https://github.com/uygarkurt/ViT-PyTorch
Notebook
https://github.com/uygarkurt/ViT-PyTorch/blob/main/vit-implementation.ipynb
ViT (Vision Transformer) was introduced in the paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale":
https://arxiv.org/abs/2010.11929
What should I implement next? Let me know in the comments!
00:00:00 Introduction
00:00:09 Paper Overview
00:02:41 Imports and Hyperparameter Definitions
00:11:09 Patch Embedding Implementation
00:19:36 ViT Implementation
00:29:00 Dataset Preparation
00:51:16 Train Loop
01:09:27 Prediction Loop
01:12:05 Classifying Our Own Images
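The heart of the video is the patch embedding step: the image is cut into non-overlapping patches, each patch is linearly projected into a token, a class token is prepended, and positional embeddings are added. Below is a minimal sketch of that step for MNIST-sized inputs, assuming a 28x28 single-channel image, a patch size of 4, and an embedding dimension of 16 (illustrative values; the hyperparameters used in the video may differ).

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Sketch of a ViT patch embedding (illustrative hyperparameters)."""

    def __init__(self, in_channels=1, patch_size=4, embed_dim=16, img_size=28):
        super().__init__()
        # A strided convolution splits the image into non-overlapping
        # patches and linearly projects each one in a single step.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size // patch_size) ** 2
        # Learnable class token and positional embeddings, as in the paper.
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, embed_dim))

    def forward(self, x):
        x = self.proj(x)                   # (B, embed_dim, 7, 7)
        x = x.flatten(2).transpose(1, 2)   # (B, 49, embed_dim): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)     # prepend the class token
        return x + self.pos_embed          # add learnable positions

x = torch.randn(8, 1, 28, 28)              # a batch of MNIST-sized images
tokens = PatchEmbedding()(x)
print(tokens.shape)                        # torch.Size([8, 50, 16])
```

The resulting token sequence is what gets fed into a standard Transformer encoder; the classification head then reads out the class-token position.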
Buy me a coffee! ☕️
https://ko-fi.com/uygarkurt