Vision Transformer Quick Guide - Theory and Code in (almost) 15 min

▬▬ Papers / Resources ▬▬▬▬▬▬▬▬▬▬
Colab Notebook: https://colab.research.google.com/drive/1P9TPRWsDdqJC6IvOxjG2_3QlgCt59P0w?usp=sharing
ViT paper: https://arxiv.org/abs/2010.11929
Best Transformer intro: https://jalammar.github.io/illustrated-transformer/
CNNs vs. ViT: https://arxiv.org/abs/2108.08810
CNNs vs. ViT blog: https://towardsdatascience.com/do-vision-transformers-see-like-convolutional-neural-networks-paper-explained-91b4bd5185c8
Swin Transformer: https://arxiv.org/abs/2103.14030
DeiT: https://arxiv.org/abs/2012.12877

▬▬ Support me if you like 🌟
►Link to this channel: https://bit.ly/3zEqL1W
►Support me on Patreon: https://bit.ly/2Wed242
►Buy me a coffee on Ko-Fi: https://bit.ly/3kJYEdl
►E-Mail: [email protected]

▬▬ Used Music ▬▬▬▬▬▬▬▬▬▬▬
Music from #Uppbeat (free for creators!): https://uppbeat.io/t/92elm/jasmine
License code: SMTWRWLNGHZHH0OC

▬▬ Used Icons ▬▬▬▬▬▬▬▬▬▬
All icons are from Flaticon: https://www.flaticon.com/authors/freepik

▬▬ Timestamps ▬▬▬▬▬▬▬▬▬▬▬
00:00 Introduction
00:16 ViT Intro
01:12 Input embeddings
01:50 Image patching
02:54 Einops reshaping
04:13 [CODE] Patching
05:35 CLS Token
06:40 Positional Embeddings
08:09 Transformer Encoder
08:30 Multi-head attention
08:50 [CODE] Multi-head attention
09:12 Layer Norm
09:30 [CODE] Layer Norm
09:55 Feed Forward Head
10:05 [CODE] Feed Forward Head
10:21 Residuals
10:45 [CODE] final ViT
13:10 CNN vs. ViT
14:45 ViT Variants

▬▬ My equipment 💻
- Microphone: https://amzn.to/3DVqB8H
- Microphone mount: https://amzn.to/3BWUcOJ
- Monitors: https://amzn.to/3G2Jjgr
- Monitor mount: https://amzn.to/3AWGIAY
- Height-adjustable table: https://amzn.to/3aUysXC
- Ergonomic chair: https://amzn.to/3phQg7r
- PC case: https://amzn.to/3jdlI2Y
- GPU: https://amzn.to/3AWyzwy
- Keyboard: https://amzn.to/2XskWHP
- Bluelight filter glasses: https://amzn.to/3pj0fK2
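The image-patching step covered at 01:50–04:13 can be sketched as follows. This is not code from the video's Colab (which uses PyTorch and einops); it is a minimal NumPy equivalent of the einops pattern `rearrange(x, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch, p2=patch)`, with assumed ViT-Base sizes (224×224 images, 16×16 patches):

```python
import numpy as np

def image_to_patches(images, patch=16):
    """Split a batch of images (B, C, H, W) into flattened patches
    of shape (B, num_patches, patch*patch*C)."""
    b, c, h, w = images.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    # (B, C, H//p, p, W//p, p): carve the spatial grid into patch blocks
    x = images.reshape(b, c, h // patch, patch, w // patch, patch)
    # -> (B, H//p, W//p, p, p, C): group the grid dims and the patch dims
    x = x.transpose(0, 2, 4, 3, 5, 1)
    # flatten the grid into a token sequence and each patch into a vector
    return x.reshape(b, (h // patch) * (w // patch), patch * patch * c)

x = np.zeros((2, 3, 224, 224))
print(image_to_patches(x).shape)  # (2, 196, 768): 14*14 patches, 16*16*3 values each
```

Each 768-dimensional patch vector is then mapped by a learned linear projection to the model's embedding dimension.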
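The CLS token and positional embeddings (05:35–08:09) combine with the patch embeddings as sketched below. This is an illustrative NumPy sketch, not the video's code: the CLS token and positional embeddings are shown as random arrays here, whereas in a real ViT they are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, num_patches, dim = 2, 196, 768

# Stand-in for the projected patch embeddings (learned in a real model)
patch_tokens = rng.standard_normal((batch, num_patches, dim))

# [CLS] token: one learned vector, broadcast across the batch and
# prepended so its final state can serve as the classification summary
cls_token = rng.standard_normal((1, 1, dim))
tokens = np.concatenate(
    [np.broadcast_to(cls_token, (batch, 1, dim)), patch_tokens], axis=1
)

# Learned positional embeddings, one per token (including [CLS]),
# added element-wise so the encoder can recover patch order
pos_embed = rng.standard_normal((1, num_patches + 1, dim))
tokens = tokens + pos_embed

print(tokens.shape)  # (2, 197, 768)
```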
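The attention inside the Transformer encoder (08:30–09:12) boils down to scaled dot-product attention, softmax(QK^T / sqrt(d)) V; multi-head attention runs several of these in parallel on projected slices of the embeddings. A single-head NumPy sketch (shapes are illustrative, not taken from the video):

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (B, T, T) similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v                               # weighted sum of values

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 197, 64))  # (batch, tokens incl. [CLS], head_dim)
out = attention(x, x, x)               # self-attention: Q = K = V come from x
print(out.shape)  # (2, 197, 64)
```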