Building awesome Speech To Text Transformers from scratch - One line of Pytorch at a time!

In this tutorial, we’ll walk you through building a Speech-to-Text (STT) audio transcription model from scratch using PyTorch. This isn’t just another use-a-library video: we code the full pipeline, including:

- 1D Convolutional Layers for raw waveform processing
- Transformer-style Self-Attention Layers
- Residual Vector Quantization (RVQ) for efficient representation
- CTC Loss for sequence alignment

Whether you're a beginner in deep learning for speech recognition or you're looking to understand the internals of models like wav2vec 2.0, this hands-on guide will help you build intuition and practical coding skills.

Buy me a coffee at https://ko-fi.com/neuralavb !
To join our Patreon, visit: https://www.patreon.com/NeuralBreakdownwithAVB
Members get access to EVERYTHING behind the scenes that goes into producing my videos. Plus, it supports the channel in a big way and helps to pay my bills.

Useful videos for a deeper dive:
- Speech to Speech Models (Sesame) - https://youtu.be/ThG9EBbMhP8
- Visualizing CNNs - https://youtu.be/kebSR2Ph7zg
- The Entire History of CNNs - https://youtu.be/N_PocrMHWbw
- Attention Zero to Hero Playlist - https://www.youtube.com/playlist?list=PLGXWtN1HUjPfq0MSqD5dX8V7Gx5ow4QYW
- Coding Attention from scratch - https://youtu.be/s3OUzmUDdg8

References to papers/articles:
- CTC Loss - https://distill.pub/2017/ctc/

Timestamps:
0:00 - Intro
0:36 - What audio datasets look like
4:30 - Tokenizing text
9:34 - Data Preprocessing
11:38 - MFCCs and Encoder-Decoder networks
14:20 - Network Architecture
17:59 - Coding the Convolutional Block
26:40 - Coding attention and Transformers
30:20 - Residual Vector Quantizers
32:57 - Coding RVQs
37:44 - Optimizing RVQs
43:50 - Putting it together
48:50 - Connectionist Temporal Classification (CTC) Loss
50:53 - Training!

#speechtotext #pytorch #deeplearning #neuralnetworks
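
For readers who want a feel for the "Tokenizing text" and CTC steps before watching, here is a minimal sketch in plain Python. It is not the code from the video: the character vocabulary, the choice of index 0 for the CTC blank, and the helper names `encode`, `ctc_greedy_decode`, and `one_hot_frames` are all assumptions made for illustration. It shows how text could be mapped to token ids, and how per-frame model scores could be turned back into text by taking the best token per frame, merging repeats, and dropping blanks.

```python
# Hypothetical character vocabulary (an assumption, not the video's tokenizer).
# Index 0 is reserved for the CTC blank symbol.
BLANK = 0
VOCAB = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz '")
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}


def encode(text):
    """Map text to token ids, skipping characters outside the vocabulary."""
    return [CHAR_TO_ID[ch] for ch in text.lower() if ch in CHAR_TO_ID]


def ctc_greedy_decode(frame_scores):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(f)), key=f.__getitem__) for f in frame_scores]
    out, prev = [], None
    for tok in best:
        # A token is emitted only when it differs from the previous frame
        # and is not the blank; a blank between two equal tokens keeps both.
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return "".join(VOCAB[t] for t in out)


def one_hot_frames(ids):
    """Build fake per-frame scores that put all the mass on the given ids."""
    return [[1.0 if i == t else 0.0 for i in range(len(VOCAB))] for t in ids]


# Frames spelling: h h _ e l l _ l o o  (where _ is the blank)
frames = one_hot_frames([8, 8, 0, 5, 12, 12, 0, 12, 15, 15])
print(ctc_greedy_decode(frames))  # hello
```

The blank between the two runs of `l` is what lets the decoder keep the double letter in "hello"; without it, the repeated frames would collapse into a single `l`. In a real model the frame scores would come from the network's output layer rather than from `one_hot_frames`.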