In this tutorial, we’ll walk you through building a Speech-to-Text (STT) audio transcription model from scratch using PyTorch. This isn’t just another use-a-library video. We code the full pipeline ourselves, including:
- 1D Convolutional Layers for raw waveform processing
- Transformer-style Self-Attention Layers
- Residual Vector Quantization (RVQ) for efficient representation
- CTC Loss for sequence alignment
Whether you're a beginner in deep learning for speech recognition or you're looking to understand the internals of models like wav2vec 2.0, this hands-on guide will help you build intuition and practical coding skills.
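As a taste of what the video builds, here is a hypothetical minimal sketch of the four listed components wired together in PyTorch. Layer sizes, class names, and the vocabulary size are illustrative assumptions, not the exact architecture coded in the video:

```python
import torch
import torch.nn as nn

class TinyRVQ(nn.Module):
    """Illustrative residual vector quantizer: each stage quantizes the
    residual left over by the previous stage's nearest codebook vector."""
    def __init__(self, d_model=128, codebook_size=64, n_stages=2):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, d_model) for _ in range(n_stages))

    def forward(self, h):                          # h: (batch, time, d_model)
        residual, quantized = h, torch.zeros_like(h)
        for cb in self.codebooks:
            dists = torch.cdist(residual, cb.weight.unsqueeze(0))
            q = cb(dists.argmin(-1))               # nearest entry per frame
            quantized = quantized + q
            residual = residual - q
        # straight-through estimator: gradients bypass the argmin
        return h + (quantized - h).detach()

class TinySTT(nn.Module):
    """Conv downsampling -> self-attention -> RVQ -> CTC-ready log-probs."""
    def __init__(self, n_mels=80, d_model=128, vocab_size=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.GELU())
        self.attn = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.rvq = TinyRVQ(d_model)
        self.head = nn.Linear(d_model, vocab_size)  # index 0 = CTC blank

    def forward(self, x):                      # x: (batch, n_mels, time)
        h = self.conv(x).transpose(1, 2)       # (batch, time/2, d_model)
        h = self.rvq(self.attn(h))
        return self.head(h).log_softmax(-1)    # CTC loss wants log-probs

model = TinySTT()
x = torch.randn(2, 80, 100)                    # two dummy 100-frame clips
log_probs = model(x).transpose(0, 1)           # CTC expects (time, batch, vocab)
targets = torch.randint(1, 32, (2, 12))        # dummy token ids (0 reserved)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 50),
                           target_lengths=torch.full((2,), 12))
```

CTC loss lets the model align 50 audio frames against a 12-token transcript without frame-level labels, which is why the head reserves token 0 as the blank symbol.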
Buy me a coffee at https://ko-fi.com/neuralavb !
To join our Patreon, visit: https://www.patreon.com/NeuralBreakdownwithAVB
Members get access to EVERYTHING behind-the-scenes that goes into producing my videos. Plus, it supports the channel in a big way and helps pay my bills.
Useful videos for a deeper dive:
Speech to Speech Models (Sesame): https://youtu.be/ThG9EBbMhP8
Visualizing CNNs: https://youtu.be/kebSR2Ph7zg
The Entire History of CNNs: https://youtu.be/N_PocrMHWbw
Attention Zero to Hero Playlist: https://www.youtube.com/playlist?list=PLGXWtN1HUjPfq0MSqD5dX8V7Gx5ow4QYW
Coding Attention from scratch: https://youtu.be/s3OUzmUDdg8
References to papers/articles:
CTC Loss: https://distill.pub/2017/ctc/
Timestamps:
0:00 - Intro
0:36 - What audio datasets look like
4:30 - Tokenizing text
9:34 - Data Preprocessing
11:38 - MFCCs and Encoder-Decoder networks
14:20 - Network Architecture
17:59 - Coding the Convolutional Block
26:40 - Coding attention and Transformers
30:20 - Residual Vector Quantizers
32:57 - Coding RVQs
37:44 - Optimizing RVQs
43:50 - Putting it together
48:50 - Connectionist Temporal Classification (CTC) Loss
50:53 - Training!
#speechtotext
#pytorch
#deeplearning
#neuralnetworks