In this tutorial, we’ll walk you through building a Speech-to-Text (STT) audio transcription model from scratch using PyTorch. This isn’t just another use-a-library video. We code the full pipeline ourselves, including:
- 1D Convolutional Layers for raw waveform processing
- Transformer-style Self-Attention Layers
- Residual Vector Quantization (RVQ) for efficient representation
- CTC Loss for sequence alignment
Whether you're a beginner in deep learning for speech recognition or you're looking to understand the internals of models like wav2vec 2.0, this hands-on guide will help you build intuition and practical coding skills.
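As a taste of what the video builds, here is a hypothetical minimal sketch of the four listed components wired together in PyTorch. Layer sizes, class names, and the vocabulary size are illustrative assumptions, not the exact architecture coded in the video:

```python
import torch
import torch.nn as nn

class TinyRVQ(nn.Module):
    """Illustrative residual vector quantizer: each stage quantizes the
    residual left over by the previous stage's nearest codebook vector."""
    def __init__(self, d_model=128, codebook_size=64, n_stages=2):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, d_model) for _ in range(n_stages))

    def forward(self, h):                          # h: (batch, time, d_model)
        residual, quantized = h, torch.zeros_like(h)
        for cb in self.codebooks:
            dists = torch.cdist(residual, cb.weight.unsqueeze(0))
            q = cb(dists.argmin(-1))               # nearest entry per frame
            quantized = quantized + q
            residual = residual - q
        # straight-through estimator: gradients bypass the argmin
        return h + (quantized - h).detach()

class TinySTT(nn.Module):
    """Conv downsampling -> self-attention -> RVQ -> CTC-ready log-probs."""
    def __init__(self, n_mels=80, d_model=128, vocab_size=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.GELU())
        self.attn = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.rvq = TinyRVQ(d_model)
        self.head = nn.Linear(d_model, vocab_size)  # index 0 = CTC blank

    def forward(self, x):                      # x: (batch, n_mels, time)
        h = self.conv(x).transpose(1, 2)       # (batch, time/2, d_model)
        h = self.rvq(self.attn(h))
        return self.head(h).log_softmax(-1)    # CTC loss wants log-probs

model = TinySTT()
x = torch.randn(2, 80, 100)                    # two dummy 100-frame clips
log_probs = model(x).transpose(0, 1)           # CTC expects (time, batch, vocab)
targets = torch.randint(1, 32, (2, 12))        # dummy token ids (0 reserved)
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.full((2,), 50),
                           target_lengths=torch.full((2,), 12))
```

CTC loss lets the model align 50 audio frames against a 12-token transcript without frame-level labels, which is why the head reserves token 0 as the blank symbol.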
Buy me a coffee at https://ko-fi.com/neuralavb !
To join our Patreon, visit: https://www.patreon.com/NeuralBreakdownwithAVB
Members get access to EVERYTHING behind-the-scenes that goes into producing my videos. Plus, it supports the channel in a big way and helps pay my bills.
Useful videos for a deeper dive:
Speech to Speech Models (Sesame): https://youtu.be/ThG9EBbMhP8
Visualizing CNNs: https://youtu.be/kebSR2Ph7zg
The Entire History of CNNs: https://youtu.be/N_PocrMHWbw
Attention Zero to Hero Playlist: https://www.youtube.com/playlist?list=PLGXWtN1HUjPfq0MSqD5dX8V7Gx5ow4QYW
Coding Attention from scratch: https://youtu.be/s3OUzmUDdg8
References to papers/articles:
CTC Loss: https://distill.pub/2017/ctc/
Timestamps:
0:00 - Intro
0:36 - What audio datasets look like
4:30 - Tokenizing text
9:34 - Data Preprocessing
11:38 - MFCCs and Encoder-Decoder networks
14:20 - Network Architecture
17:59 - Coding the Convolutional Block
26:40 - Coding attention and Transformers
30:20 - Residual Vector Quantizers
32:57 - Coding RVQs
37:44 - Optimizing RVQs
43:50 - Putting it together
48:50 - Connectionist Temporal Classification (CTC) Loss
50:53 - Training!
#speechtotext
#pytorch
#deeplearning
#neuralnetworks