In this video, we explore the groundbreaking Sesame Conversational Speech Model, a powerful speech-to-speech AI that can talk expressively, respond intelligently, and interact naturally. You will learn the architecture of the Sesame speech AI, how the Mimi encoder tokenizes audio using split RVQ (Residual Vector Quantization), and the role of semantic and acoustic codes in audio understanding, followed by a step-by-step breakdown of the autoregressive Transformer backbone and audio decoder.
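To make the tokenization idea concrete before you watch: below is a minimal PyTorch sketch of plain residual vector quantization (RVQ), where each stage quantizes the residual left over by the previous stage, so early codes are coarse and later codes add fine detail. This is illustrative only, not Sesame's or Mimi's actual code: the ResidualVQ class and the sizes (8 quantizers, 1024-entry codebooks, 256-dim latents) are assumptions made up for this example. Mimi's split RVQ additionally separates a dedicated semantic codebook from the acoustic residual stages, as explained in the video.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Minimal RVQ sketch (illustrative, not Mimi's real implementation):
    each stage quantizes the residual left by the previous stage."""
    def __init__(self, num_quantizers=8, codebook_size=1024, dim=256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)
        )

    def forward(self, x):
        # x: (batch, time, dim) latent frames from an audio encoder
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        for codebook in self.codebooks:
            # distance from each residual frame to every codebook entry
            entries = codebook.weight.expand(x.size(0), -1, -1)  # (B, K, dim)
            dists = torch.cdist(residual, entries)               # (B, T, K)
            idx = dists.argmin(dim=-1)                           # (B, T)
            chosen = codebook(idx)                               # (B, T, dim)
            # add the chosen entry, then quantize what is left over
            quantized = quantized + chosen
            residual = residual - chosen
            codes.append(idx)
        # per-stage code indices, coarse to fine: (B, T, num_quantizers)
        return quantized, torch.stack(codes, dim=-1)

rvq = ResidualVQ()
frames = torch.randn(2, 50, 256)   # e.g. 50 latent frames per clip
quantized, codes = rvq(frames)
print(codes.shape)                  # torch.Size([2, 50, 8])
```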
To support this channel, you can buy me a coffee at: https://ko-fi.com/neuralavb
Or, you can join my Patreon to get access to the bonus content used in all my videos. You will get the slides, notebooks, code snippets, Word docs, and animations that went into producing this video. Here is the link:
https://www.patreon.com/NeuralBreakdownwithAVB
Visit the AI Agent Store page: https://aiagentstore.ai/?ref=avishek
References:
Sesame Blogpost and Demo: https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
Relevant papers:
Moshi: https://arxiv.org/abs/2410.00037
SoundStream: https://arxiv.org/abs/2107.03312
HuBERT: https://arxiv.org/abs/2106.07447
SpeechTokenizer: https://arxiv.org/abs/2308.16692
#pytorch #transformers #deepseek
Videos and playlists you might like:
Attention to Transformers playlist: https://www.youtube.com/playlist?list=PLGXWtN1HUjPfq0MSqD5dX8V7Gx5ow4QYW
Guide to fine-tuning open-source LLMs:
https://youtu.be/bZcKYiwtw1I
Generative Language Modeling from scratch:
https://youtu.be/s3OUzmUDdg8