In this video, we explore the groundbreaking Sesame Conversational Speech Model, a powerful speech-to-speech AI that can talk expressively, respond intelligently, and interact naturally. You will learn the architecture of the Sesame speech AI, how the Mimi encoder tokenizes audio using split RVQ (Residual Vector Quantization), and the role of semantic and acoustic codes in audio understanding, followed by a step-by-step breakdown of the autoregressive Transformer backbone and audio decoder.
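To make the tokenization idea concrete before you watch: below is a minimal PyTorch sketch of plain residual vector quantization (RVQ), where each stage quantizes the residual left over by the previous stage, so early codes are coarse and later codes add fine detail. This is illustrative only, not Sesame's or Mimi's actual code: the ResidualVQ class and the sizes (8 quantizers, 1024-entry codebooks, 256-dim latents) are assumptions made up for this example. Mimi's split RVQ additionally separates a dedicated semantic codebook from the acoustic residual stages, as explained in the video.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Minimal RVQ sketch (illustrative, not Mimi's real implementation):
    each stage quantizes the residual left by the previous stage."""
    def __init__(self, num_quantizers=8, codebook_size=1024, dim=256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)
        )

    def forward(self, x):
        # x: (batch, time, dim) latent frames from an audio encoder
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        for codebook in self.codebooks:
            # distance from each residual frame to every codebook entry
            entries = codebook.weight.expand(x.size(0), -1, -1)  # (B, K, dim)
            dists = torch.cdist(residual, entries)               # (B, T, K)
            idx = dists.argmin(dim=-1)                           # (B, T)
            chosen = codebook(idx)                               # (B, T, dim)
            # add the chosen entry, then quantize what is left over
            quantized = quantized + chosen
            residual = residual - chosen
            codes.append(idx)
        # per-stage code indices, coarse to fine: (B, T, num_quantizers)
        return quantized, torch.stack(codes, dim=-1)

rvq = ResidualVQ()
frames = torch.randn(2, 50, 256)   # e.g. 50 latent frames per clip
quantized, codes = rvq(frames)
print(codes.shape)                  # torch.Size([2, 50, 8])
```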
To support this channel, you can buy me a coffee at: https://ko-fi.com/neuralavb
Or, you can join my Patreon to get access to the bonus content used in all my videos. You will get the slides, notebooks, code snippets, Word docs, and animations that went into producing this video. Here is the link:
https://www.patreon.com/NeuralBreakdownwithAVB
Visit the AI Agent Store page: https://aiagentstore.ai/?ref=avishek
References:
Sesame Blogpost and Demo: https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
Relevant papers:
Moshi: https://arxiv.org/abs/2410.00037
SoundStream: https://arxiv.org/abs/2107.03312
HuBERT: https://arxiv.org/abs/2106.07447
SpeechTokenizer: https://arxiv.org/abs/2308.16692
#pytorch #transformers #deepseek
Videos and playlists you might like:
Attention to Transformers playlist: https://www.youtube.com/playlist?list=PLGXWtN1HUjPfq0MSqD5dX8V7Gx5ow4QYW
Guide to fine-tuning open-source LLMs:
https://youtu.be/bZcKYiwtw1I
Generative Language Modeling from scratch:
https://youtu.be/s3OUzmUDdg8