Sesame AI and RVQs - the network architecture behind VIRAL speech models

In this video, we explore the groundbreaking Sesame Conversational Speech Model, a powerful speech-to-speech AI that can talk expressively, respond intelligently, and interact naturally. You will learn the architecture of Sesame Speech AI, how the Mimi encoder tokenizes audio using split RVQ (Residual Vector Quantization), the role of semantic and acoustic codes in audio understanding, and a step-by-step breakdown of the autoregressive transformer backbone and audio decoder.

To support this channel, you can buy me a coffee at: https://ko-fi.com/neuralavb
Or join our Patreon to get access to the bonus content used in all my videos, including the slides, notebooks, code snippets, word docs, and animations that went into producing this video: https://www.patreon.com/NeuralBreakdownwithAVB

Visit the AI Agent Store page: https://aiagentstore.ai/?ref=avishek

References:
Sesame blog post and demo: https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice

Relevant papers:
Moshi: https://arxiv.org/abs/2410.00037
SoundStream: https://arxiv.org/abs/2107.03312
HuBERT: https://arxiv.org/abs/2106.07447
SpeechTokenizer: https://arxiv.org/abs/2308.16692

#pytorch #transformers #deepseek

Videos and playlists you might like:
Attention to Transformers playlist: https://www.youtube.com/playlist?list=PLGXWtN1HUjPfq0MSqD5dX8V7Gx5ow4QYW
Guide to fine-tuning open-source LLMs: https://youtu.be/bZcKYiwtw1I
Generative language modeling from scratch: https://youtu.be/s3OUzmUDdg8
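To give a flavor of the RVQ idea the video covers: residual vector quantization encodes a vector with a cascade of codebooks, where each stage quantizes the residual left over by the previous stage. The sketch below is a minimal NumPy toy with random codebooks, not Mimi's actual trained implementation; all function names and sizes here are illustrative assumptions.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization: each stage picks the
    codeword nearest to the remaining residual, then subtracts it."""
    residual = x.astype(float)
    codes = []
    for cb in codebooks:  # cb has shape (num_codewords, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every stage."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Toy setup: 3 quantizer stages, 8 codewords each, 4-dim vectors
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]
x = rng.normal(size=4)

codes = rvq_encode(x, codebooks)   # e.g. one small integer per stage
x_hat = rvq_decode(codes, codebooks)
```

In a trained codec the codebooks are learned so that early stages capture coarse structure and later stages refine it, which is why the first (semantic) codes and the later (acoustic) codes can play different roles.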