Fine-tune Text to Speech Models in 2025: CSM-1B and Orpheus TTS

6,570 plays
📜 Get repo access at Trelis.com/ADVANCED-transcription
📧 Get the Trelis AI Newsletter: https://trelis.substack.com
❗️If you subscribed here, click the bell to be notified of new vids
🤝 Work for Trelis: https://trelis.com/jobs/
💡 Need Technical or Market Assistance? Book a Consult Here: https://forms.gle/wJXVZXwioKMktjyVA
💸 Starting a New Project/Venture? Apply for a Trelis Grant: https://trelis.com/trelis-ai-grants/

Video Links:
- Slides: https://docs.google.com/presentation/d/1N_05lO2rOu2dlEv10NrbJNjsCLV7-o6B7WUCu66BeFE/edit?usp=sharing
- One-click Runpod template (affiliate): https://www.runpod.io/console/deploy?template=ifyqsvjlzj
- Llama 3 Paper: https://arxiv.org/pdf/2407.21783
- StyleTTS2: https://arxiv.org/pdf/2306.07691
- Moshi: https://arxiv.org/pdf/2410.00037
- Orpheus: https://canopylabs.ai/model-releases
- Sesame’s CSM-1B: https://www.sesame.com/research/crossing_the_uncanny_valley_of_voic
- Colab Notebook - Orpheus Cloning: https://colab.research.google.com/drive/18efbyjnUI_WcmfPPYex4xff0DzixAwsy?usp=sharing
- Colab Notebook - Orpheus Inference: https://colab.research.google.com/drive/1W7t1YburdKrbOkLvNReAX0M9ogBThBWp?usp=sharing

TIMESTAMPS:
00:00 Introduction to End-to-End Audio + Text Models like GPT-4o and Llama 4 (?)
01:04 End-to-End Multimodal Models and Their Capabilities
02:36 Traditional Approaches to Text-to-Speech
03:06 Token-Based Approaches and Their Advantages
03:25 Detailed Look at Orpheus and CSM-1B Models
06:58 Training and Inference with Token-Based Models
12:53 Hierarchical Tokenization for High-Quality Audio
14:11 Kyutai’s Moshi Model for Text + Speech
23:41 Sesame’s CSM-1B Model Architecture
25:13 Orpheus TTS Architecture by Canopy Labs
27:34 Inferencing and Cloning with CSM-1B
40:13 Context-Aware Text to Speech with CSM-1B
48:21 Orpheus Inference and Cloning - FREE Colab
55:09 Orpheus Voice Cloning Setup
01:01:20 Orpheus Fine-tuning (Full Fine-tuning and LoRA Fine-tuning)
01:09:55 Running Full Fine-tuning
01:19:33 Running LoRA Fine-tuning
01:25:20 Inference and Comparison
01:29:27 Inference with Cloning AND Fine-tuning
01:35:48 The Future of Token-Based Multimodal Models

PS: I've rotated all HF access keys