📜 Get repo access at Trelis.com/ADVANCED-transcription
📧 Get the Trelis AI Newsletter: https://trelis.substack.com
❗️If you subscribed here, click the bell to be notified of new vids
🤝 Work for Trelis: https://trelis.com/jobs/
💡 Need Technical or Market Assistance?
Book a Consult Here: https://forms.gle/wJXVZXwioKMktjyVA
💸 Starting a New Project/Venture?
Apply for a Trelis Grant: https://trelis.com/trelis-ai-grants/
Video Links:
- Slides: https://docs.google.com/presentation/d/1N_05lO2rOu2dlEv10NrbJNjsCLV7-o6B7WUCu66BeFE/edit?usp=sharing
- One-click Runpod template (affiliate): https://www.runpod.io/console/deploy?template=ifyqsvjlzj
- Llama 3 Paper: https://arxiv.org/pdf/2407.21783
- StyleTTS2: https://arxiv.org/pdf/2306.07691
- Moshi: https://arxiv.org/pdf/2410.00037
- Orpheus: https://canopylabs.ai/model-releases
- Sesame’s CSM-1B: https://www.sesame.com/research/crossing_the_uncanny_valley_of_voic
- Colab Notebook - Orpheus Cloning: https://colab.research.google.com/drive/18efbyjnUI_WcmfPPYex4xff0DzixAwsy?usp=sharing
- Colab Notebook - Orpheus Inference: https://colab.research.google.com/drive/1W7t1YburdKrbOkLvNReAX0M9ogBThBWp?usp=sharing
TIMESTAMPS:
00:00 Introduction to End-to-End Audio + Text Models like GPT-4o and Llama 4 (?)
01:04 End-to-End Multimodal Models and Their Capabilities
02:36 Traditional Approaches to Text-to-Speech
03:06 Token-Based Approaches and Their Advantages
03:25 Detailed Look at Orpheus and CSM-1B Models
06:58 Training and Inference with Token-Based Models
12:53 Hierarchical Tokenization for High-Quality Audio
14:11 Kyutai’s Moshi Model for Text + Speech
23:41 Sesame’s CSM-1B Model Architecture
25:13 Orpheus TTS Architecture by Canopy Labs
27:34 Inferencing and Cloning with CSM-1B
40:13 Context-Aware Text-to-Speech with CSM-1B
48:21 Orpheus Inference and Cloning - FREE Colab
55:09 Orpheus Voice Cloning Setup
01:01:20 Orpheus Fine-tuning (Full fine-tuning and LoRA fine-tuning)
01:09:55 Running Full Fine-tuning
01:19:33 Running LoRA Fine-tuning
01:25:20 Inference and Comparison
01:29:27 Inference with Cloning AND Fine-tuning
01:35:48 The Future of Token-Based Multimodal Models
PS: I've rotated all Hugging Face access keys.