In this video, we break down knowledge distillation, the technique that powers models like Gemma 3, Llama 4 Scout & Maverick, and DeepSeek-R1, and a topic that featured prominently at LlamaCon 2025.
You’ll learn:
• What knowledge distillation really is (and what it’s not)
• How it helps scale LLMs without bloating inference cost
• The origin story from ensembles and model compression (2006) to Hinton’s "dark knowledge" paper (2015)
• Why "soft labels" carry more information than one-hot targets
• How companies like Google, Meta, and DeepSeek apply distillation differently
• The true meaning behind terms like temperature, behavioral cloning, and co-distillation (a short code sketch of temperature-scaled soft labels follows below)
Whether you’re building, training, or just trying to understand modern AI systems, this video gives you a deep but accessible introduction to how LLMs teach each other.
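If you like to see the math as code: here is a minimal sketch of a Hinton-style distillation loss, showing how a temperature softens teacher logits into soft labels that the student learns from alongside the usual one-hot targets. The function name, the alpha weighting, and the hyperparameter values are illustrative assumptions, not details taken from the video.

```python
# Minimal knowledge-distillation loss sketch (PyTorch).
# Illustrative only: names and hyperparameters are assumptions, not the video's code.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Blend cross-entropy on hard labels with a KL term that pulls the
    student's softened distribution toward the teacher's soft labels."""
    # Temperature > 1 spreads probability mass over "wrong" classes,
    # exposing the teacher's "dark knowledge".
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the softened distributions; the T^2 factor keeps
    # gradient magnitudes comparable as the temperature changes.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the one-hot targets.
    ce_term = F.cross_entropy(student_logits, hard_labels)

    return alpha * kd_term + (1 - alpha) * ce_term

# Example: a batch of 4 examples over 10 classes.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```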
👉 Slide deck and paper list available for free on Patreon: https://www.patreon.com/c/juliaturc
00:00 – Intro
00:45 – Why distillation matters for scaling
02:26 – The 2006 origins: ensembles and model compression
05:45 – Hinton's 2015 paper: soft labels & dark knowledge
08:26 – What temperature really means
09:37 – Distillation in modern LLMs (Gemma, Llama, DeepSeek)
10:53 – Proper distillation vs. behavioral cloning
13:18 – Computational costs of distillation
14:16 – Co-distillation explained
15:32 – Outro