Knowledge Distillation: How LLMs train each other

In this video, we break down knowledge distillation, the technique that powers models like Gemma 3, LLaMA 4 Scout & Maverick, and DeepSeek-R1. Distillation was prominently discussed at LlamaCon 2025.

You'll learn:
• What knowledge distillation really is (and what it's not)
• How it helps scale LLMs without bloating inference cost
• The origin story, from ensembles and model compression (2006) to Hinton's "dark knowledge" paper (2015)
• Why "soft labels" carry more information than one-hot targets (see the code sketch after the chapter list)
• How companies like Google, Meta, and DeepSeek apply distillation differently
• The true meaning behind terms like temperature, behavioral cloning, and co-distillation

Whether you're building, training, or just trying to understand modern AI systems, this video gives you a deep but accessible introduction to how LLMs teach each other.

👉 Slide deck and paper list available for free on Patreon: https://www.patreon.com/c/juliaturc

00:00 – Intro
00:45 – Why distillation matters for scaling
02:26 – The 2006 origins: ensembles and model compression
05:45 – Hinton's 2015 paper: soft labels & dark knowledge
08:26 – What temperature really means
09:37 – Distillation in modern LLMs (Gemma, LLaMA, DeepSeek)
10:53 – Proper distillation vs. behavioral cloning
13:18 – Computational costs of distillation
14:16 – Co-distillation explained
15:32 – Outro
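To make the "soft labels" and "temperature" ideas concrete, here is a minimal PyTorch sketch of a temperature-scaled distillation loss in the spirit of Hinton et al. (2015). The function name, the temperature value, and the alpha mixing weight are illustrative assumptions, not the exact recipe used for Gemma, LLaMA, or DeepSeek.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    # Soften both distributions with the temperature; a higher T spreads
    # probability mass across more classes, exposing the teacher's
    # "dark knowledge" about which wrong answers are almost right.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the softened distributions, scaled by T^2 so the
    # gradient magnitude stays comparable across temperatures.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Ordinary cross-entropy against the one-hot ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    # alpha blends "learn from the teacher" with "learn from the data".
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage: a batch of 4 positions over an 8-token vocabulary.
student_logits = torch.randn(4, 8)
teacher_logits = torch.randn(4, 8)
labels = torch.randint(0, 8, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```

With temperature = 1 the soft targets collapse toward the teacher's ordinary predictions; raising it is what lets the student see how the teacher ranks the wrong answers, which is exactly the extra information one-hot labels throw away.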