In this video, we break down knowledge distillation, the technique that powers models like Gemma 3, Llama 4 Scout & Maverick, and DeepSeek-R1, and a topic that featured prominently at LlamaCon 2025.
You’ll learn:
• What knowledge distillation really is (and what it’s not)
• How it helps scale LLMs without bloating inference cost
• The origin story from ensembles and model compression (2006) to Hinton’s "dark knowledge" paper (2015)
• Why "soft labels" carry more information than one-hot targets
• How companies like Google, Meta, and DeepSeek apply distillation differently
• The true meaning behind terms like temperature, behavioral cloning, and co-distillation (a short code sketch of temperature-scaled soft labels follows below)
Whether you’re building, training, or just trying to understand modern AI systems, this video gives you a deep but accessible introduction to how LLMs teach each other.
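If you like to see the math as code: here is a minimal sketch of a Hinton-style distillation loss, showing how a temperature softens teacher logits into soft labels that the student learns from alongside the usual one-hot targets. The function name, the alpha weighting, and the hyperparameter values are illustrative assumptions, not details taken from the video.

```python
# Minimal knowledge-distillation loss sketch (PyTorch).
# Illustrative only: names and hyperparameters are assumptions, not the video's code.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Blend cross-entropy on hard labels with a KL term that pulls the
    student's softened distribution toward the teacher's soft labels."""
    # Temperature > 1 spreads probability mass over "wrong" classes,
    # exposing the teacher's "dark knowledge".
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between the softened distributions; the T^2 factor keeps
    # gradient magnitudes comparable as the temperature changes.
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the one-hot targets.
    ce_term = F.cross_entropy(student_logits, hard_labels)

    return alpha * kd_term + (1 - alpha) * ce_term

# Example: a batch of 4 examples over 10 classes.
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```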
👉 Slide deck and paper list available for free on Patreon: https://www.patreon.com/c/juliaturc
00:00 – Intro
00:45 – Why distillation matters for scaling
02:26 – The 2006 origins: ensembles and model compression
05:45 – Hinton's 2015 paper: soft labels & dark knowledge
08:26 – What temperature really means
09:37 – Distillation in modern LLMs (Gemma, Llama, DeepSeek)
10:53 – Proper distillation vs. behavioral cloning
13:18 – Computational costs of distillation
14:16 – Co-distillation explained
15:32 – Outro