Get exclusive access to AI resources and project ideas: https://the-data-entrepreneurs.kit.com/shaw
Here, I discuss the technical details behind the recent “advanced reasoning” models trained with large-scale reinforcement learning, i.e. OpenAI's o1 and DeepSeek-R1.
📰 Read more: https://shawhin.medium.com/how-to-train-llms-to-think-like-o1-deepseek-r1-eabc21c8842d?source=friends_link&sk=ec3e7ca77cd47f76ce38015c87ba5084
References
[1] https://openai.com/index/learning-to-reason-with-llms/
[2] https://arxiv.org/abs/2501.12948
[3] https://youtu.be/7xTGNNLPyMI
[4] https://huggingface.co/datasets/open-r1/OpenR1-Math-220k
[5] https://discovery.ucl.ac.uk/id/eprint/10045895/1/agz_unformatted_nature.pdf
Intro - 0:00
OpenAI's o1 - 0:33
Test-time Compute - 1:33
"Thinking" Tokens - 3:50
DeepSeek Paper - 5:58
Reinforcement Learning - 7:22
R1-Zero: Prompt Template - 9:28
R1-Zero: Reward - 10:53
R1-Zero: GRPO (technical) - 12:53
R1-Zero: Results - 20:00
DeepSeek R1 - 23:32
Step 1: SFT with CoT - 24:47
Step 2: R1-Zero Style RL - 26:14
Step 3: SFT with Mixed Data - 27:03
Step 4: RL & RLHF - 28:26
Accessing DeepSeek Models - 29:18
Conclusions - 30:10
Homepage: https://www.shawhintalebi.com/