Here's an overview of the DeepSeek R1 paper. I read the paper this week and was fascinated by the methods; however, it was a bit difficult to follow what was going on with all the models being used.
I found a neat map of the methodology, which I'll use in this tutorial to walk you through the paper.
I still strongly recommend that you read the paper itself:
📌 PAPER: https://arxiv.org/pdf/2501.12948
and also check out these two videos for the GRPO bit:
📌 https://www.youtube.com/watch?v=XMnxKGVnEUc&ab_channel=UmarJamil
📌 https://www.youtube.com/watch?v=bAWV_yrqx4w&ab_channel=YannicKilcher
Btw, the map I'm using is over here:
https://www.reddit.com/r/LocalLLaMA/comments/1i66j4f/deepseekr1_training_pipeline_visualized/
Table of contents
- Introduction: 0:00
- DeepSeek-R1-Zero path: 2:23
- Reinforcement learning setup: 3:59
- Group Relative Policy Optimization (GRPO): 7:03
- DeepSeek-R1-Zero results: 11:40
- Cold-start supervised fine-tuning: 15:30
- Consistency reward for CoT: 16:19
- Supervised fine-tuning data generation: 17:17
- Reinforcement learning with neural reward model: 19:47
- Distillation: 21:26
- Conclusion: 24:34
----
Join the newsletter for weekly AI content: https://yacinemahdid.com
Join the Discord for general discussion: https://discord.gg/QpkxRbQBpf
----
Follow Me Online Here:
GitHub: https://github.com/yacineMahdid
LinkedIn: https://www.linkedin.com/in/yacinemahdid/
----
Have a great week! 👋