Turns out reinforcement learning is all you need.
Check out my prior video on RL:
https://youtu.be/qTY4Rr-x5q0?si=pgTpw9r9xwkuZJM6
Resources:
Code: https://github.com/ALucek/GRPO-Training/tree/main
Model: https://huggingface.co/AdamLucek/Qwen2.5-3B-Instruct-GRPO-2K-GSM8K
DeepSeek-R1 Paper: https://arxiv.org/pdf/2501.12948
DeepSeek Math Paper: https://arxiv.org/pdf/2402.03300
Unsloth Reasoning Blog: https://unsloth.ai/blog/r1-reasoning
Willccbb’s GRPO Demo: https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb
Chapters:
00:00 - LLM Reasoning
01:44 - PPO Context
05:07 - GRPO Algorithm
07:24 - DeepSeek-R1-Zero Training
10:41 - DeepSeek-R1 Training
14:41 - Training: Model Loading
19:17 - Training: Dataset Prep
21:24 - Training: Reward Functions
23:11 - Training: GRPO Trainer
24:05 - Training: Outcome and Inference
#ai #datascience #programming