DeepSeek's GRPO (Group Relative Policy Optimization) | Reinforcement Learning for LLMs

In this video, I break down DeepSeek's Group Relative Policy Optimization (GRPO) from first principles, without assuming prior knowledge of Reinforcement Learning. By the end, you’ll understand the core RL building blocks that led to GRPO, including:

🔵 Policy Gradient Methods
🔵 The REINFORCE Algorithm
🔵 Actor-Critic Models
🔵 PPO (Proximal Policy Optimization)
🔵 GRPO (Group Relative Policy Optimization)

Papers:
GRPO paper (DeepSeekMath): https://arxiv.org/pdf/2402.03300
DeepSeek-R1 paper: https://arxiv.org/pdf/2501.12948
PPO paper: https://arxiv.org/pdf/1707.06347
GAE paper: https://arxiv.org/pdf/1506.02438
TRPO paper: https://arxiv.org/pdf/1502.05477

Mother of all RL books (Sutton & Barto):
http://incompleteideas.net/book/RLboo... 

00:00 Intro
00:53 Where GRPO fits within the LLM training pipeline
04:17 RL fundamentals for LLMs
08:25 Policy Gradient Methods & REINFORCE 
11:58 Reward baselines & Actor-Critic Methods
14:10 GRPO
21:42 Wrap-up: PPO vs GRPO
22:32 Research papers are like Instagram
