Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.

In this video, I will explain Reinforcement Learning from Human Feedback (RLHF), which is used to align models such as ChatGPT. I will start by introducing how language models work and what we mean by AI alignment. In the second part of the video, I will derive the Policy Gradient Optimization algorithm from first principles, also explaining the problems with the gradient calculation. I will describe the techniques used to reduce the variance of the estimator (by introducing the baseline) and how Off-Policy Learning can make the training tractable. I will also describe how to build the reward model and explain its loss function. To calculate the gradient of the policy, we need the log probabilities of the state-action pairs (the trajectories), the value function, the rewards, and the advantage terms (through Generalized Advantage Estimation): I will explain every step visually. After explaining Policy Gradient Optimization, I will introduce the Proximal Policy Optimization algorithm and its loss function, explaining all the details, including the loss of the value head and the entropy bonus. In the last part of the video, I go through the implementation of RLHF/PPO, explaining the entire process line by line. For every mathematical formula, I will always give a visual intuition to help those who lack the mathematical background.

PPO paper: Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347

InstructGPT paper: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp. 27730-27744. https://arxiv.org/abs/2203.02155

Generalized Advantage Estimation paper: Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. https://arxiv.org/abs/1506.02438

Slides PDF and commented code: https://github.com/hkproj/rlhf-ppo

Chapters
00:00:00 - Introduction
00:03:52 - Intro to Language Models
00:05:53 - AI Alignment
00:06:48 - Intro to RL
00:09:44 - RL for Language Models
00:11:01 - Reward model
00:20:39 - Trajectories (RL)
00:29:33 - Trajectories (Language Models)
00:31:29 - Policy Gradient Optimization
00:41:36 - REINFORCE algorithm
00:44:08 - REINFORCE algorithm (Language Models)
00:45:15 - Calculating the log probabilities
00:49:15 - Calculating the rewards
00:50:42 - Problems with Policy Gradient Optimization: variance
00:56:00 - Rewards to go
00:59:19 - Baseline
01:02:49 - Value function estimation
01:04:30 - Advantage function
01:10:54 - Generalized Advantage Estimation
01:19:50 - Advantage function (Language Models)
01:21:59 - Problems with Policy Gradient Optimization: sampling
01:24:08 - Importance Sampling
01:27:56 - Off-Policy Learning
01:33:02 - Proximal Policy Optimization (loss)
01:40:59 - Reward hacking (KL divergence)
01:43:56 - Code walkthrough
02:13:26 - Conclusion
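As a companion to the reward-model chapter, here is a minimal sketch of the pairwise loss described in the InstructGPT paper: the reward model scores a chosen and a rejected completion for the same prompt, and the loss is the negative log-sigmoid of the score difference. Tensor names (reward_chosen, reward_rejected) are illustrative and not taken from the linked repository.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    Both inputs have shape (batch,) and hold the scalar reward assigned by the
    reward model to the preferred and the rejected answer, respectively.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: the loss shrinks as the chosen answer is scored higher than the rejected one.
r_chosen = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([0.1, 0.5])
print(reward_model_loss(r_chosen, r_rejected))
```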
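For the "Calculating the log probabilities" step, a common pattern is to gather, from the model's logits, the log probability of each generated token given the previous tokens. The sketch below assumes a generic causal language model interface (logits of shape batch x seq_len x vocab_size); it is a simplified illustration, not the exact code from the repository.

```python
import torch

def token_log_probs(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Per-token log probabilities of a trajectory under the policy.

    logits:    (batch, seq_len, vocab_size) output of a causal LM
    input_ids: (batch, seq_len) tokens that were actually generated/observed

    The logit at position t predicts the token at position t+1, so we shift by one.
    Returns a (batch, seq_len - 1) tensor of log probabilities.
    """
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    next_tokens = input_ids[:, 1:].unsqueeze(-1)
    return log_probs.gather(dim=-1, index=next_tokens).squeeze(-1)

# Toy usage with random logits.
logits = torch.randn(2, 5, 10)
ids = torch.randint(0, 10, (2, 5))
print(token_log_probs(logits, ids).shape)  # torch.Size([2, 4])
```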
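The Generalized Advantage Estimation chapter boils down to the recursion A_t = delta_t + gamma * lambda * A_{t+1}, with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). Below is a small reference implementation of that recursion for a single trajectory; variable names and defaults are assumptions of this sketch, not the repository's.

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation for one trajectory.

    rewards: (T,)   reward received at each step
    values:  (T+1,) value estimates, including the bootstrap value of the state
             after the last step (0 if the trajectory is terminal)
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_advantage = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        next_advantage = delta + gamma * lam * next_advantage    # GAE recursion
        advantages[t] = next_advantage
    return advantages

# Toy usage: 4 steps, reward only at the end, zero bootstrap value.
print(gae(torch.tensor([0.0, 0.0, 0.0, 1.0]), torch.zeros(5)))
```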
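The PPO loss discussed in the video combines three terms: the clipped surrogate objective built on the importance-sampling ratio between the current and the old policy (the off-policy correction), a squared-error loss for the value head, and an entropy bonus that encourages exploration. The sketch below follows the PPO paper; coefficient names (clip_eps, vf_coef, ent_coef) are conventional choices, not necessarily those used in the repository.

```python
import torch

def ppo_loss(log_probs, old_log_probs, advantages, values, returns, entropy,
             clip_eps: float = 0.2, vf_coef: float = 0.5, ent_coef: float = 0.01):
    """Clipped PPO objective = policy loss + value loss - entropy bonus.

    All tensor arguments share the same shape, one entry per (state, action) pair.
    """
    # Importance-sampling ratio pi_theta / pi_theta_old.
    ratio = torch.exp(log_probs - old_log_probs)

    # Clipped surrogate objective (we minimize its negative).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value head regressed towards the estimated returns.
    value_loss = (values - returns).pow(2).mean()

    # Entropy bonus keeps the policy from collapsing too early.
    entropy_bonus = entropy.mean()

    return policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

# Toy usage with random tensors.
x = lambda: torch.randn(8)
print(ppo_loss(x(), x(), x(), x(), x(), torch.rand(8)))
```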
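Finally, for the reward-hacking chapter: in RLHF the reward model's score is typically penalized by a per-token KL term between the trained policy and a frozen reference model, so the policy cannot drift arbitrarily far from its starting point just to please the reward model. A minimal sketch of that penalty, with beta as an assumed coefficient name:

```python
import torch

def kl_penalized_rewards(reward_model_score: torch.Tensor,
                         policy_log_probs: torch.Tensor,
                         ref_log_probs: torch.Tensor,
                         beta: float = 0.1) -> torch.Tensor:
    """Per-token rewards with a KL penalty against a frozen reference model.

    policy_log_probs, ref_log_probs: (batch, seq_len) per-token log probabilities
    reward_model_score: (batch,) scalar score for the whole completion,
                        added on the last token of each sequence
    """
    # Per-token estimate of KL(policy || reference).
    kl = policy_log_probs - ref_log_probs
    rewards = -beta * kl
    rewards[:, -1] += reward_model_score  # the reward model scores the full answer
    return rewards

# Toy usage.
print(kl_penalized_rewards(torch.tensor([1.0]), torch.randn(1, 4), torch.randn(1, 4)))
```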