Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.

In this video, I will explain Reinforcement Learning from Human Feedback (RLHF), which is used to align models such as ChatGPT. I will start by introducing how language models work and what we mean by AI alignment. In the second part of the video, I will derive the Policy Gradient Optimization algorithm from first principles, also explaining the problems with the gradient calculation. I will describe the techniques used to reduce the variance of the estimator (by introducing the baseline) and how Off-Policy Learning can make the training tractable. I will also describe how to build the reward model and explain its loss function. To calculate the gradient of the policy, we need the log probabilities of the state-action pairs (the trajectories), the value function, the rewards, and the advantage terms (through Generalized Advantage Estimation): I will explain every step visually. After explaining Policy Gradient Optimization, I will introduce the Proximal Policy Optimization algorithm and its loss function, explaining all the details, including the loss of the value head and the entropy bonus. In the last part of the video, I go through the implementation of RLHF/PPO, explaining the entire process line by line. For every mathematical formula, I will always give a visual intuition to help those who lack the mathematical background.

PPO paper: Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. https://arxiv.org/abs/1707.06347

InstructGPT paper: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp. 27730-27744. https://arxiv.org/abs/2203.02155

Generalized Advantage Estimation paper: Schulman, J., Moritz, P., Levine, S., Jordan, M. and Abbeel, P., 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438. https://arxiv.org/abs/1506.02438

Slides PDF and commented code: https://github.com/hkproj/rlhf-ppo

Chapters
00:00:00 - Introduction
00:03:52 - Intro to Language Models
00:05:53 - AI Alignment
00:06:48 - Intro to RL
00:09:44 - RL for Language Models
00:11:01 - Reward model
00:20:39 - Trajectories (RL)
00:29:33 - Trajectories (Language Models)
00:31:29 - Policy Gradient Optimization
00:41:36 - REINFORCE algorithm
00:44:08 - REINFORCE algorithm (Language Models)
00:45:15 - Calculating the log probabilities
00:49:15 - Calculating the rewards
00:50:42 - Problems with Policy Gradient Optimization: variance
00:56:00 - Rewards to go
00:59:19 - Baseline
01:02:49 - Value function estimation
01:04:30 - Advantage function
01:10:54 - Generalized Advantage Estimation
01:19:50 - Advantage function (Language Models)
01:21:59 - Problems with Policy Gradient Optimization: sampling
01:24:08 - Importance Sampling
01:27:56 - Off-Policy Learning
01:33:02 - Proximal Policy Optimization (loss)
01:40:59 - Reward hacking (KL divergence)
01:43:56 - Code walkthrough
02:13:26 - Conclusion
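As a companion to the reward-model chapter, here is a minimal sketch of the pairwise loss described in the InstructGPT paper: the reward model scores a chosen and a rejected completion for the same prompt, and the loss is the negative log-sigmoid of the score difference. Tensor names (reward_chosen, reward_rejected) are illustrative and not taken from the linked repository.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).

    Both inputs have shape (batch,) and hold the scalar reward assigned by the
    reward model to the preferred and the rejected answer, respectively.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: the loss shrinks as the chosen answer is scored higher than the rejected one.
r_chosen = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([0.1, 0.5])
print(reward_model_loss(r_chosen, r_rejected))
```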
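For the "Calculating the log probabilities" step, a common pattern is to gather, from the model's logits, the log probability of each generated token given the previous tokens. The sketch below assumes a generic causal language model interface (logits of shape batch x seq_len x vocab_size); it is a simplified illustration, not the exact code from the repository.

```python
import torch

def token_log_probs(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Per-token log probabilities of a trajectory under the policy.

    logits:    (batch, seq_len, vocab_size) output of a causal LM
    input_ids: (batch, seq_len) tokens that were actually generated/observed

    The logit at position t predicts the token at position t+1, so we shift by one.
    Returns a (batch, seq_len - 1) tensor of log probabilities.
    """
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    next_tokens = input_ids[:, 1:].unsqueeze(-1)
    return log_probs.gather(dim=-1, index=next_tokens).squeeze(-1)

# Toy usage with random logits.
logits = torch.randn(2, 5, 10)
ids = torch.randint(0, 10, (2, 5))
print(token_log_probs(logits, ids).shape)  # torch.Size([2, 4])
```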
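The Generalized Advantage Estimation chapter boils down to the recursion A_t = delta_t + gamma * lambda * A_{t+1}, with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). Below is a small reference implementation of that recursion for a single trajectory; variable names and defaults are assumptions of this sketch, not the repository's.

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float = 0.99, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation for one trajectory.

    rewards: (T,)   reward received at each step
    values:  (T+1,) value estimates, including the bootstrap value of the state
             after the last step (0 if the trajectory is terminal)
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_advantage = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        next_advantage = delta + gamma * lam * next_advantage    # GAE recursion
        advantages[t] = next_advantage
    return advantages

# Toy usage: 4 steps, reward only at the end, zero bootstrap value.
print(gae(torch.tensor([0.0, 0.0, 0.0, 1.0]), torch.zeros(5)))
```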
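The PPO loss discussed in the video combines three terms: the clipped surrogate objective built on the importance-sampling ratio between the current and the old policy (the off-policy correction), a squared-error loss for the value head, and an entropy bonus that encourages exploration. The sketch below follows the PPO paper; coefficient names (clip_eps, vf_coef, ent_coef) are conventional choices, not necessarily those used in the repository.

```python
import torch

def ppo_loss(log_probs, old_log_probs, advantages, values, returns, entropy,
             clip_eps: float = 0.2, vf_coef: float = 0.5, ent_coef: float = 0.01):
    """Clipped PPO objective = policy loss + value loss - entropy bonus.

    All tensor arguments share the same shape, one entry per (state, action) pair.
    """
    # Importance-sampling ratio pi_theta / pi_theta_old.
    ratio = torch.exp(log_probs - old_log_probs)

    # Clipped surrogate objective (we minimize its negative).
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value head regressed towards the estimated returns.
    value_loss = (values - returns).pow(2).mean()

    # Entropy bonus keeps the policy from collapsing too early.
    entropy_bonus = entropy.mean()

    return policy_loss + vf_coef * value_loss - ent_coef * entropy_bonus

# Toy usage with random tensors.
x = lambda: torch.randn(8)
print(ppo_loss(x(), x(), x(), x(), x(), torch.rand(8)))
```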
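Finally, for the reward-hacking chapter: in RLHF the reward model's score is typically penalized by a per-token KL term between the trained policy and a frozen reference model, so the policy cannot drift arbitrarily far from its starting point just to please the reward model. A minimal sketch of that penalty, with beta as an assumed coefficient name:

```python
import torch

def kl_penalized_rewards(reward_model_score: torch.Tensor,
                         policy_log_probs: torch.Tensor,
                         ref_log_probs: torch.Tensor,
                         beta: float = 0.1) -> torch.Tensor:
    """Per-token rewards with a KL penalty against a frozen reference model.

    policy_log_probs, ref_log_probs: (batch, seq_len) per-token log probabilities
    reward_model_score: (batch,) scalar score for the whole completion,
                        added on the last token of each sequence
    """
    # Per-token estimate of KL(policy || reference).
    kl = policy_log_probs - ref_log_probs
    rewards = -beta * kl
    rewards[:, -1] += reward_model_score  # the reward model scores the full answer
    return rewards

# Toy usage.
print(kl_penalized_rewards(torch.tensor([1.0]), torch.randn(1, 4), torch.randn(1, 4)))
```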