Proximal Policy Optimization (PPO) - How to train Large Language Models

Reinforcement Learning with Human Feedback (RLHF) is a method used for training Large Language Models (LLMs). At the heart of RLHF lies a very powerful reinforcement learning method called Proximal Policy Optimization. Learn about it in this simple video! This is the first in a series of 3 videos dedicated to the reinforcement learning methods used for training LLMs.

Full Playlist: https://www.youtube.com/playlist?list=PLs8w1Cdi-zvYviYYw_V3qe6SINReGF5M-
Video 0 (Optional): Introduction to deep reinforcement learning https://www.youtube.com/watch?v=SgC6AZss478
Video 1 (This one): Proximal Policy Optimization
Video 2: Reinforcement Learning with Human Feedback https://www.youtube.com/watch?v=Z_JUqJBpVOk
Video 3 (Coming soon!): Deterministic Policy Optimization

Chapters:
00:00 Introduction
01:25 Gridworld
03:10 States and Actions
04:01 Values
07:30 Policy
09:39 Neural Networks
16:14 Training the value neural network (Gain)
22:50 Training the policy neural network (Surrogate Objective Function)
33:38 Clipping the surrogate objective function
36:49 Summary

Get the Grokking Machine Learning book! https://manning.com/books/grokking-machine-learning
Discount code (40%): serranoyt (use the discount code at checkout)
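The chapters above mention the clipped surrogate objective function at the core of PPO. The following is a minimal, illustrative sketch of that objective (not code from the video), assuming NumPy and made-up variable names: it computes the probability ratio between the current and old policies, clips it to a small range, and takes the minimum of the clipped and unclipped terms to keep policy updates conservative.

```python
import numpy as np

def clipped_surrogate_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """PPO clipped surrogate objective (to be maximized).

    new_log_probs: log pi_theta(a_t | s_t) under the current policy
    old_log_probs: log pi_theta_old(a_t | s_t) under the policy that collected the data
    advantages:    advantage estimates A_t for each (state, action) pair
    epsilon:       clipping range; 0.2 is a common default
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old
    ratio = np.exp(new_log_probs - old_log_probs)
    # Unclipped and clipped terms; the minimum prevents overly large policy updates
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))

# Toy example with made-up numbers (purely illustrative)
new_lp = np.array([-0.9, -1.2, -0.4])
old_lp = np.array([-1.0, -1.0, -1.0])
adv = np.array([0.5, -0.3, 1.2])
print(clipped_surrogate_objective(new_lp, old_lp, adv))
```

In practice this objective would be maximized with gradient ascent on the policy network's parameters, alongside a separate loss for the value network; the details of how the video trains those networks are covered in the chapters listed above.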