RLHF & DPO Explained (In Simple Terms!)

Learn how Reinforcement Learning from Human Feedback (RLHF) actually works and why Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are changing the game. This video doesn't go deep into the math. Instead, I give a high-level overview of each technique to help you make practical decisions about where to focus your time and energy.

0:52 The Idea of Reinforcement Learning
1:55 Reinforcement Learning from Human Feedback (RLHF)
4:21 RLHF in a Nutshell
5:06 RLHF Variations
6:11 Challenges with RLHF
7:02 Direct Preference Optimization (DPO)
7:47 Preferences Dataset Example
8:29 DPO in a Nutshell
9:25 DPO Advantages over RLHF
10:32 Challenges with DPO
10:50 Kahneman-Tversky Optimization (KTO)
11:39 Prospect Theory
13:35 Sigmoid vs Value Function
13:49 KTO Dataset
15:28 KTO in a Nutshell
15:54 Advantages of KTO
18:03 KTO Hyperparameters

These are the three papers referenced in the video:
1. Deep Reinforcement Learning from Human Preferences (https://arxiv.org/abs/1706.03741)
2. Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/abs/2305.18290)
3. KTO: Model Alignment as Prospect Theoretic Optimization (https://arxiv.org/abs/2402.01306)

The Hugging Face TRL library offers implementations of PPO, DPO, and KTO (see the sketch below): https://huggingface.co/docs/trl/main/en/kto_trainer

Want to prototype with prompts and supervised fine-tuning? Try Entry Point AI: https://www.entrypointai.com/

How about connecting? I'm on LinkedIn: https://www.linkedin.com/in/markhennings/
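
To make the TRL link above concrete, here is a minimal Python sketch of unpaired preference tuning with TRL's KTOTrainer. The model name, example data, and hyperparameter values are illustrative assumptions of mine (not settings from the video), and the exact argument names (e.g., processing_class vs. tokenizer) can vary between TRL versions.

# Minimal sketch of KTO fine-tuning with Hugging Face TRL.
# Model name, data, and hyperparameters are placeholders, not values from the video.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# KTO uses unpaired examples: each completion is simply labeled desirable (True) or undesirable (False),
# unlike DPO, which needs a chosen/rejected pair for every prompt.
train_dataset = Dataset.from_dict({
    "prompt": ["What is RLHF?", "What is RLHF?"],
    "completion": [
        "RLHF fine-tunes a language model using human preference feedback.",
        "RLHF is a kind of database index.",
    ],
    "label": [True, False],
})

# beta controls how far the model may drift from the reference model;
# desirable_weight / undesirable_weight rebalance unequal amounts of good vs. bad examples.
training_args = KTOConfig(
    output_dir="kto-example",
    beta=0.1,
    desirable_weight=1.0,
    undesirable_weight=1.0,
)

trainer = KTOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL versions call this argument "tokenizer"
)
trainer.train()

The same pattern works for DPO by swapping in DPOConfig and DPOTrainer and providing a paired prompt/chosen/rejected dataset instead.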