Training LLM to play chess using Deepseek GRPO reinforcement learning
13,206 views
Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io

In this video, we see that while popular LLMs like GPT-4o, o1 Reasoning, and DeepSeek R1 show some understanding of chess, they often fail to play legal moves. To address this, we train our own reasoning-focused chess LLM using the Group Relative Policy Optimization (GRPO) method introduced in DeepSeek R1. We walk through how GRPO differs from traditional PPO (Proximal Policy Optimization) and fine-tune LLaMA 8B and Qwen 7B using the TRL (Transformers Reinforcement Learning) and Unsloth libraries - the results are surprising! Finally, we review some other chess-playing neural networks, such as DeepMind's grandmaster-level chess without search and ChessGPT.

0:00 - Introduction
1:18 - Chess RL Strategy
3:51 - How well do the best LLMs understand chess?
6:41 - Picking a base model
8:31 - Unsloth and TRL libraries for RL with LLMs
9:38 - LoRA (Low Rank Adaptation)
10:55 - GSM8K reasoning example
12:06 - PPO (Proximal Policy Optimization)
14:12 - GRPO (Group Relative Policy Optimization)
17:15 - GRPO training results
18:11 - Analysis of results for LLaMA and Qwen
20:52 - Limitations of GRPO on small models
23:29 - Grandmaster-level chess without search
27:10 - ChessGPT and other LLMs that play chess
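The key difference between GRPO and PPO mentioned above is that GRPO drops PPO's learned value network (critic): it samples a group of completions for each prompt and normalizes each reward against the group's mean and standard deviation to get the advantage. The sketch below illustrates that advantage computation in plain Python; the reward values (e.g. rewarding legal chess moves) are a hypothetical scheme for illustration, not the exact rewards used in the video.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: normalize each
    completion's reward by the group's mean and standard deviation,
    so no learned value network (critic) is needed, unlike PPO."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    if std == 0:
        # Identical rewards across the group carry no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Hypothetical reward scheme for 4 sampled moves from one position:
# 1.0 = legal move matching an engine's choice, 0.5 = any legal move,
# 0.0 = illegal move.
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = grpo_advantages(rewards)
```

Each advantage then weights the policy-gradient update for its completion's tokens, so above-average moves in the group are reinforced and below-average ones are suppressed.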