Self-Reflecting LLMs: Reinforcement Learning That Boosts Reasoning [Maohao Shen] - 726

Today, we're joined by Maohao Shen, PhD student at MIT, to discuss his paper, “Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search.” We dig into how Satori leverages reinforcement learning to improve language model reasoning, enabling self-reflection, self-correction, and exploration of alternative solutions. We explore the Chain-of-Action-Thought (COAT) approach, which uses special tokens—continue, reflect, and explore—to guide the model through distinct reasoning actions, allowing it to navigate complex reasoning tasks without external supervision. We also break down Satori’s two-stage training process: format tuning, which teaches the model to understand and use the special action tokens, and reinforcement learning, which optimizes reasoning through trial-and-error self-improvement. We cover key techniques such as “restart and explore,” which allow the model to self-correct and generalize beyond its training domain. Finally, Maohao reviews Satori’s performance and how it compares to other models, the reward design, the benchmarks used, and the surprising observations made during the research.

🎧 / 🎥 Listen or watch the full episode on our page: https://twimlai.com/go/726

🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confirmation=1

🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: https://twitter.com/twimlai
Follow us on LinkedIn: https://www.linkedin.com/company/twimlai/
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/

📖 CHAPTERS
===============================
00:00 - Introduction
03:40 - How the Satori paper fits into current trends in AI research
11:03 - Motivation of Satori
17:01 - Autoregressive search
21:20 - Chain-of-Action-Thought Reasoning (COAT)
23:11 - Challenges
23:54 - COAT reasoning, imitation learning, and format tuning
28:42 - Two stages of training
34:18 - Relationship of format tuning and self-improvement
37:47 - Performance
39:46 - Reward design of the RL component
42:27 - Base model
44:21 - Benchmarks and results
48:32 - Future directions

🔗 LINKS & RESOURCES
===============================
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search - https://arxiv.org/abs/2502.02508

📸 Camera: https://amzn.to/3TQ3zsg
🎙️ Microphone: https://amzn.to/3t5zXeV
🚦 Lights: https://amzn.to/3TQlX49
🎛️ Audio Interface: https://amzn.to/3TVFAIq
🎚️ Stream Deck: https://amzn.to/3zzm7F5
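To make the COAT idea discussed in the episode concrete, here is a toy Python sketch of a reasoning trace built from special action tokens, where a "reflect" step can trigger a "restart and explore" correction. The token strings, the `coat_rollout` helper, and the `verify` callback are all illustrative assumptions, not names from the Satori implementation; a real model would re-decode the alternative branch rather than tag it.

```python
# Illustrative COAT-style trace: each reasoning step carries an action
# token. These token names are hypothetical, not from the Satori code.
CONTINUE, REFLECT, EXPLORE = "<|continue|>", "<|reflect|>", "<|explore|>"

def coat_rollout(steps, verify):
    """Walk a scripted list of (action_token, text) reasoning steps.

    On a REFLECT step, verify() re-checks the most recent step; if it
    fails, we "restart and explore": that step is replaced with an
    EXPLORE branch (a stand-in for re-decoding an alternative path).
    """
    trace = []
    for action, text in steps:
        if action == REFLECT and trace:
            _, last_text = trace[-1]
            if not verify(last_text):
                trace[-1] = (EXPLORE, "alternative to: " + last_text)
        trace.append((action, text))
    return trace

# A deliberately wrong step gets caught at the reflect point.
out = coat_rollout(
    [(CONTINUE, "compute 2 + 2 = 5"),
     (REFLECT, "check the arithmetic")],
    verify=lambda t: "2 + 2 = 4" in t,
)
# out[0] is now an EXPLORE branch; out[1] remains the REFLECT step
```

The point of the sketch is the control flow: reflection and exploration are ordinary tokens the model emits, so self-correction happens inside a single autoregressive pass rather than via an external search loop.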