An Unexpected Reinforcement Learning Renaissance

The era we are living through in language modeling research is one pervaded by near-complete faith that reasoning and new reinforcement learning (RL) training methods will work. This is well founded. A day cannot go by without a new reasoning model, RL training result, or dataset distilled from DeepSeek R1.

More information: https://www.interconnects.ai/p/an-unexpected-rl-renaissance
Slides: https://docs.google.com/presentation/d/1z-i4NuqSc7rFKmI9zNnz3CJ_eshUBpJpbvcswCeWvPk/edit#slide=id.p

00:00 The ingredients of an RL paradigm shift
16:04 RL with verifiable rewards
27:38 What DeepSeek R1 taught us
29:30 RL as the focus of language modeling

The difference from the last time RL was at the forefront of the AI world, when reinforcement learning from human feedback (RLHF) was needed to create ChatGPT, is that we have far better infrastructure than we did on our first pass through this. People are already successfully using TRL, OpenRLHF, veRL, and of course Open Instruct (our tools for Tülu 3/OLMo) to train models like this. When models such as Alpaca, Vicuña, Dolly, etc. were coming out, they were all built on basic instruction tuning. Even though RLHF was the motivation for those experiments, limited tooling and a lack of datasets made complete, substantive replications rare. On top of that, every organization was trying to recalibrate its AI strategy for the second time in six months; the reaction to and excitement around Stable Diffusion was all but overwritten by ChatGPT.

This time is different. With reasoning models, everyone has already raised money for their AI companies, open-source tooling for RLHF exists and is stable, and everyone is already feeling the AGI.

The goal of this talk is to try and make sense of the story that is unfolding today:

Given it is becoming obvious that RL with verifiable rewards works on older models, why did the AI community sleep on the potential of these reasoning models? (For a quick illustration of what a verifiable reward is, see the short sketch at the end of this post.)
How do we contextualize the development of RLHF techniques alongside the new types of RL training?
What is the future of post-training? How far can we scale RL?
How does today’s RL compare to the historical successes of deep RL?
And other topics.

This is a longer recording of a talk I gave this week at a local Seattle research meetup. Some of the key points I arrived at:

RLHF was necessary, but not sufficient, for ChatGPT.
RL training, like that used for reasoning, could become the primary driving force of future LM development.
There’s a path for “post-training” to simply be called “training” in the future.
While this will feel like the Alpaca moment from two years ago, it will produce much deeper results and impact.
Self-play, inference-time compute, and other popular terms related to this movement are more “side quests” than core to the developments.
There is just so much low-hanging fruit for improving models with RL. It’s wonderfully exciting.

For the rest, you’ll have to watch the talk.

Get Interconnects (https://www.interconnects.ai/)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on LinkedIn: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
... on Apple Podcasts: https://podcasts.apple.com/us/podcast/interconnects/id1719552353
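
For readers who have not seen the term before, here is a minimal, hypothetical Python sketch of what “verifiable rewards” means in practice: the reward is computed by checking a model’s answer against a known-correct reference rather than by querying a learned reward model. The function name, prompt format, and regex below are illustrative assumptions only and do not correspond to the API of TRL, OpenRLHF, veRL, or Open Instruct.

```python
import re

# Hypothetical sketch of "RL with verifiable rewards" (RLVR): instead of scoring
# completions with a learned reward model (as in classic RLHF), the reward is
# computed by checking the model's final answer against a known-correct reference.

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 if the completion's final numeric answer matches the reference, else 0.0."""
    # Assume the model was prompted to finish with "the answer is <number>".
    match = re.search(r"answer is\s*([-+]?\d+(?:\.\d+)?)", completion.lower())
    if match is None:
        return 0.0  # unparseable output gets no reward
    return 1.0 if match.group(1) == ground_truth.strip() else 0.0

# This binary signal is what gets plugged into a PPO/GRPO-style trainer
# in place of a reward model's scalar score.
print(verifiable_reward("Step by step... so the answer is 42.", "42"))  # 1.0
print(verifiable_reward("I believe it's 41.", "42"))                    # 0.0
```

In real training runs the check is usually more robust (answer normalization for math, unit tests for code, constraint checks for instructions), but the core idea is exactly this: a programmatically checkable reward signal.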