Multi DeepSeek R1: STEP-GRPO RL MultiModal
My video explores new AI research on R1-style multimodal reasoning and shows how StepGRPO’s step-wise rewards enable more reliable, structured, and logically sound reasoning in multimodal large language models (MLLMs). By giving continuous, detailed feedback on both the accuracy and the validity of the reasoning, these rewards drive incremental improvements that go beyond passive supervised imitation. The paper reports superior performance across multiple reasoning benchmarks.
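For viewers who want the core idea in code: below is a minimal sketch of the "group relative" part, assuming each sampled reasoning path has already been scored. The reward values, the simple sum of the accuracy and validity rewards, and the function names are illustrative assumptions, not the authors' implementation. The normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO-style advantage.

```python
from statistics import mean, pstdev

def total_reward(accuracy_reward, validity_reward):
    # Assumption: the two step-wise rewards are simply added together.
    return accuracy_reward + validity_reward

def group_relative_advantages(rewards, eps=1e-8):
    # Score each path relative to the other paths sampled for the same question.
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical (accuracy, validity) rewards for 4 sampled reasoning paths.
group = [(1.0, 1.0), (0.5, 1.0), (0.0, 0.0), (0.5, 0.0)]
rewards = [total_reward(a, v) for a, v in group]
advantages = group_relative_advantages(rewards)
```

Paths that score above the group average get a positive advantage and are reinforced. Paths below the average get a negative one. If every path in the group scores the same, all advantages are zero and that group contributes no learning signal.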
All rights with the authors:
"R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization"
Jingyi Zhang (1), Jiaxing Huang (1), Huanjin Yao (2), Shunyu Liu (1), Xikun Zhang (1), Shijian Lu (1), Dacheng Tao (1)
from
(1) Nanyang Technological University and (2) Tsinghua University
#airesearch
#aiexplained
#deepseek
#r1