Visualization-of-Thought: Multimodal Models That Imagine While They Reason [Chengzu Li] - 722

570 listens
Today, we're joined by Chengzu Li, PhD student at the University of Cambridge, to discuss his recent paper, “Imagine while Reasoning in Space: Multimodal Visualization-of-Thought.” We explore the motivations behind MVoT, its connection to prior work like TopViewRS, and its relation to cognitive science principles such as dual coding theory. We dig into the MVoT framework along with its various task environments: maze, mini-behavior, and frozen lake. We explore token discrepancy loss, a technique designed to align language and visual embeddings, ensuring accurate and meaningful visual representations. Additionally, we cover the data collection and training process, reasoning over relative spatial relations between different entities, and dynamic spatial reasoning. Lastly, Chengzu shares insights from experiments with MVoT, focusing on the lessons learned and the potential for applying these models in real-world scenarios like robotics and architectural design.

🎧 / 🎥 Listen or watch the full episode on our page: https://twimlai.com/go/722.

🔔 Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confirmation=1

🗣️ CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: https://twitter.com/twimlai
Follow us on LinkedIn: https://www.linkedin.com/company/twimlai/
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/

📖 CHAPTERS
===============================
00:00 - Introduction
04:15 - Motivation of MVoT
08:29 - TopViewRS
11:31 - Maze solving
12:50 - LLMs and multimodal models in spatial reasoning
15:19 - MVoT framework
22:23 - Token discrepancy loss
27:41 - Data collection and model training
31:35 - Lessons learned
33:31 - Fine-tuning
36:45 - Real-world scenarios
39:35 - Alternative approaches

🔗 LINKS & RESOURCES
===============================
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought - https://arxiv.org/abs/2501.07542
TopViewRS: Vision-Language Models as Top-View Spatial Reasoners - https://arxiv.org/abs/2406.02537

📸 Camera: https://amzn.to/3TQ3zsg
🎙️ Microphone: https://amzn.to/3t5zXeV
🚦 Lights: https://amzn.to/3TQlX49
🎛️ Audio Interface: https://amzn.to/3TVFAIq
🎚️ Stream Deck: https://amzn.to/3zzm7F5