Multimodal AI: LLMs that can see (and hear)

Get exclusive access to AI resources and project ideas: https://the-data-entrepreneurs.kit.com/shaw

Multimodal (Large) Language Models expand an LLM's text-only capabilities to include other modalities. Here are three ways to do this.

Resources:
📰 Blog: https://medium.com/towards-data-science/multimodal-models-llms-that-can-see-and-hear-5c6737c981d3?sk=d0897db8457c91706170d3043ebdbcf0
▶️ LLM Playlist: https://youtu.be/eC6Hd1hFvos
💻 GitHub Repo: https://github.com/ShawhinT/YouTube-Blog/tree/main/multimodal-ai

References:
[1] Multimodal Machine Learning: https://arxiv.org/abs/1705.09406
[2] A Survey on Multimodal Large Language Models: https://arxiv.org/abs/2306.13549
[3] Visual Instruction Tuning: https://arxiv.org/abs/2304.08485
[4] GPT-4o System Card: https://arxiv.org/abs/2410.21276
[5] Janus: https://arxiv.org/abs/2410.13848
[6] Learning Transferable Visual Models From Natural Language Supervision: https://arxiv.org/abs/2103.00020
[7] Flamingo: https://arxiv.org/abs/2204.14198
[8] Mini-Omni2: https://arxiv.org/abs/2410.11190
[9] Emu3: https://arxiv.org/abs/2409.18869
[10] Chameleon: https://arxiv.org/abs/2405.09818

--
Homepage: https://www.shawhintalebi.com

Chapters:
Introduction - 0:00
Multimodal LLMs - 1:49
Path 1: LLM + Tools - 4:24
Path 2: LLM + Adapters - 7:20
Path 3: Unified Models - 11:19
Example: LLaMA 3.2 for Vision Tasks (Ollama) - 13:24 (usage sketch below)
What's next? - 19:58
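
The GitHub repo above contains the full example code; as a rough orientation only, here is a minimal sketch of the Ollama vision example from the chapter list. It assumes the ollama Python package is installed, the Ollama server is running locally, and the llama3.2-vision model has been pulled (ollama pull llama3.2-vision). The image filename is a placeholder, not a file from the repo.

# Sketch: describe a local image with Llama 3.2 Vision via the Ollama Python client
import ollama

response = ollama.chat(
    model="llama3.2-vision",
    messages=[
        {
            "role": "user",
            "content": "Describe this image in one sentence.",
            "images": ["example.jpg"],  # placeholder path to a local image
        }
    ],
)

# Print the model's text reply
print(response["message"]["content"])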