Get exclusive access to AI resources and project ideas: https://the-data-entrepreneurs.kit.com/shaw
Multimodal (Large) Language Models expand an LLM's text-only capabilities to include other modalities. Here are three ways to do this; a quick code sketch of the Ollama example is included at the end of this description.
Resources:
📰 Blog: https://medium.com/towards-data-science/multimodal-models-llms-that-can-see-and-hear-5c6737c981d3?sk=d0897db8457c91706170d3043ebdbcf0
▶️ LLM Playlist: https://youtu.be/eC6Hd1hFvos
💻 GitHub Repo: https://github.com/ShawhinT/YouTube-Blog/tree/main/multimodal-ai
References:
[1] Multimodal Machine Learning: https://arxiv.org/abs/1705.09406
[2] A Survey on Multimodal Large Language Models: https://arxiv.org/abs/2306.13549
[3] Visual Instruction Tuning: https://arxiv.org/abs/2304.08485
[4] GPT-4o System Card: https://arxiv.org/abs/2410.21276
[5] Janus: https://arxiv.org/abs/2410.13848
[6] Learning Transferable Visual Models From Natural Language Supervision: https://arxiv.org/abs/2103.00020
[7] Flamingo: https://arxiv.org/abs/2204.14198
[8] Mini-Omni2: https://arxiv.org/abs/2410.11190
[9] Emu3: https://arxiv.org/abs/2409.18869
[10] Chameleon: https://arxiv.org/abs/2405.09818
--
Homepage: https://www.shawhintalebi.com
Introduction - 0:00
Multimodal LLMs - 1:49
Path 1: LLM + Tools - 4:24
Path 2: LLM + Adapters - 7:20
Path 3: Unified Models - 11:19
Example: LLaMA 3.2 for Vision Tasks (Ollama) - 13:24
What's next? - 19:58
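
For reference, here is a minimal sketch of the kind of local vision call demonstrated in the Ollama example. This is a sketch only, not the exact code from the video or repo: it assumes the ollama Python package is installed, the Ollama server is running, the llama3.2-vision model has been pulled, and "example.jpg" is a placeholder path to a local image.

import ollama

# Send an image plus a text prompt to LLaMA 3.2 Vision running locally via Ollama.
response = ollama.chat(
    model="llama3.2-vision",
    messages=[
        {
            "role": "user",
            "content": "Describe what you see in this image.",
            "images": ["example.jpg"],  # placeholder: path to a local image file
        }
    ],
)

# Print the model's text reply (newer client versions also expose response.message.content).
print(response["message"]["content"])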