LLM Chronicles #6.3: Multi-Modal LLMs for Image, Sound and Video

In this episode we look at the architecture and training of multi-modal LLMs (MLLMs). We then focus on vision, exploring Vision Transformers and how they are trained with contrastive learning (OpenAI's CLIP and Google's SigLIP). Vision Transformers are the most commonly used building block in MLLMs with vision capabilities. Finally, we get hands-on with Google's open-weight PaliGemma, analysing its implementation to see these concepts in action in a real-world multi-modal LLM.
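
To make the contrastive-learning part concrete, here is a minimal sketch of the two loss styles named above. It is not taken from the episode's notebook. It assumes a batch of already-computed image and text embeddings where row i of each list forms a matching pair. The function names are hypothetical, and the temperature, scale and bias are fixed constants here, whereas in the real models they are learned parameters.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_embs, text_embs):
    """Cosine similarity between every image and every text embedding."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def log_softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_loss(sim, temperature=0.07):
    """CLIP-style loss: symmetric softmax cross-entropy over the batch.

    Matching pairs sit on the diagonal of `sim`. The temperature is an
    assumed fixed value for this sketch.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    img_to_txt = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    txt_to_img = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return (img_to_txt + txt_to_img) / 2

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def siglip_loss(sim, scale=10.0, bias=-10.0):
    """SigLIP-style loss: every image-text pair is an independent
    binary classification (label +1 on the diagonal, -1 elsewhere).

    `scale` and `bias` are assumed fixed values for this sketch.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * (scale * sim[i][j] + bias))
    return -total / n
```

Calling `similarity_matrix` on two lists of embedding vectors and passing the result to either loss function gives a single scalar. The main design difference is that the softmax version normalises each row and column across the whole batch, while the sigmoid version scores each pair on its own.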

Series website: https://llm-chronicles.com/

🖹 Canvas and Colab Notebook:
- Canvas: https://llm-chronicles.com/pdfs/llm-chronicles-6.3_multi-modal-llm.pdf
- Colab Notebook: https://colab.research.google.com/drive/1wEkBQcYq8-xsyGlvgxpXQxjP_yN9F0FU?usp=sharing

🕤 Timestamps:
01:32 - MLLM Architecture
03:49 - Training MLLMs
07:02 - Vision Transformer
09:24 - Contrastive Learning (CLIP, SigLIP)
12:35 - Lab: PaliGemma
22:53 - Summary

References:
- Vision Transformer: https://arxiv.org/pdf/2010.11929
- Survey of multi-modal LLMs: https://arxiv.org/pdf/2306.13549
- Microsoft's CLAP: https://arxiv.org/pdf/2206.04769
- SigLIP: https://arxiv.org/pdf/2303.15343
