We will fine-tune VLMs to chat with images using Python! Specifically, we'll fine-tune the Qwen2-VL-7B-Instruct model using LoRA and 4-bit quantization. GitHub below ↓
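The two techniques named above (4-bit quantization and LoRA) boil down to two config objects passed at model-load time. A minimal sketch using Hugging Face `transformers` and `peft` — the hyperparameter values here are illustrative placeholders, not necessarily the ones used in the video:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# Load the base model weights in 4-bit NF4 to cut memory usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Train only small low-rank adapter matrices on the attention projections
lora_config = LoraConfig(
    r=8,                                  # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative target layers
    task_type="CAUSAL_LM",
)
```

`bnb_config` would be passed as `quantization_config=` when calling `Qwen2VLForConditionalGeneration.from_pretrained(...)`, and `lora_config` to `peft.get_peft_model(...)`.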
Want to support the channel? Hit that like button and subscribe!
GitHub link to the code
https://github.com/uygarkurt/Fine-Tune-VLMs
Qwen2-VL-7B Model
https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct
Dataset
https://huggingface.co/datasets/HuggingFaceM4/ChartQA
What should I implement next? Let me know in the comments!
00:00 Introduction
00:50 Install Necessary Libraries
01:49 Imports
03:40 Hyperparameter Definitions
08:12 Dataset Preparation
22:38 Load VL Model and Processor
25:06 Sample Inference
32:18 Configure LoRA
34:05 Training Arguments Configuration
35:42 Data Collator
39:03 Configure Trainer
39:55 Start the VLM Training
40:42 After Training Inference and Evaluation
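One detail worth noting from the Data Collator chapter: when building the `labels` tensor for training, padding tokens and image placeholder tokens are typically replaced with `-100` so PyTorch's cross-entropy loss ignores them and the model only learns from the real text tokens. A minimal sketch of that masking logic (the token IDs below are made up for illustration; `151655` is assumed here to be Qwen2-VL's image pad token):

```python
IGNORE_INDEX = -100  # label value that CrossEntropyLoss skips

def mask_labels(input_ids, special_token_ids):
    """Copy input_ids into labels, hiding pad/image tokens from the loss."""
    return [IGNORE_INDEX if tok in special_token_ids else tok for tok in input_ids]

# Hypothetical sequence: two image placeholders, two text tokens, two pads (id 0)
labels = mask_labels([151655, 151655, 9906, 1917, 0, 0], {0, 151655})
# → [-100, -100, 9906, 1917, -100, -100]
```

The real collator in the video operates on batched tensors from the processor, but the masking idea is the same.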
References
https://huggingface.co/learn/cookbook/en/fine_tuning_vlm_trl
https://huggingface.co/docs/trl/en/sft_trainer
https://huggingface.co/docs/transformers/main/en/tasks/visual_question_answering
Buy me a coffee! ☕️
https://ko-fi.com/uygarkurt