Caltech CV4E - Distribution Shifts, Data Poisoning, Train/Val/Test Split - Julia Chae

2025 Computer Vision for Ecology Workshop at Caltech - Lecture 2

MIT CSAIL PhD Candidate Julia Chae shares guidelines for preparing a machine-learning-ready dataset. Her lecture covers what the ideal dataset looks like and common pitfalls in splitting a dataset into training, validation, and test sets. These pitfalls include data poisoning, unaddressed domain shifts, and other generalization issues.

This lecture is part of a summer workshop at Caltech that teaches PhDs and postdocs in ecology how to apply computer vision to their own research projects. Over three weeks, the students implement a computer vision algorithm, for example, to count walruses from space, detect invasive rats, or identify which gorilla is beating their chest. These lectures guide them through those three weeks. See https://cv4ecology.caltech.edu for more information.

Edited by Björn Lütjens.

⛆ Contents ⛆
🐈‍⬛ 0:00 - Definition of a Computer Vision task
🐈‍⬛ 1:17 - Ideal Qualities of a Dataset
🐈‍⬛ 4:47 - Why we need a training, validation, and test set
🐈‍⬛ 11:45 - Benchmark dataset
🐈‍⬛ 13:25 - Pitfalls | Introduction
🐈‍⬛ 15:11 - Pitfalls | Data Poisoning
🐈‍⬛ 21:41 - Pitfalls | Extrapolation and out-of-domain
🐈‍⬛ 23:20 - Pitfalls | Real-world distribution shifts
🐈‍⬛ 34:15 - Think-pair-share activity 1
🐈‍⬛ 35:44 - Data organization
🐈‍⬛ 36:51 - Dataset rules of thumb
🐈‍⬛ 40:30 - Overcoming distribution shifts
🐈‍⬛ 48:02 - Think-pair-share activity 2