Yilun Zhao - MMVU Measuring Expert Level Multidiscipline Video Understanding

We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.

I am a CS PhD student at Yale working with Professor Arman Cohan on Natural Language Processing and Large Language Models. My current research focuses on (1) AI4Research and (2) expert-level, knowledge-intensive, (multimodal) reasoning (in specialized domains).

This session is brought to you by the Cohere For AI Open Science Community, a space where ML researchers, engineers, linguists, social scientists, and lifelong learners connect and collaborate with each other. We'd like to extend a special thank you to Ahmad Anis, Lead of our Geo Regional Asia group, for their dedication in organizing this event.

If you’re interested in sharing your work, we welcome you to join us! Simply fill out the form at https://forms.gle/ALND9i6KouEEpCnz6 to express your interest in becoming a speaker. 

Join the Cohere For AI Open Science Community to see a full list of upcoming events (https://tinyurl.com/C4AICommunityApp).
