What Do LLM Benchmarks Actually Tell Us? (+ How to Run Your Own)

Interpreting and running standardized language model benchmarks and evaluation datasets for both generalized and task-specific performance assessment. (An example evaluation run is sketched after the chapter list below.)

Resources:
- lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
- lm-evaluation-harness setup script: https://drive.google.com/file/d/1oWoWSBUdCiB82R-8m52nv_-5pylXEcDp/view?usp=sharing
- OpenLLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
- YALL Leaderboard: https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard
- MMLU Paper: https://arxiv.org/pdf/2009.03300
- ARC Paper: https://arxiv.org/pdf/1803.05457
- Orpo-Llama-3.2-1B-40k Model: https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-40k

Chapters:
00:00 - Introduction
01:21 - What Are LLM Benchmarks? MMLU Example
05:09 - Additional Benchmark Examples
09:03 - How to Interpret Benchmark Evaluations
14:40 - Running Evaluations: Arc-Challenge Setup
16:49 - Running Evaluations: lm-evaluation-harness Repo
19:02 - Running Evaluations: CLI Environment Setup
21:42 - Running Evaluations: Defining lm-eval Arguments
24:27 - Running Evaluations: Starting Eval Run
26:49 - Running Evaluations: Interpreting Results
28:26 - Individual Implementation Differences
30:00 - Final Thoughts

#ai #datascience #programming
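
The "Defining lm-eval Arguments" and "Starting Eval Run" chapters use the lm-evaluation-harness CLI. As a companion, the snippet below is a minimal sketch of the equivalent Python API (lm_eval.simple_evaluate), assuming a recent lm-eval release, the Orpo-Llama-3.2-1B-40k model linked above, and a single GPU. The few-shot count and batch size are illustrative choices, not necessarily the exact values used in the video.

```python
# Minimal sketch: evaluating a Hugging Face model on ARC-Challenge with
# lm-evaluation-harness (pip install lm-eval). Argument values here are
# illustrative assumptions, not the video's exact settings.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                # Hugging Face transformers backend
    model_args="pretrained=AdamLucek/Orpo-Llama-3.2-1B-40k",   # model linked in the resources above
    tasks=["arc_challenge"],
    num_fewshot=25,      # 25-shot is the Open LLM Leaderboard convention for ARC-Challenge
    batch_size=8,        # assumption: adjust to fit available VRAM
    device="cuda:0",     # assumption: single GPU; use "cpu" if none is available
)

# results["results"] maps each task to its metrics (e.g. acc and acc_norm for ARC).
print(results["results"]["arc_challenge"])
```

The same run can be launched from a terminal with the lm_eval command and matching flags (--model, --model_args, --tasks, --num_fewshot, --batch_size, --device), which is the path the video's CLI chapters walk through.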