Interpreting and running standardized language model benchmarks and evaluation datasets for both generalized and task-specific performance assessment!
Resources:
lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
lm-evaluation-harness setup script: https://drive.google.com/file/d/1oWoWSBUdCiB82R-8m52nv_-5pylXEcDp/view?usp=sharing
Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
YALL Leaderboard: https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard
MMLU Paper: https://arxiv.org/pdf/2009.03300
ARC Paper: https://arxiv.org/pdf/1803.05457
Orpo-Llama-3.2-1B-40k Model: https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-40k
Chapters:
00:00 - Introduction
01:21 - What Are LLM Benchmarks? MMLU Example
05:09 - Additional Benchmark Examples
09:03 - How to Interpret Benchmark Evaluations
14:40 - Running Evaluations: ARC-Challenge Setup
16:49 - Running Evaluations: lm-evaluation-harness Repo
19:02 - Running Evaluations: CLI Environment Setup
21:42 - Running Evaluations: Defining lm-eval Arguments (see the example below the chapter list)
24:27 - Running Evaluations: Starting Eval Run
26:49 - Running Evaluations: Interpreting Results
28:26 - Individual Implementation Differences
30:00 - Final Thoughts
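
For reference, here is a minimal sketch of an ARC-Challenge run like the one in the video, using lm-evaluation-harness's Python API instead of the CLI shown on screen. The model is the Orpo-Llama-3.2-1B-40k checkpoint linked above; the few-shot count and batch size are illustrative assumptions, not the exact settings used in the video.

```python
# Minimal sketch of an ARC-Challenge eval with lm-evaluation-harness (pip install lm-eval).
# The num_fewshot and batch_size values below are assumptions for illustration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=AdamLucek/Orpo-Llama-3.2-1B-40k",
    tasks=["arc_challenge"],
    num_fewshot=25,       # 25-shot, as used by the Open LLM Leaderboard for ARC
    batch_size="auto",
    device="cuda:0",
)

# Print per-task metrics (accuracy and length-normalized accuracy for ARC).
for task, metrics in results["results"].items():
    print(task, metrics)
```

The same run maps onto the lm_eval CLI arguments covered in the video (--model, --model_args, --tasks, --num_fewshot, --batch_size, --device).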
#ai #datascience #programming