Interpreting and running standardized language model benchmarks and evaluation datasets for both generalized and task-specific performance assessment!
Resources:
lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
lm-evaluation-harness setup script: https://drive.google.com/file/d/1oWoWSBUdCiB82R-8m52nv_-5pylXEcDp/view?usp=sharing
Open LLM Leaderboard: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
YALL Leaderboard: https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard
MMLU Paper: https://arxiv.org/pdf/2009.03300
ARC Paper: https://arxiv.org/pdf/1803.05457
Orpo-Llama-3.2-1B-40k Model: https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-40k
Chapters:
00:00 - Introduction
01:21 - What Are LLM Benchmarks? MMLU Example
05:09 - Additional Benchmark Examples
09:03 - How to Interpret Benchmark Evaluations
14:40 - Running Evaluations: ARC-Challenge Setup
16:49 - Running Evaluations: lm-evaluation-harness Repo
19:02 - Running Evaluations: CLI Environment Setup
21:42 - Running Evaluations: Defining lm-eval Arguments (see the example below the chapter list)
24:27 - Running Evaluations: Starting Eval Run
26:49 - Running Evaluations: Interpreting Results
28:26 - Individual Implementation Differences
30:00 - Final Thoughts
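
For reference, here is a minimal sketch of an ARC-Challenge run like the one in the video, using lm-evaluation-harness's Python API instead of the CLI shown on screen. The model is the Orpo-Llama-3.2-1B-40k checkpoint linked above; the few-shot count and batch size are illustrative assumptions, not the exact settings used in the video.

```python
# Minimal sketch of an ARC-Challenge eval with lm-evaluation-harness (pip install lm-eval).
# The num_fewshot and batch_size values below are assumptions for illustration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=AdamLucek/Orpo-Llama-3.2-1B-40k",
    tasks=["arc_challenge"],
    num_fewshot=25,       # 25-shot, as used by the Open LLM Leaderboard for ARC
    batch_size="auto",
    device="cuda:0",
)

# Print per-task metrics (accuracy and length-normalized accuracy for ARC).
for task, metrics in results["results"].items():
    print(task, metrics)
```

The same run maps onto the lm_eval CLI arguments covered in the video (--model, --model_args, --tasks, --num_fewshot, --batch_size, --device).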
#ai #datascience #programming