Benchmarks

Models are evaluated on the following benchmarks.

| Benchmark | Metric | Score weight | Cost weight | Models | Cost data? | Private holdout? |
|---|---|---|---|---|---|---|
| Aider Polyglot | Pass rate (PR@2) on Aider's polyglot benchmark | 1 | 1 | 34 | Yes | No |
| ARC-AGI-1 | Accuracy on ARC-AGI-1 | 1 | 1 | 42 | Yes | Yes |
| ARC-AGI-2 | Accuracy on ARC-AGI-2 | 1 | 1 | 42 | Yes | Yes |
| Artificial Analysis Index | Score on Artificial Analysis Index benchmark | 1 | 1 | 43 | Yes | No |
| EQ-Bench 3 | Elo score on EQ-Bench 3 | 1 | 1 | 24 | No | No |
| GPQA Diamond | Score on GPQA Diamond benchmark | 1 | 1 | 43 | Yes | No |
| Humanity's Last Exam | Score on Scale's Humanity's Last Exam | 1 | 1 | 43 | Yes | No |
| LiveBench | Average score across LiveBench categories | 1 | 1 | 32 | No | Yes |
| LMArena Text | Elo score on LMArena Text leaderboard | 1 | 1 | 34 | No | Yes |
| MMLU Pro | Score on Scale's MMLU Pro benchmark | 1 | 1 | 42 | Yes | No |
| SimpleBench | Score (AVG@5) on SimpleBench | 1 | 1 | 24 | No | Yes |
| WeirdML | Average accuracy across WeirdML tasks | 1 | 1 | 27 | Yes | No |
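
The table lists a score weight per benchmark but does not say how the weights are combined. One common reading is a weighted mean over the benchmarks a model has been scored on; the sketch below illustrates only that assumption and is not the documented aggregation method. The `Benchmark` class, the `weighted_mean` function, and the example scores are hypothetical, and the weights of 1 reflect one reading of the table rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    name: str
    score_weight: float  # "Score weight" column, read as 1 for every row


# A few rows from the table above (names as listed there).
BENCHMARKS = [
    Benchmark("Aider Polyglot", 1),
    Benchmark("ARC-AGI-1", 1),
    Benchmark("LiveBench", 1),
]


def weighted_mean(scores: dict[str, float], benchmarks: list[Benchmark]) -> float:
    """Weighted mean of `scores`, skipping benchmarks the model was not run on.

    ASSUMPTION: weights enter as a plain weighted average. The source
    does not state the actual aggregation rule.
    """
    pairs = [(b.score_weight, scores[b.name]) for b in benchmarks if b.name in scores]
    total_weight = sum(weight for weight, _ in pairs)
    if total_weight == 0:
        raise ValueError("no scored benchmarks with non-zero weight")
    return sum(weight * score for weight, score in pairs) / total_weight


# Hypothetical scores on a 0-1 scale, not real results.
example_scores = {"Aider Polyglot": 0.60, "ARC-AGI-1": 0.30}
aggregate = weighted_mean(example_scores, BENCHMARKS)
```

With every weight equal to 1, as in the table, this reduces to an unweighted mean over the benchmarks that have a score; the weights only matter if some are later changed.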