Models are evaluated on the following benchmarks:
Pass rate (PR@2) on Aider's polyglot benchmark
Accuracy on ARC-AGI-1
Accuracy on ARC-AGI-2
Score on the Artificial Analysis Index benchmark
Elo score on EQ-Bench 3
Score on the GPQA Diamond benchmark
Score on Scale's Humanity's Last Exam
Average score across LiveBench categories
Elo score on the LMArena Text leaderboard
Score on Scale's MMLU Pro benchmark
Score (AVG@5) on SimpleBench
Average accuracy across WeirdML tasks
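
Several of the metrics above are aggregates: AVG@5 presumably averages a score over five runs, and the LiveBench and WeirdML figures average over categories or tasks. The following is a minimal sketch of those two aggregations, assuming plain unweighted means; the benchmarks' own procedures may differ. The function names, category labels, and sample numbers are hypothetical and exist only for illustration.

```python
from statistics import mean


def avg_at_k(run_scores, k=5):
    """Unweighted mean over exactly k independent runs (assumed meaning of AVG@k)."""
    if len(run_scores) != k:
        raise ValueError(f"expected {k} runs, got {len(run_scores)}")
    return mean(run_scores)


def macro_average(per_category):
    """Unweighted mean of per-category scores, so each category counts equally."""
    if not per_category:
        raise ValueError("no categories given")
    return mean(list(per_category.values()))


# Hypothetical numbers for illustration only; these are not real benchmark results.
example_runs = [0.5, 0.25, 0.75, 0.5, 0.5]
example_categories = {"reasoning": 60.0, "coding": 70.0, "math": 80.0}

if __name__ == "__main__":
    print(avg_at_k(example_runs))
    print(macro_average(example_categories))
```

An unweighted mean treats every run and every category equally, regardless of how many questions each contains; a benchmark that weights by question count would need a different aggregation.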