Benchmarks

Models are evaluated on the following benchmarks.

Benchmark | Score weight | Cost weight | Models | Cost data? | Private holdout? | Details
--------- | ------------ | ----------- | ------ | ---------- | ---------------- | -------
Aider Polyglot | 1 | 1 | 34 | Yes | No | Pass rate (PR@2) on Aider's polyglot benchmark
ARC-AGI-1 | 1 | 1 | 43 | Yes | Yes | Accuracy on ARC-AGI-1
ARC-AGI-2 | 1 | 1 | 43 | Yes | Yes | Accuracy on ARC-AGI-2
Artificial Analysis Index | 1 | 1 | 51 | Yes | No | Score on Artificial Analysis Index benchmark
Berkeley Function Call Leaderboard (BFCL) | 1 | 1 | 20 | Yes | Yes | Overall accuracy on BFCL (function calling)
EQ-Bench 3 | 1 | 1 | 26 | No | No | Elo score on EQ-Bench 3
GPQA Diamond | 1 | 1 | 49 | Yes | No | Score on GPQA Diamond benchmark
Humanity's Last Exam | 1 | 1 | 49 | Yes | No | Score on Scale's Humanity's Last Exam
LiveBench | 1 | 1 | 45 | No | Yes | Average score across LiveBench categories
LMArena Text | 1 | 1 | 34 | No | Yes | Elo score on LMArena Text leaderboard
MathArena AIME 2025 | 1 | 1 | 27 | Yes | No | Accuracy on MathArena AIME 2025 competition
MathArena BRUMO 2025 | 1 | 1 | 23 | Yes | No | Accuracy on MathArena BRUMO 2025 competition
MathArena CMIMC 2025 | 1 | 1 | 18 | Yes | No | Accuracy on MathArena CMIMC 2025 competition
MathArena HMMT 2025 | 1 | 1 | 27 | Yes | No | Accuracy on MathArena HMMT 2025 competition
MathArena SMT 2025 | 1 | 1 | 21 | Yes | No | Accuracy on MathArena SMT 2025 competition
MMLU Pro | 1 | 1 | 48 | Yes | No | Score on Scale's MMLU Pro benchmark
SimpleBench | 1 | 1 | 24 | No | Yes | Score (AVG@5) on SimpleBench
WeirdML | 1 | 1 | 30 | Yes | No | Average accuracy across WeirdML tasks