Models are evaluated on the following benchmarks:
Pass rate within two attempts (pass@2) on Aider's polyglot benchmark
Accuracy on ARC-AGI-1
Accuracy on ARC-AGI-2
Score on the Artificial Analysis Intelligence Index
Overall accuracy on BFCL (Berkeley Function-Calling Leaderboard)
Elo score on EQ-Bench 3
Score on GPQA Diamond benchmark
Score on Scale's Humanity's Last Exam
Average score across LiveBench categories
Elo score on LMArena Text leaderboard
Accuracy on MathArena AIME 2025 competition
Accuracy on MathArena BRUMO 2025 competition
Accuracy on MathArena CMIMC 2025 competition
Accuracy on MathArena HMMT 2025 competition
Accuracy on MathArena SMT 2025 competition
Score on Scale's MMLU-Pro benchmark
Score (AVG@5) on SimpleBench
Average accuracy across WeirdML tasks
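Two aggregation styles recur in the list above: a pass rate counting tasks solved within a fixed number of attempts (as in Aider's polyglot metric), and an unweighted average over categories or tasks (as in the LiveBench and WeirdML entries). A minimal sketch of both, with entirely hypothetical toy data and function names (not the benchmarks' own code):

```python
def pass_rate_within_attempts(attempt_results, k=2):
    """Fraction of tasks solved within the first k attempts
    (the style of a pass@2 / pass-rate-2 metric)."""
    solved = sum(1 for attempts in attempt_results if any(attempts[:k]))
    return solved / len(attempt_results)

def macro_average(per_category_scores):
    """Unweighted mean over categories (the style of a
    LiveBench-like category average)."""
    return sum(per_category_scores.values()) / len(per_category_scores)

# Toy data: per-task attempt outcomes, and per-category scores.
attempts = [[False, True], [True], [False, False], [True, False]]
scores = {"coding": 0.60, "math": 0.50, "reasoning": 0.70}

print(pass_rate_within_attempts(attempts))  # 0.75 (3 of 4 tasks solved)
print(round(macro_average(scores), 2))      # 0.6
```

Note the macro average weights every category equally regardless of how many tasks it contains; benchmarks that instead pool all tasks before averaging can report different numbers from the same raw results.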