Models are evaluated on the following benchmarks:
Pass rate within two attempts (pass@2) on Aider's polyglot benchmark
Accuracy on ARC-AGI-1
Accuracy on ARC-AGI-2
Score on the Artificial Analysis Intelligence Index
Overall accuracy on BFCL (Berkeley Function-Calling Leaderboard)
Elo score on EQ-Bench 3
Score on GPQA Diamond benchmark
Score on Scale's Humanity's Last Exam
Average score across LiveBench categories
Elo score on LMArena Text leaderboard
Accuracy on MathArena AIME 2025 competition
Accuracy on MathArena BRUMO 2025 competition
Accuracy on MathArena CMIMC 2025 competition
Accuracy on MathArena HMMT 2025 competition
Accuracy on MathArena SMT 2025 competition
Score on Scale's MMLU-Pro benchmark
Score (AVG@5) on SimpleBench
Average accuracy across WeirdML tasks
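Two aggregation styles recur in the list above: a pass rate counting tasks solved within a fixed number of attempts (as in Aider's polyglot metric), and an unweighted average over categories or tasks (as in the LiveBench and WeirdML entries). A minimal sketch of both, with entirely hypothetical toy data and function names (not the benchmarks' own code):

```python
def pass_rate_within_attempts(attempt_results, k=2):
    """Fraction of tasks solved within the first k attempts
    (the style of a pass@2 / pass-rate-2 metric)."""
    solved = sum(1 for attempts in attempt_results if any(attempts[:k]))
    return solved / len(attempt_results)

def macro_average(per_category_scores):
    """Unweighted mean over categories (the style of a
    LiveBench-like category average)."""
    return sum(per_category_scores.values()) / len(per_category_scores)

# Toy data: per-task attempt outcomes, and per-category scores.
attempts = [[False, True], [True], [False, False], [True, False]]
scores = {"coding": 0.60, "math": 0.50, "reasoning": 0.70}

print(pass_rate_within_attempts(attempts))  # 0.75 (3 of 4 tasks solved)
print(round(macro_average(scores), 2))      # 0.6
```

Note the macro average weights every category equally regardless of how many tasks it contains; benchmarks that instead pool all tasks before averaging can report different numbers from the same raw results.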