About
Overview of this project and how average scores are computed.
Goal
The purpose of this leaderboard is not to crown the best language model for a specific task. Instead, it aggregates results from popular benchmarks so you can quickly see which lab is currently ahead across the evaluations the community follows.
Methodology
Models are evaluated using a collection of public benchmarks. Raw scores are scraped from official leaderboards and documentation.
For each benchmark we determine the minimum and maximum score across all models. Each score is then min–max normalised against that range, and the normalised scores are averaged to produce the leaderboard ranking.
\[
\text{normalized}_{i,j} = \frac{\text{score}_{i,j} - \min_j}{\max_j - \min_j},
\qquad
\text{average}_i = \frac{100}{B} \sum_{j=1}^{B} \text{normalized}_{i,j}
\]
This approach ensures that benchmarks with different scales contribute equally to the final average.
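As a rough illustration, a minimal Python sketch of this aggregation is shown below; the array and function names (such as scores and average_scores) are placeholders for the example, not the leaderboard's actual code:

import numpy as np

def average_scores(scores: np.ndarray) -> np.ndarray:
    # `scores` is a (models x benchmarks) array of raw benchmark scores.
    # Per-benchmark minimum and maximum across all models (columns = benchmarks).
    mins = scores.min(axis=0)
    maxs = scores.max(axis=0)
    # Min-max normalisation maps every benchmark onto the [0, 1] range.
    normalised = (scores - mins) / (maxs - mins)
    # Average across benchmarks and scale to a 0-100 leaderboard score.
    return normalised.mean(axis=1) * 100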
Some benchmarks also publish pricing information for completing a single task. When available, these values are used to compute a Cost Per Task metric shown on the leaderboard.
Costs are normalised separately from accuracy scores. Because many leaderboards omit pricing for some models, we fit a rank‑1 matrix factorisation using singular value decomposition to estimate a scale weight for each benchmark from the incomplete cost matrix. Each model’s cost is multiplied by the inverse of its benchmark’s weight before averaging.
\[
\text{factor}_j = \frac{1}{u_j},
\qquad
\text{cost}'_{i,j} = \text{cost}_{i,j} \times \text{factor}_j,
\qquad
\text{CPT}_i = \frac{1}{B} \sum_{j=1}^{B} \text{cost}'_{i,j}
\]
Each model’s cost per task figure is then the mean of its normalised costs across the available benchmarks, allowing a fair comparison across benchmarks with different price scales.
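The sketch below illustrates these steps in Python; the column-mean imputation of missing prices and the rescaling of the weights are assumptions made for the example, not a description of the leaderboard's exact implementation:

import numpy as np

def cost_per_task(costs: np.ndarray) -> np.ndarray:
    # `costs` is a (models x benchmarks) matrix of per-task costs,
    # with NaN where a leaderboard does not publish pricing.
    filled = costs.copy()
    col_means = np.nanmean(costs, axis=0)
    rows, cols = np.where(np.isnan(costs))
    filled[rows, cols] = col_means[cols]  # assumed imputation, for illustration

    # Rank-1 factorisation: the leading right singular vector gives a
    # per-benchmark scale weight u_j.
    _, _, vt = np.linalg.svd(filled, full_matrices=False)
    u = np.abs(vt[0])
    u = u / u.mean()  # rescale so the average weight is 1 (illustrative choice)

    # Multiply each cost by 1 / u_j, then average over the benchmarks
    # that actually report pricing for that model.
    normalised = costs / u
    return np.nanmean(normalised, axis=1)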
Why Cost Per Task?
Pricing is often reported per thousand output tokens, but the amount of text a model produces varies widely from benchmark to benchmark. Epoch AI’s analysis highlights just how large these differences can be.
Expressing prices as a total cost per task avoids distorted comparisons and gives a better estimate of how much these models cost to run on real-world tasks.
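For example, the conversion from per-token pricing to a per-task cost looks roughly like this; the prices and token counts are made-up placeholders, not measured values:

# Hypothetical per-token prices and token counts, for illustration only.
price_per_1k_input = 0.0005    # USD per 1,000 input tokens
price_per_1k_output = 0.0015   # USD per 1,000 output tokens

input_tokens_per_task = 2_000    # e.g. a long benchmark prompt
output_tokens_per_task = 8_000   # e.g. a verbose, step-by-step answer

cost_per_task = (
    input_tokens_per_task / 1_000 * price_per_1k_input
    + output_tokens_per_task / 1_000 * price_per_1k_output
)
print(f"${cost_per_task:.4f} per task")  # -> $0.0130 per task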