
Pax Arena Methodology

How we rank AI models through blind pairwise comparisons

Questions or concerns about our methodology?

We welcome scrutiny of our ranking approach. If you believe there is an issue with how models are evaluated, ranked, or compared, or if you have suggestions for improvement, please reach out.

Ping @development on Discord, or email eli@paxhistoria.co

Overview

Pax Arena rankings are calculated using Bradley-Terry Maximum Likelihood Estimation with Fisher Information-based confidence intervals. When you play a game on Pax Historia, the AI models that generated the content are compared through blind A/B evaluations — you see the outputs side by side without knowing which model produced which, and your preference is recorded as a vote.
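The Bradley-Terry model described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not Pax Arena's actual code: the function names, the plain gradient-ascent optimizer, and the constants are assumptions, but the math (scores fit by maximum likelihood with a reference model pinned at 0, and 95% intervals from the inverse Fisher information) matches the description.

```python
import numpy as np

def fit_bradley_terry(votes, n_models, ref=0, iters=2000, lr=0.01):
    """Fit Bradley-Terry log-odds scores by gradient ascent on the likelihood.

    votes: list of (winner, loser) model-index pairs from blind A/B votes.
    The reference model `ref` is anchored at score 0.
    """
    s = np.zeros(n_models)
    for _ in range(iters):
        grad = np.zeros(n_models)
        for w, l in votes:
            p = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))  # P(winner beats loser)
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        s += lr * grad
        s[ref] = 0.0  # re-anchor the reference model each step
    return s

def fisher_confidence(s, votes, ref=0):
    """95% CI half-widths from the inverse Fisher information matrix.

    The anchored reference model's row/column is dropped before inversion
    (its score is fixed, so it carries no uncertainty).
    """
    n = len(s)
    F = np.zeros((n, n))
    for w, l in votes:
        p = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))
        info = p * (1.0 - p)  # Fisher information contributed by one vote
        F[w, w] += info
        F[l, l] += info
        F[w, l] -= info
        F[l, w] -= info
    free = [i for i in range(n) if i != ref]
    cov = np.linalg.inv(F[np.ix_(free, free)])
    half_width = np.zeros(n)
    half_width[free] = 1.96 * np.sqrt(np.diag(cov))
    return half_width
```

For example, a model that wins 70 of 100 blind comparisons against the reference ends up with a fitted score near ln(70/30) ≈ 0.85, and the confidence interval narrows as more votes accumulate.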

Rather than comparing models randomly, our system uses A-optimal active learning (Fisher-Greedy selection) to choose which pairs to compare next. This prioritizes comparisons that most efficiently reduce overall uncertainty — particularly targeting newly introduced models and close matchups where the ranking is still uncertain.
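One way to implement the greedy A-optimal step is a one-vote lookahead: for every candidate pair, compute how much a single additional vote would shrink the trace of the inverse Fisher information (the summed variance of all score estimates) and pick the pair with the biggest reduction. The sketch below is an assumption about the mechanism, not the production selector; it uses the Sherman-Morrison identity so each candidate is evaluated without re-inverting the matrix. Note how it naturally favors high-variance (newly introduced) models and close matchups, as the text describes.

```python
import numpy as np
from itertools import combinations

def fisher_greedy_pair(s, F_free, free_idx, ref=0):
    """Greedy A-optimal selection: pick the next pair to compare.

    s        : current score vector over all models (ref anchored at 0).
    F_free   : Fisher information matrix over the non-reference models.
    free_idx : global model indices corresponding to F_free's rows/cols.
    Returns the pair whose one added vote most reduces trace(F^-1).
    """
    C = np.linalg.inv(F_free)            # current covariance of free scores
    pos = {m: k for k, m in enumerate(free_idx)}
    best_pair, best_drop = None, -1.0
    models = [ref] + list(free_idx)
    for i, j in combinations(models, 2):
        p = 1.0 / (1.0 + np.exp(-(s[i] - s[j])))
        info = p * (1.0 - p)             # information one (i, j) vote adds
        v = np.zeros(len(free_idx))      # rank-1 update direction e_i - e_j
        if i != ref:
            v[pos[i]] += 1.0
        if j != ref:
            v[pos[j]] -= 1.0
        Cv = C @ v
        # Sherman-Morrison: reduction in trace(C) from adding info * v v^T to F
        drop = info * (Cv @ Cv) / (1.0 + info * (v @ Cv))
        if drop > best_drop:
            best_pair, best_drop = (i, j), drop
    return best_pair
```

With one well-measured model (many votes, small variance) and one new model (few votes, large variance), the selector picks a pair involving the new model, since that comparison cuts total uncertainty the most.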


Vote Filtering

Not all votes are counted toward rankings. If either model's response is blank, contains invalid JSON, or produces zero events, that comparison is excluded. This means the leaderboard measures model quality, not model reliability — a model that crashes 50% of the time but produces excellent output the other 50% will still rank well here. Reliability is tracked separately and is a key factor in which models we promote to production.
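The three exclusion rules above amount to a simple validity check run on both responses before a vote is counted. This is a minimal sketch; the field name "events" and the flat schema are assumptions based on the description, not the actual response format.

```python
import json

def vote_counts_toward_ranking(response_a: str, response_b: str) -> bool:
    """Return True only if BOTH responses are valid, per the filtering rules.

    A vote is excluded when either response is blank, is not valid JSON,
    or produces zero events. (The "events" key is an assumed schema.)
    """
    for raw in (response_a, response_b):
        if not raw or not raw.strip():
            return False                  # blank response
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            return False                  # invalid JSON
        if not payload.get("events"):
            return False                  # zero events generated
    return True
```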


Score Display

Scores shown on the leaderboard are log-odds values anchored to a reference model (currently Grok 4 Fast, which is fixed at 0). A positive score means a model performs better than the reference; negative means worse. The key thing to look at is the relative difference between models — a 1-unit gap corresponds to roughly a 73% win probability in a head-to-head comparison. The 95% confidence intervals indicate how certain we are about each model's position.
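Converting a score gap into a head-to-head win probability is just the logistic function applied to the log-odds difference, which is where the "1-unit gap ≈ 73%" figure comes from (the helper name below is illustrative):

```python
import math

def win_probability(score_a: float, score_b: float) -> float:
    """Win probability for model A over model B implied by log-odds scores."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# A 1-unit gap gives sigmoid(1) ≈ 0.731, i.e. roughly a 73% win rate;
# a 0-unit gap gives exactly 0.5 (a coin flip).
```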