CSPaper Benchmark: AI Models for CS Paper Reviews

We rigorously test leading AI models to ensure you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing OpenAI's GPT-5.1, GPT-5, GPT-4.1, o3, and o4-mini with Google's Gemini 3 Pro (preview), Gemini 2.5 Pro, and Gemini 2.5 Flash across various top-tier computer science conferences.

Try it yourself: 3 free reviews/month

Privacy and security first

CSPaper ensures all papers remain confidential and legally protected. Content is processed securely and safeguarded under enterprise-level agreements with LLM providers — guaranteeing it is never used to train any model.

How we benchmark

Normalized Mean Absolute Error (NMAE) quantifies how close the predicted paper ratings are to the established ground-truth ratings, after accounting for the venue-specific rating scales. It provides an interpretable, scale-agnostic measure of overall review accuracy, where lower NMAE values indicate better alignment between predicted and true review ratings.

NMAE (mean ± standard deviation) by conference and model:

| Conference | GPT-5 | gemini-2.5-pro | gemini-3-pro-preview | GPT-5.1 | gemini-2.5-flash | GPT-4.1 | o3 | o4-mini |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AAAI 2022 main technical | 0.079±0.106 | 0.143±0.166 | 0.048±0.059 | 0.175±0.239 | 0.159±0.180 | 0.175±0.239 | 0.111±0.143 | 0.159±0.155 |
| AAAI 2024 safe and robust AI | 0.044±0.061 | 0.044±0.099 | 0.267±0.420 | 0.089±0.145 | 0.044±0.061 | 0.089±0.122 | 0.067±0.099 | 0.111±0.111 |
| AAAI 2026 main technical | 0.085±0.080 | 0.128±0.268 | 0.077±0.095 | 0.385±0.444 | 0.256±0.270 | 0.085±0.129 | 0.196±0.265 | 0.154±0.160 |
| AAMAS 2024 main technical | 0.130±0.109 | 0.037±0.057 | 0.074±0.057 | 0.093±0.109 | 0.093±0.130 | 0.037±0.057 | 0.111±0.099 | 0.111±0.070 |
| ACL 2025 main | 0.083±0.151 | 0.167±0.151 | 0.125±0.158 | 0.167±0.219 | 0.354±0.357 | 0.229±0.166 | 0.125±0.137 | 0.188±0.190 |
| AISTATS 2025 main | 0.200±0.209 | 0.200±0.209 | 0.100±0.137 | 0.200±0.209 | 0.200±0.112 | 0.200±0.112 | 0.300±0.209 | 0.150±0.137 |
| CVPR 2025 main | 0.083±0.129 | 0.125±0.209 | 0.250±0.274 | 0.333±0.376 | 0.333±0.376 | 0.375±0.262 | 0.083±0.129 | 0.125±0.137 |
| EMNLP 2025 main | 0.104±0.094 | 0.250±0.137 | 0.208±0.129 | 0.125±0.137 | 0.167±0.151 | 0.188±0.172 | 0.188±0.172 | 0.354±0.366 |
| ICASSP 2026 regular paper | 0.233±0.279 | 0.167±0.167 | 0.300±0.247 | 0.300±0.342 | 0.300±0.074 | 0.300±0.247 | 0.233±0.279 | 0.233±0.279 |
| ICC 2025 technical symposia | 0.171±0.125 | 0.314±0.339 | 0.314±0.329 | 0.143±0.079 | 0.200±0.208 | 0.314±0.339 | 0.286±0.339 | 0.314±0.339 |
| ICLR 2025 main | 0.184±0.269 | 0.120±0.184 | 0.120±0.139 | 0.145±0.158 | 0.145±0.172 | 0.145±0.210 | 0.094±0.100 | 0.248±0.308 |
| ICLR 2026 main | 0.100±0.200 | 0.225±0.096 | 0.125±0.096 | 0.075±0.096 | 0.125±0.096 | 0.175±0.050 | 0.125±0.096 | 0.375±0.419 |
| ICME 2025 regular | 0.278±0.202 | 0.278±0.228 | 0.250±0.230 | 0.250±0.204 | 0.306±0.245 | 0.250±0.139 | 0.333±0.333 | 0.333±0.333 |
| ICML 2025 main | 0.229±0.166 | 0.146±0.200 | 0.146±0.166 | 0.396±0.421 | 0.312±0.351 | 0.229±0.200 | 0.229±0.200 | 0.229±0.200 |
| ICML 2025 position | 0.429±0.426 | 0.429±0.313 | 0.464±0.304 | 0.536±0.466 | 0.286±0.366 | 0.321±0.345 | 0.357±0.453 | 0.607±0.497 |
| IJCAI 2023 main | 0.407±0.464 | 0.259±0.370 | 0.241±0.381 | 0.500±0.426 | 0.426±0.447 | 0.352±0.394 | 0.500±0.402 | 0.370±0.327 |
| IJCAI 2023 survey | 0.188±0.222 | 0.094±0.129 | 0.125±0.189 | 0.188±0.291 | 0.219±0.160 | 0.219±0.160 | 0.156±0.186 | 0.156±0.186 |
| KDD 2025 research | 0.222±0.172 | 0.222±0.202 | 0.250±0.175 | 0.222±0.202 | 0.306±0.267 | 0.250±0.204 | 0.361±0.371 | 0.389±0.390 |
| NeurIPS 2025 datasets and benchmarks | 0.200±0.219 | 0.200±0.219 | 0.400±0.358 | 0.200±0.126 | 0.333±0.393 | 0.367±0.367 | 0.300±0.374 | 0.433±0.463 |
| NeurIPS 2025 main | 0.182±0.303 | 0.191±0.192 | 0.191±0.192 | 0.118±0.133 | 0.136±0.191 | 0.173±0.110 | 0.327±0.361 | 0.309±0.383 |
| SIGIR 2025 full paper | 0.250±0.158 | 0.167±0.129 | 0.167±0.303 | 0.375±0.345 | 0.208±0.246 | 0.208±0.188 | 0.208±0.102 | 0.208±0.188 |
| TheWebConf 2025 research | 0.111±0.086 | 0.125±0.147 | 0.306±0.375 | 0.125±0.156 | 0.153±0.186 | 0.111±0.114 | 0.125±0.165 | 0.111±0.125 |
| WACV 2026 main | 0.120±0.110 | 0.080±0.110 | 0.120±0.110 | 0.400±0.424 | 0.080±0.179 | 0.160±0.089 | 0.320±0.390 | 0.160±0.167 |

Understanding NMAE

Interpreting the numbers

  • Lower NMAE = better alignment. Top-performing models stay below 0.2.
  • The ± values show standard deviation, reflecting model stability across papers.

How we calculate it

$$\text{NMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| S_{\text{gt}}^{(i)} - S_{\text{pred}}^{(i)} \right|}{S_{\text{max}} - S_{\text{min}}}$$

$N$: the total number of benchmark papers included in the evaluation. This number increases as the benchmark dataset expands and undergoes continuous calibration.

$S_{\text{gt}}^{(i)}$: the ground-truth overall rating of the i-th benchmark paper, representing the consensus quantitative evaluation from official reviews or expert assessments.

$S_{\text{pred}}^{(i)}$: the predicted overall rating produced by our CSPR agent for the same paper.

$S_{\text{max}}$ and $S_{\text{min}}$: the venue-specific upper and lower bounds of the rating scale, respectively (e.g., 5 and 1 for a 1–5 rating system).
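
To make the definition concrete, here is a minimal Python sketch of the NMAE computation, assuming the ground-truth and predicted ratings are available as plain lists on a single venue scale; the function name and example scores are illustrative and not part of our evaluation pipeline.

```python
def nmae(ground_truth, predicted, s_min, s_max):
    """Normalized Mean Absolute Error over a venue-specific rating scale."""
    if len(ground_truth) != len(predicted):
        raise ValueError("rating lists must have equal length")
    scale = s_max - s_min  # venue-specific range, e.g. 5 - 1 = 4
    errors = [abs(gt - pred) / scale for gt, pred in zip(ground_truth, predicted)]
    return sum(errors) / len(errors)

# Example on a 1-5 scale: two of four predictions are off by one point.
print(nmae([4, 3, 5, 2], [4, 4, 4, 2], s_min=1, s_max=5))  # 0.125
```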

Spearman Rank Correlation (SRC) measures how well the predicted ordering of paper ratings aligns with the true ordering. Unlike error-based metrics (e.g., NMAE), Spearman focuses on the relative ranking between papers, making it robust to scale differences and suitable for assessing whether the reviewer model correctly prioritizes stronger papers over weaker ones.

Benchmark in Progress

We are currently updating our dataset and evaluating new models. Please check back soon for the latest benchmark results.

Understanding SRC

What the values mean

A higher Spearman correlation indicates better agreement with the ground-truth ranking:

  • SRC = 1: perfect rank alignment
  • SRC = 0: no rank correlation
  • SRC = -1: perfectly reversed ranking

The formula

$$\text{SRC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$$

$N$: number of benchmark papers

$d_i$: difference between the ground-truth rank and the predicted rank of the i-th paper
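
The rank-difference form above is straightforward to implement directly; the sketch below is a minimal Python illustration that assumes there are no tied ratings (the same assumption the formula makes) and uses invented example values. In practice, a tie-aware implementation such as scipy.stats.spearmanr handles repeated ratings more gracefully.

```python
def spearman_rank_correlation(ground_truth, predicted):
    """SRC via the rank-difference formula (valid when there are no ties)."""
    n = len(ground_truth)

    def ranks(scores):
        # Rank 1 = lowest score; any consistent direction works for SRC.
        order = sorted(range(n), key=lambda i: scores[i])
        r = [0] * n
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    gt_ranks, pred_ranks = ranks(ground_truth), ranks(predicted)
    d_squared = sum((g - p) ** 2 for g, p in zip(gt_ranks, pred_ranks))
    return 1 - (6 * d_squared) / (n * (n**2 - 1))

# Two adjacent papers swapped in the predicted ordering.
print(spearman_rank_correlation([1, 2, 3, 4], [1, 3, 2, 4]))  # 0.8
```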

Why ranking matters

As a rank-based statistic, SRC complements NMAE by capturing ordering accuracy. Even when absolute score predictions vary across venues or models, a high Spearman correlation indicates strong reviewer consistency in distinguishing higher- from lower-quality submissions.

Average Word Count (AWC) reflects the level of verbosity and elaboration in the written reviews. While it does not measure correctness or ranking fidelity directly, word count serves as an important proxy for review completeness, depth of reasoning, and overall writing richness.

Average word count per review by conference and model:

| Conference | GPT-5.1 | gemini-2.5-flash | GPT-4.1 | GPT-5 | gemini-2.5-pro | o4-mini | o3 | gemini-3-pro-preview |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AAAI 2022 main technical | 15220 | 14689 | 10706 | 10237 | 14065 | 8629 | 8568 | 5866 |
| AAAI 2024 safe and robust AI | 14989 | 15162 | 9957 | 9401 | 11383 | 6631 | 6666 | 8251 |
| AAAI 2026 main technical | 14575 | 14946 | 12741 | 12503 | 11182 | 12629 | 12993 | 7140 |
| AAMAS 2024 main technical | 18094 | 17820 | 12841 | 14321 | 12581 | 9791 | 7171 | 6580 |
| ACL 2025 main | 19290 | 15780 | 13081 | 14841 | 15263 | 11176 | 12423 | 8057 |
| AISTATS 2025 main | 25039 | 19746 | 13724 | 13670 | 11976 | 12945 | 14673 | 8529 |
| CVPR 2025 main | 16659 | 10806 | 10996 | 11839 | 12440 | 11567 | 10106 | 8389 |
| EMNLP 2025 main | 21176 | 16803 | 14588 | 14238 | 15146 | 5962 | 6358 | 8376 |
| ICASSP 2026 regular paper | 24030 | 20529 | 15627 | 12750 | 13436 | 12360 | 10960 | 9688 |
| ICC 2025 technical symposia | 16534 | 13065 | 11537 | 11070 | 9700 | 6429 | 5828 | 8087 |
| ICLR 2025 main | 15860 | 13561 | 9590 | 9312 | 10638 | 10873 | 4748 | 6651 |
| ICLR 2026 main | 23070 | 14750 | 15739 | 15966 | 11841 | 11384 | 12565 | 7739 |
| ICME 2025 regular | 17639 | 14381 | 14813 | 12279 | 11609 | 6221 | 6520 | 9897 |
| ICML 2025 main | 21814 | 23642 | 17171 | 17317 | 18820 | 8232 | 7405 | 11197 |
| ICML 2025 position | 16074 | 15865 | 13546 | 15015 | 12234 | 3748 | 6228 | 6947 |
| IJCAI 2023 main | 19720 | 18341 | 16542 | 13469 | 13817 | 11915 | 9623 | 9677 |
| IJCAI 2023 survey | 20504 | 10144 | 11254 | 11454 | 10101 | 5736 | 5247 | 5142 |
| KDD 2025 research | 18396 | 12863 | 12067 | 10819 | 11069 | 4714 | 4808 | 7740 |
| NeurIPS 2025 datasets and benchmarks | 19056 | 13947 | 14495 | 12409 | 14505 | 4376 | 5292 | 8957 |
| NeurIPS 2025 main | 21185 | 20061 | 12420 | 12315 | 13745 | 5657 | 5685 | 7582 |
| SIGIR 2025 full paper | 19632 | 17434 | 13771 | 13862 | 14463 | 12300 | 13576 | 7729 |
| TheWebConf 2025 research | 18278 | 12790 | 12569 | 11795 | 11795 | 12562 | 12241 | 6256 |
| WACV 2026 main | 16231 | 13344 | 13592 | 14037 | 12206 | 12771 | 5403 | 10185 |

Understanding AWC

What it tells us

A higher average word count often corresponds to reviews that include more justification, detailed analysis, or structured reasoning. However, excessively long reviews may introduce noise or redundancy, whereas overly short ones may lack sufficient explanation.

How we calculate it

$$\text{AWC} = \frac{1}{N} \sum_{i=1}^{N} \left| \text{Words}\!\left(R_{\text{cspr}}^{(i)}\right) \right|$$

$N$: the total number of benchmark papers included in the evaluation. This number increases as the benchmark dataset expands and undergoes continuous calibration.

$R_{\text{cspr}}^{(i)}$: the full textual review generated for the i-th paper

$|\text{Words}(\cdot)|$: counts the total number of tokenized words in the review
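
As a rough illustration, the sketch below computes AWC with simple whitespace tokenization; the tokenizer used in our pipeline may differ, and the example reviews are invented for demonstration.

```python
def average_word_count(reviews):
    """Average number of whitespace-tokenized words per generated review."""
    return sum(len(review.split()) for review in reviews) / len(reviews)

# Two toy reviews (10 and 8 words respectively).
reviews = [
    "The method is sound but the evaluation omits strong baselines.",
    "Clear writing; the ablation study could be extended.",
]
print(average_word_count(reviews))  # 9.0
```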

Putting it together

The average word count complements accuracy-oriented metrics such as NMAE and SRC by capturing stylistic characteristics of the generated reviews, revealing how concisely or thoroughly different models articulate their reasoning.

Want to learn more?

For further technical details, see our short paper accepted at INLG 2025 (18th International Conference on Natural Language Generation).

Get high-quality reviews powered by the best AI models

We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.

Start your free review