CSPaper Benchmark: AI Models for CS Paper Reviews
We rigorously test leading AI models to ensure you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing the latest LLMs from OpenAI (GPT and o-series) and Google (Gemini) across top-tier computer science conferences.
How we benchmark
Normalized Mean Absolute Error (NMAE) quantifies how close the predicted paper ratings are to the established ground-truth ratings, after accounting for the venue-specific rating scales. It provides an interpretable, scale-agnostic measure of overall review accuracy, where lower NMAE values indicate better alignment between predicted and true review ratings.
| Conference | gemini-2.5-pro | gemini-3-flash-preview | gemini-3-pro-preview | GPT-4.1 | GPT-5 | o3 | GPT-5.1 | gemini-2.5-flash | GPT-5.2 |
|---|---|---|---|---|---|---|---|---|---|
| AAAI main technical | 0.102±0.125 | 0.134±0.108 | 0.112±0.111 | 0.112±0.108 | 0.114±0.114 | 0.115±0.117 | 0.148±0.132 | 0.105±0.104 | 0.200±0.172 |
| AAAI safe and robust AI | 0.246±0.282 | 0.190±0.312 | 0.262±0.294 | 0.175±0.209 | 0.167±0.228 | 0.270±0.258 | 0.270±0.269 | 0.167±0.235 | 0.325±0.234 |
| AAMAS main technical | 0.162±0.269 | 0.155±0.236 | 0.172±0.260 | 0.197±0.210 | 0.259±0.255 | 0.252±0.252 | 0.266±0.283 | 0.190±0.271 | 0.328±0.303 |
| ACL main | 0.153±0.135 | 0.204±0.153 | 0.181±0.132 | 0.162±0.115 | 0.153±0.159 | 0.130±0.105 | 0.218±0.226 | 0.194±0.210 | 0.208±0.214 |
| AISTATS main | 0.241±0.215 | 0.208±0.211 | 0.182±0.204 | 0.252±0.229 | 0.261±0.218 | 0.251±0.207 | 0.261±0.218 | 0.295±0.237 | 0.390±0.320 |
| CVPR main | 0.214±0.248 | 0.205±0.232 | 0.185±0.243 | 0.209±0.233 | 0.167±0.223 | 0.228±0.249 | 0.272±0.260 | 0.276±0.257 | 0.271±0.258 |
| EMNLP main | 0.150±0.127 | 0.158±0.153 | 0.133±0.100 | 0.150±0.118 | 0.108±0.104 | 0.200±0.199 | 0.183±0.114 | 0.167±0.168 | 0.250±0.295 |
| ICASSP regular paper | 0.261±0.200 | 0.232±0.186 | 0.203±0.207 | 0.217±0.204 | 0.203±0.219 | 0.246±0.251 | 0.290±0.252 | 0.232±0.223 | 0.290±0.252 |
| ICC technical symposia | 0.313±0.272 | 0.380±0.310 | 0.273±0.249 | 0.393±0.260 | 0.313±0.261 | 0.367±0.269 | 0.300±0.267 | 0.313±0.292 | 0.313±0.277 |
| ICLR main | 0.108±0.107 | 0.148±0.130 | 0.135±0.123 | 0.108±0.091 | 0.119±0.108 | 0.140±0.107 | 0.104±0.093 | 0.135±0.113 | 0.133±0.124 |
| ICME regular | 0.276±0.199 | 0.120±0.155 | 0.141±0.182 | 0.141±0.113 | 0.297±0.293 | 0.287±0.254 | 0.182±0.167 | 0.234±0.200 | 0.339±0.323 |
| ICML main | 0.203±0.138 | 0.270±0.212 | 0.194±0.147 | 0.134±0.095 | 0.174±0.170 | 0.185±0.175 | 0.154±0.115 | 0.219±0.181 | 0.179±0.133 |
| ICML position | 0.192±0.163 | 0.215±0.153 | 0.243±0.157 | 0.163±0.121 | 0.179±0.171 | 0.251±0.163 | 0.235±0.238 | 0.200±0.148 | 0.350±0.266 |
| IJCAI main | 0.135±0.117 | 0.079±0.119 | 0.103±0.119 | 0.095±0.184 | 0.182±0.142 | 0.230±0.216 | 0.222±0.169 | 0.198±0.125 | 0.222±0.144 |
| IJCAI survey | 0.094±0.129 | 0.188±0.222 | 0.125±0.189 | 0.219±0.160 | 0.188±0.222 | 0.156±0.186 | 0.188±0.291 | 0.219±0.160 | 0.125±0.189 |
| KDD research | 0.167±0.180 | 0.179±0.173 | 0.205±0.139 | 0.205±0.169 | 0.218±0.208 | 0.180±0.220 | 0.256±0.251 | 0.244±0.370 | 0.321±0.250 |
| NeurIPS datasets and benchmarks | 0.200±0.230 | 0.181±0.229 | 0.204±0.223 | 0.183±0.217 | 0.183±0.252 | 0.316±0.326 | 0.275±0.214 | 0.295±0.308 | 0.565±0.349 |
| NeurIPS main | 0.189±0.184 | 0.205±0.177 | 0.232±0.191 | 0.183±0.150 | 0.256±0.293 | 0.285±0.303 | 0.218±0.199 | 0.245±0.215 | 0.255±0.239 |
| SIGIR full paper | 0.025±0.079 | 0.175±0.237 | 0.150±0.242 | 0.175±0.237 | 0.125±0.177 | 0.175±0.206 | 0.250±0.333 | 0.125±0.177 | 0.325±0.472 |
| TheWebConf research | 0.139±0.138 | 0.143±0.101 | 0.127±0.089 | 0.113±0.087 | 0.135±0.115 | 0.106±0.066 | 0.122±0.073 | 0.156±0.113 | 0.205±0.220 |
| WACV main | 0.255±0.221 | 0.182±0.140 | 0.164±0.121 | 0.164±0.175 | 0.218±0.140 | 0.182±0.166 | 0.291±0.207 | 0.309±0.226 | 0.364±0.175 |
Understanding NMAE
Interpreting the numbers
- Lower NMAE = better alignment. Top-performing models typically stay below 0.2 on most venues.
- The ± values show one standard deviation, reflecting model stability across papers.
How we calculate it
$$\mathrm{NMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert \hat{y}_i - y_i \rvert}{R_{\max} - R_{\min}}$$

- $N$: the total number of benchmark papers included in the evaluation. This number increases as the benchmark dataset expands and undergoes continuous calibration.
- $y_i$: the ground-truth overall rating of the i-th benchmark paper, representing the consensus quantitative evaluation from official or expert evaluations.
- $\hat{y}_i$: the predicted overall rating produced by our CSPR agent for the same paper.
- $R_{\max}$ and $R_{\min}$: the venue-specific upper and lower bounds of the rating scale, respectively (e.g., 5 and 1 for a 1–5 rating system).
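For illustration, here is a minimal Python sketch of the NMAE computation. The ratings and the 1–10 scale below are hypothetical examples, not values from the benchmark itself.

```python
import numpy as np

def nmae(y_true, y_pred, r_min, r_max):
    """Normalized Mean Absolute Error over a venue's rating scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Mean absolute rating error, normalized by the width of the rating scale
    return np.mean(np.abs(y_pred - y_true)) / (r_max - r_min)

# Illustrative only: three papers rated on a hypothetical 1-10 scale
ground_truth = [6, 3, 8]
predicted    = [5, 4, 8]
print(nmae(ground_truth, predicted, r_min=1, r_max=10))  # ~0.074
```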
Spearman Rank Correlation (SRC) measures how well the predicted ordering of paper ratings aligns with the true ordering. Unlike error-based metrics (e.g., NMAE), Spearman focuses on the relative ranking between papers, making it robust to scale differences and suitable for assessing whether the reviewer model correctly prioritizes stronger papers over weaker ones.
Baseline SRC scores from the ICLR 2025 dataset: AgentReview (SRC = 0.12), AI Scientist (SRC = 0.32), CycleReviewer (SRC = 0.28), DeepReviewer (SRC = 0.40), Human Reviewer (SRC = 0.41). Because these baselines were measured on the ICLR 2025 dataset only, they are not directly comparable to our per-venue SRC values, but they provide a useful point of reference.
| Conference | gemini-3-pro-preview | gemini-2.5-pro | GPT-5 | o3 | gemini-3-flash-preview | GPT-4.1 | gemini-2.5-flash | GPT-5.2 | GPT-5.1 |
|---|---|---|---|---|---|---|---|---|---|
| AAAI main technical | 0.779 | 0.865 | 0.762 | 0.698 | 0.803 | 0.711 | 0.806 | 0.696 | 0.690 |
| AAAI safe and robust AI | 0.250 | 0.436 | 0.641 | 0.573 | 0.273 | 0.591 | 0.583 | 0.704 | 0.524 |
| AAMAS main technical | 0.635 | 0.731 | 0.703 | 0.675 | 0.724 | 0.716 | 0.725 | 0.601 | 0.644 |
| ACL main | 0.403 | 0.404 | 0.254 | 0.454 | 0.291 | 0.351 | 0.247 | 0.343 | 0.087 |
| AISTATS main | 0.713 | 0.410 | 0.132 | 0.397 | 0.554 | 0.483 | 0.229 | 0.149 | 0.224 |
| CVPR main | 0.711 | 0.531 | 0.669 | 0.355 | 0.638 | 0.452 | 0.344 | 0.659 | 0.312 |
| EMNLP main | 0.830 | 0.716 | 0.597 | 0.033 | 0.577 | 0.434 | 0.607 | 0.555 | 0.258 |
| ICASSP regular paper | 0.415 | 0.226 | 0.513 | 0.149 | 0.278 | 0.387 | 0.276 | -0.024 | 0.066 |
| ICC technical symposia | 0.653 | 0.572 | 0.566 | 0.308 | 0.411 | 0.292 | 0.463 | 0.618 | 0.615 |
| ICLR main | 0.713 | 0.773 | 0.712 | 0.540 | 0.708 | 0.788 | 0.610 | 0.601 | 0.718 |
| ICME regular | 0.570 | -0.274 | 0.222 | 0.542 | 0.324 | 0.009 | 0.208 | 0.553 | 0.257 |
| ICML main | 0.727 | 0.638 | 0.764 | 0.736 | 0.712 | 0.812 | 0.593 | 0.709 | 0.766 |
| ICML position | 0.496 | 0.427 | 0.505 | 0.232 | 0.440 | 0.479 | 0.369 | 0.091 | 0.469 |
| IJCAI main | 0.681 | 0.553 | 0.518 | 0.292 | 0.687 | 0.507 | 0.275 | 0.683 | 0.465 |
| IJCAI survey | 0.882 | 0.949 | 0.546 | 0.770 | 0.882 | 0.629 | 0.629 | 0.862 | 0.624 |
| KDD research | 0.726 | 0.536 | 0.631 | 0.834 | 0.609 | 0.510 | 0.108 | 0.549 | 0.549 |
| NeurIPS datasets and benchmarks | 0.430 | 0.310 | 0.494 | -0.021 | 0.397 | 0.458 | 0.343 | 0.225 | 0.169 |
| NeurIPS main | 0.645 | 0.608 | 0.556 | 0.371 | 0.675 | 0.672 | 0.415 | 0.484 | 0.465 |
| SIGIR full paper | 0.718 | 0.981 | 0.806 | 0.744 | 0.771 | 0.832 | 0.793 | 0.841 | 0.786 |
| TheWebConf research | 0.177 | 0.254 | 0.039 | 0.226 | 0.240 | 0.062 | 0.271 | 0.257 | 0.038 |
| WACV main | 0.764 | 0.261 | 0.613 | 0.783 | 0.765 | 0.530 | 0.470 | 0.707 | 0.461 |
Understanding SRC
What the values mean
A higher Spearman correlation indicates better agreement with the ground-truth ranking:
- SRC = 1: perfect rank alignment
- SRC = 0: no rank correlation
- SRC = -1: perfectly reversed ranking
The formula
$$\mathrm{SRC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$$

- $N$: number of benchmark papers
- $d_i$: difference between ground-truth rank and predicted rank for the i-th paper
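Below is a minimal Python sketch of the same computation, once via `scipy.stats.spearmanr` and once via the closed-form rank-difference formula above. The ratings are hypothetical, and the closed-form version assumes no tied ranks (the benchmark's actual handling of ties may differ).

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_src(y_true, y_pred):
    """Spearman rank correlation between true and predicted ratings."""
    rho, _ = spearmanr(y_true, y_pred)
    return rho

def spearman_from_ranks(y_true, y_pred):
    """Closed-form rank-difference formula; valid when there are no ties."""
    n = len(y_true)
    rank_true = np.argsort(np.argsort(y_true))
    rank_pred = np.argsort(np.argsort(y_pred))
    d = rank_true - rank_pred
    return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

# Illustrative only: four papers with hypothetical ratings
ground_truth = [6, 3, 8, 5]
predicted    = [5, 4, 8, 6]
print(spearman_src(ground_truth, predicted))        # 0.8
print(spearman_from_ranks(ground_truth, predicted)) # 0.8
```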
Why ranking matters
As a rank-based statistic, SRC complements NMAE by capturing ordering accuracy. Even when absolute score predictions vary across venues or models, a high Spearman correlation indicates strong reviewer consistency in distinguishing higher- from lower-quality submissions.
Average Word Count (AWC) reflects the level of verbosity and elaboration in the written reviews. While it does not measure correctness or ranking fidelity directly, word count serves as an important proxy for review completeness, depth of reasoning, and overall writing richness.
| Conference | GPT-5.1 | gemini-2.5-flash | GPT-4.1 | GPT-5 | gemini-2.5-pro | GPT-5.2 | gemini-3-flash-preview | gemini-3-pro-preview | o3 |
|---|---|---|---|---|---|---|---|---|---|
| AAAI main technical | 20815 | 14922 | 13527 | 13230 | 11408 | 11351 | 8152 | 7360 | 6116 |
| AAAI safe and robust AI | 15492 | 14207 | 12287 | 12723 | 11950 | 7824 | 8739 | 8065 | 4940 |
| AAMAS main technical | 13088 | 11892 | 11156 | 10300 | 11279 | 6942 | 8501 | 6468 | 5700 |
| ACL main | 24698 | 18135 | 15871 | 15924 | 15323 | 14821 | 11245 | 9920 | 8603 |
| AISTATS main | 24427 | 14031 | 11192 | 12610 | 10760 | 10322 | 7704 | 7693 | 8131 |
| CVPR main | 18495 | 12736 | 11652 | 12646 | 11513 | 11750 | 8585 | 8992 | 6348 |
| EMNLP main | 23275 | 16097 | 14541 | 14883 | 15230 | 11584 | 10157 | 8960 | 6410 |
| ICASSP regular paper | 25231 | 21402 | 15939 | 13870 | 18372 | 14725 | 9570 | 10538 | 8573 |
| ICC technical symposia | 14676 | 12047 | 10429 | 9317 | 9814 | 9156 | 8422 | 7171 | 5238 |
| ICLR main | 21264 | 15014 | 13469 | 14838 | 11435 | 13245 | 8736 | 7585 | 6613 |
| ICME regular | 19151 | 15580 | 14878 | 12187 | 12406 | 9219 | 9838 | 10077 | 6326 |
| ICML main | 32594 | 25998 | 18605 | 17033 | 18900 | 18625 | 12798 | 12395 | 8086 |
| ICML position | 25153 | 16330 | 17473 | 14820 | 14098 | 13355 | 10946 | 8204 | 8095 |
| IJCAI main | 22570 | 17675 | 16075 | 15851 | 13913 | 13600 | 9832 | 9777 | 8069 |
| IJCAI survey | 20498 | 10085 | 11142 | 11464 | 10101 | 9535 | 7210 | 5149 | 5238 |
| KDD research | 18561 | 13719 | 13340 | 12681 | 12578 | 9870 | 8629 | 8107 | 6450 |
| NeurIPS datasets and benchmarks | 20990 | 17502 | 14050 | 14466 | 13300 | 8839 | 9713 | 9607 | 6580 |
| NeurIPS main | 21570 | 18202 | 15011 | 13016 | 14393 | 12958 | 9740 | 8772 | 5893 |
| SIGIR full paper | 19617 | 14712 | 14589 | 15868 | 14587 | 9295 | 8903 | 8715 | 6635 |
| TheWebConf research | 21985 | 16182 | 15271 | 14163 | 12466 | 12932 | 9050 | 7916 | 6618 |
| WACV main | 19254 | 12439 | 12446 | 12359 | 11525 | 12898 | 9063 | 9908 | 6805 |
Understanding AWC
What it tells us
A higher average word count often corresponds to reviews that include more justification, detailed analysis, or structured reasoning. However, excessively long reviews may introduce noise or redundancy, whereas overly short ones may lack sufficient explanation.
How we calculate it

$$\mathrm{AWC} = \frac{1}{N} \sum_{i=1}^{N} \lvert \mathrm{Words}(r_i) \rvert$$

- $N$: the total number of benchmark papers included in the evaluation. This number increases as the benchmark dataset expands and undergoes continuous calibration.
- $r_i$: the full textual review generated for the i-th paper
- $\lvert \mathrm{Words}(\cdot) \rvert$: counts the total number of tokenized words in the review
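A minimal Python sketch of this computation is shown below, assuming a simple whitespace tokenizer; the benchmark's actual tokenization may differ, and the example reviews are hypothetical.

```python
def average_word_count(reviews):
    """Average number of whitespace-tokenized words across generated reviews."""
    # Simple whitespace split; the benchmark may use a different tokenizer.
    return sum(len(review.split()) for review in reviews) / len(reviews)

# Illustrative only: two toy reviews
reviews = [
    "The method is sound but the evaluation lacks baselines.",
    "Strong results; however, the ablation study is incomplete.",
]
print(average_word_count(reviews))  # 8.5
```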
Putting it together
The average word count complements accuracy-oriented metrics such as NMAE and SRC by capturing stylistic characteristics of the generated reviews, revealing how concisely or thoroughly different models articulate their reasoning.
Want to learn more?
For further technical details, see our short paper accepted at INLG 2025 (18th International Conference on Natural Language Generation).
Get high-quality reviews powered by the best AI models
We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.
Start your free review