AI Model Benchmark for CSPaper Reviews

We rigorously test leading AI models to ensure you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing OpenAI's GPT-4.1, GPT-5, o3, and o4-mini and Google's Gemini 2.5 Flash and Pro across various top-tier computer science conferences.

Try it yourself: 3 free reviews/month

Privacy and security first

CSPaper ensures all papers remain confidential and legally protected. Content is processed securely and safeguarded under enterprise-level agreements with LLM providers — guaranteeing it is never used to train any model.

Average NMAE by Conference and Model

This table shows the average NMAE (Normalized Mean Absolute Error) for each conference across different models. Lower values indicate better alignment with ground truth.

| Conference | OpenAI GPT-4.1 | OpenAI GPT-5 | OpenAI o3 | OpenAI o4-mini | Gemini 2.5 Flash | Gemini 2.5 Pro |
|---|---|---|---|---|---|---|
| AAAI 2026 main technical | 0.085±0.129 | 0.085±0.080 | 0.196±0.265 | 0.154±0.160 | 0.256±0.270 | 0.128±0.268 |
| ICASSP 2026 regular paper | 0.300±0.247 | 0.233±0.279 | 0.233±0.279 | 0.233±0.279 | 0.100±0.149 | 0.167±0.167 |
| ICLR 2026 main | 0.050±0.058 | 0.100±0.200 | 0.125±0.096 | 0.375±0.419 | 0.125±0.096 | 0.225±0.096 |
| WACV 2026 main | 0.160±0.089 | 0.120±0.110 | 0.040±0.089 | 0.160±0.167 | 0.040±0.089 | 0.080±0.110 |
| ACL 2025 main | 0.229±0.166 | 0.083±0.151 | 0.125±0.137 | 0.188±0.190 | 0.354±0.357 | 0.167±0.151 |
| AISTATS 2025 main | 0.200±0.112 | 0.200±0.209 | 0.300±0.209 | 0.200±0.137 | 0.200±0.112 | 0.150±0.209 |
| CVPR 2025 main | 0.375±0.262 | 0.083±0.129 | 0.083±0.129 | 0.125±0.137 | 0.333±0.376 | 0.125±0.209 |
| EMNLP 2025 main | 0.188±0.172 | 0.104±0.094 | 0.188±0.172 | 0.354±0.366 | 0.167±0.151 | 0.250±0.137 |
| ICLR 2025 main | 0.111±0.111 | 0.184±0.269 | 0.094±0.100 | 0.214±0.266 | 0.111±0.120 | 0.085±0.129 |
| ICML 2025 main | 0.229±0.200 | 0.229±0.166 | 0.104±0.123 | 0.229±0.200 | 0.312±0.351 | 0.146±0.200 |
| ICML 2025 position | 0.143±0.197 | 0.321±0.345 | 0.250±0.354 | 0.464±0.509 | 0.143±0.197 | 0.286±0.225 |
| KDD 2025 research | 0.250±0.204 | 0.222±0.172 | 0.361±0.371 | 0.389±0.390 | 0.306±0.267 | 0.222±0.202 |
| NeurIPS 2025 datasets and benchmarks | 0.167±0.234 | 0.200±0.219 | 0.300±0.374 | 0.433±0.463 | 0.333±0.393 | 0.200±0.219 |
| NeurIPS 2025 main | 0.173±0.110 | 0.182±0.303 | 0.327±0.361 | 0.309±0.383 | 0.136±0.191 | 0.191±0.192 |
| SIGIR 2025 full paper | 0.208±0.188 | 0.250±0.158 | 0.208±0.102 | 0.208±0.188 | 0.208±0.246 | 0.167±0.129 |
| TheWebConf 2025 research | 0.111±0.114 | 0.111±0.086 | 0.125±0.165 | 0.111±0.125 | 0.153±0.186 | 0.125±0.147 |
| AAAI 2024 safe and robust AI | 0.089±0.122 | 0.044±0.061 | 0.067±0.099 | 0.111±0.111 | 0.044±0.061 | 0.044±0.099 |
| AAMAS 2024 main technical | 0.037±0.057 | 0.130±0.109 | 0.111±0.099 | 0.111±0.070 | 0.093±0.130 | 0.037±0.057 |
| IJCAI 2023 main | 0.222±0.233 | 0.278±0.363 | 0.370±0.327 | 0.241±0.109 | 0.296±0.349 | 0.130±0.084 |
| AAAI 2022 main technical | 0.175±0.239 | 0.079±0.106 | 0.111±0.143 | 0.159±0.155 | 0.159±0.180 | 0.143±0.166 |

Understanding our main benchmarking metric: NMAE

What is NMAE?

The table reports Normalized Mean Absolute Error (NMAE)—a scale-independent measure of how closely our AI-generated review scores match the ground-truth ratings from conference reviewers. A value of 0 means perfect agreement; higher numbers indicate larger deviations.

Interpreting the numbers

  • Lower NMAE = better alignment. In the table above, the best-performing model for each conference achieves an NMAE between roughly 0.04 and 0.22.
  • The ± values capture the standard deviation over multiple benchmark papers, reflecting how stable the model is from paper to paper.

Formula and components

$$\text{NMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|S_{\text{gt}}^{(i)} - S_{\text{pred}}^{(i)}|}{S_{\text{max}} - S_{\text{min}}}$$

Terms

  • $N$: number of benchmark papers
  • $S_{\text{gt}}^{(i)}$: ground-truth rating for paper $i$
  • $S_{\text{pred}}^{(i)}$: AI-predicted rating for paper $i$
  • $S_{\text{max}}$ and $S_{\text{min}}$: upper and lower bounds of the venue's rating scale
  • Normalizing by the rating range lets us compare conferences with different scoring rubrics on a single scale.
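The formula translates directly into a few lines of Python. The sketch below is illustrative only (it is not CSPaper's actual pipeline), and it assumes the ± values in the table are the sample standard deviation of the per-paper normalized errors:

```python
import statistics

def nmae_with_std(gt, pred, s_min, s_max):
    """Normalized Mean Absolute Error and its paper-to-paper spread.

    gt, pred     -- ground-truth and AI-predicted ratings, one per paper
    s_min, s_max -- bounds of the venue's rating scale
    """
    if len(gt) != len(pred) or len(gt) < 2:
        raise ValueError("gt and pred must be equal-length with >= 2 papers")
    scale = s_max - s_min
    # Per-paper absolute error, normalized by the rating range.
    errs = [abs(g - p) / scale for g, p in zip(gt, pred)]
    # Sample std dev over papers -- assumed to match the table's ± column.
    return statistics.mean(errs), statistics.stdev(errs)

# Example: four papers on a hypothetical 1-10 rating scale.
gt_scores = [6, 8, 3, 5]   # ground-truth reviewer ratings
ai_scores = [5, 8, 4, 6]   # AI-predicted ratings
mean_err, spread = nmae_with_std(gt_scores, ai_scores, 1, 10)
print(f"{mean_err:.3f} ± {spread:.3f}")  # → 0.083 ± 0.056
```

Because each error is divided by the venue's rating range, a one-point miss on a 1–10 scale counts the same as a half-point miss on a 1–5 scale, which is what makes the per-conference numbers in the table comparable.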

Want to learn more?

For further technical details, see our short paper accepted at INLG 2025 (18th International Conference on Natural Language Generation).

Get high-quality reviews powered by the best AI models

We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.

Start your free review