AI Model Benchmark for CSPaper Reviews
We rigorously test leading AI models to ensure you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing OpenAI's GPT-4.1, GPT-5, o3, and o4-mini with Google's Gemini 2.5 Flash and Pro across various top-tier computer science conferences.
Average NMAE by Conference and Model
This table shows the average NMAE (Normalized Mean Absolute Error) for each conference across different models. Lower values indicate better alignment with ground truth.
| Conference | GPT-4.1 | GPT-5 | o3 | o4-mini | gemini-2.5-flash | gemini-2.5-pro |
|---|---|---|---|---|---|---|
| AAAI 2026 main technical | 0.085±0.129 | 0.085±0.080 | 0.196±0.265 | 0.154±0.160 | 0.256±0.270 | 0.128±0.268 |
| ICASSP 2026 regular paper | 0.300±0.247 | 0.233±0.279 | 0.233±0.279 | 0.233±0.279 | 0.100±0.149 | 0.167±0.167 |
| ICLR 2026 main | 0.050±0.058 | 0.100±0.200 | 0.125±0.096 | 0.375±0.419 | 0.125±0.096 | 0.225±0.096 |
| WACV 2026 main | 0.160±0.089 | 0.120±0.110 | 0.040±0.089 | 0.160±0.167 | 0.040±0.089 | 0.080±0.110 |
| ACL 2025 main | 0.229±0.166 | 0.083±0.151 | 0.125±0.137 | 0.188±0.190 | 0.354±0.357 | 0.167±0.151 |
| AISTATS 2025 main | 0.200±0.112 | 0.200±0.209 | 0.300±0.209 | 0.200±0.137 | 0.200±0.112 | 0.150±0.209 |
| CVPR 2025 main | 0.375±0.262 | 0.083±0.129 | 0.083±0.129 | 0.125±0.137 | 0.333±0.376 | 0.125±0.209 |
| EMNLP 2025 main | 0.188±0.172 | 0.104±0.094 | 0.188±0.172 | 0.354±0.366 | 0.167±0.151 | 0.250±0.137 |
| ICLR 2025 main | 0.111±0.111 | 0.184±0.269 | 0.094±0.100 | 0.214±0.266 | 0.111±0.120 | 0.085±0.129 |
| ICML 2025 main | 0.229±0.200 | 0.229±0.166 | 0.104±0.123 | 0.229±0.200 | 0.312±0.351 | 0.146±0.200 |
| ICML 2025 position | 0.143±0.197 | 0.321±0.345 | 0.250±0.354 | 0.464±0.509 | 0.143±0.197 | 0.286±0.225 |
| KDD 2025 research | 0.250±0.204 | 0.222±0.172 | 0.361±0.371 | 0.389±0.390 | 0.306±0.267 | 0.222±0.202 |
| NeurIPS 2025 datasets and benchmarks | 0.167±0.234 | 0.200±0.219 | 0.300±0.374 | 0.433±0.463 | 0.333±0.393 | 0.200±0.219 |
| NeurIPS 2025 main | 0.173±0.110 | 0.182±0.303 | 0.327±0.361 | 0.309±0.383 | 0.136±0.191 | 0.191±0.192 |
| SIGIR 2025 full paper | 0.208±0.188 | 0.250±0.158 | 0.208±0.102 | 0.208±0.188 | 0.208±0.246 | 0.167±0.129 |
| TheWebConf 2025 research | 0.111±0.114 | 0.111±0.086 | 0.125±0.165 | 0.111±0.125 | 0.153±0.186 | 0.125±0.147 |
| AAAI 2024 safe and robust AI | 0.089±0.122 | 0.044±0.061 | 0.067±0.099 | 0.111±0.111 | 0.044±0.061 | 0.044±0.099 |
| AAMAS 2024 main technical | 0.037±0.057 | 0.130±0.109 | 0.111±0.099 | 0.111±0.070 | 0.093±0.130 | 0.037±0.057 |
| IJCAI 2023 main | 0.222±0.233 | 0.278±0.363 | 0.370±0.327 | 0.241±0.109 | 0.296±0.349 | 0.130±0.084 |
| AAAI 2022 main technical | 0.175±0.239 | 0.079±0.106 | 0.111±0.143 | 0.159±0.155 | 0.159±0.180 | 0.143±0.166 |
Understanding our main benchmarking metric: NMAE
What is NMAE?
The table reports Normalized Mean Absolute Error (NMAE)—a scale-independent measure of how closely our AI-generated review scores match the ground-truth ratings from conference reviewers. A value of 0 means perfect agreement; higher numbers indicate larger deviations.
Interpreting the numbers
- Lower NMAE = better alignment with reviewer ratings. For most conferences in the table, the best-performing model achieves an NMAE below 0.15, and often well under 0.10 (a worked example follows this list).
- The ± values capture the standard deviation over multiple benchmark papers, reflecting how stable the model is from paper to paper.
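For intuition, here is a quick back-of-the-envelope reading of these values (the 1–10 scale is a hypothetical example, not a specific venue's rubric): an NMAE of 0.10 on a scale running from 1 to 10 corresponds to an average prediction error of roughly one rating point.

$$0.10 \times (10 - 1) = 0.9 \ \text{rating points on average}$$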
Formula and components

$$\mathrm{NMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert y_i - \hat{y}_i \rvert}{r_{\max} - r_{\min}}$$

Terms
- $N$: number of benchmark papers
- $y_i$: ground-truth rating for paper $i$
- $\hat{y}_i$: AI-predicted rating for paper $i$
- $r_{\max}$ and $r_{\min}$: upper and lower bounds of the venue's rating scale
- Dividing by $r_{\max} - r_{\min}$ normalizes the error, letting us compare conferences with different scoring rubrics on a single scale.
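To make the computation concrete, here is a minimal Python sketch of the metric. The function name, the example ratings, and the 1–10 scale are illustrative assumptions, not part of our benchmark code.

```python
from typing import Sequence

def nmae(y_true: Sequence[float], y_pred: Sequence[float],
         r_min: float, r_max: float) -> float:
    """Normalized Mean Absolute Error: the mean absolute difference between
    ground-truth and predicted ratings, divided by the width of the venue's
    rating scale (r_max - r_min)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    scale = r_max - r_min
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / (len(errors) * scale)

# Illustrative only: three papers on a hypothetical 1-10 rating scale,
# with made-up ground-truth reviewer scores and AI-predicted scores.
ground_truth = [6.0, 4.0, 8.0]
predicted = [7.0, 4.5, 7.0]
print(f"NMAE = {nmae(ground_truth, predicted, r_min=1, r_max=10):.3f}")
# NMAE = 0.093  (an average error of ~0.83 points on a scale 9 points wide)
```

The per-conference values in the table are this quantity averaged over the benchmark papers for that venue, with the ± figure giving the standard deviation across those papers.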
Want to learn more?
For further technical details, see our short paper accepted at INLG 2025 (18th International Conference on Natural Language Generation).
Get high-quality reviews powered by the best AI models
We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.
Start your free review