CSPaper Benchmark: AI Models for CS Paper Reviews
We rigorously test leading AI models to ensure you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing OpenAI's GPT-5.1, GPT-5, GPT-4.1, o3, and o4-mini and Google's Gemini 3 Pro (preview), Gemini 2.5 Pro, and Gemini 2.5 Flash across a range of top-tier computer science conferences.
How we benchmark
Normalized Mean Absolute Error (NMAE) quantifies how close the predicted paper ratings are to the established ground-truth ratings, after accounting for the venue-specific rating scales. It provides an interpretable, scale-agnostic measure of overall review accuracy, where lower NMAE values indicate better alignment between predicted and true review ratings.
| Conference | GPT-5 | gemini-2.5-pro | gemini-3-pro-preview | GPT-5.1 | gemini-2.5-flash | GPT-4.1 | o3 | o4-mini |
|---|---|---|---|---|---|---|---|---|
| AAAI 2022 main technical | 0.079±0.106 | 0.143±0.166 | 0.048±0.059 | 0.175±0.239 | 0.159±0.180 | 0.175±0.239 | 0.111±0.143 | 0.159±0.155 |
| AAAI 2024 safe and robust AI | 0.044±0.061 | 0.044±0.099 | 0.267±0.420 | 0.089±0.145 | 0.044±0.061 | 0.089±0.122 | 0.067±0.099 | 0.111±0.111 |
| AAAI 2026 main technical | 0.085±0.080 | 0.128±0.268 | 0.077±0.095 | 0.385±0.444 | 0.256±0.270 | 0.085±0.129 | 0.196±0.265 | 0.154±0.160 |
| AAMAS 2024 main technical | 0.130±0.109 | 0.037±0.057 | 0.074±0.057 | 0.093±0.109 | 0.093±0.130 | 0.037±0.057 | 0.111±0.099 | 0.111±0.070 |
| ACL 2025 main | 0.083±0.151 | 0.167±0.151 | 0.125±0.158 | 0.167±0.219 | 0.354±0.357 | 0.229±0.166 | 0.125±0.137 | 0.188±0.190 |
| AISTATS 2025 main | 0.200±0.209 | 0.200±0.209 | 0.100±0.137 | 0.200±0.209 | 0.200±0.112 | 0.200±0.112 | 0.300±0.209 | 0.150±0.137 |
| CVPR 2025 main | 0.083±0.129 | 0.125±0.209 | 0.250±0.274 | 0.333±0.376 | 0.333±0.376 | 0.375±0.262 | 0.083±0.129 | 0.125±0.137 |
| EMNLP 2025 main | 0.104±0.094 | 0.250±0.137 | 0.208±0.129 | 0.125±0.137 | 0.167±0.151 | 0.188±0.172 | 0.188±0.172 | 0.354±0.366 |
| ICASSP 2026 regular paper | 0.233±0.279 | 0.167±0.167 | 0.300±0.247 | 0.300±0.342 | 0.300±0.074 | 0.300±0.247 | 0.233±0.279 | 0.233±0.279 |
| ICC 2025 technical symposia | 0.171±0.125 | 0.314±0.339 | 0.314±0.329 | 0.143±0.079 | 0.200±0.208 | 0.314±0.339 | 0.286±0.339 | 0.314±0.339 |
| ICLR 2025 main | 0.184±0.269 | 0.120±0.184 | 0.120±0.139 | 0.145±0.158 | 0.145±0.172 | 0.145±0.210 | 0.094±0.100 | 0.248±0.308 |
| ICLR 2026 main | 0.100±0.200 | 0.225±0.096 | 0.125±0.096 | 0.075±0.096 | 0.125±0.096 | 0.175±0.050 | 0.125±0.096 | 0.375±0.419 |
| ICME 2025 regular | 0.278±0.202 | 0.278±0.228 | 0.250±0.230 | 0.250±0.204 | 0.306±0.245 | 0.250±0.139 | 0.333±0.333 | 0.333±0.333 |
| ICML 2025 main | 0.229±0.166 | 0.146±0.200 | 0.146±0.166 | 0.396±0.421 | 0.312±0.351 | 0.229±0.200 | 0.229±0.200 | 0.229±0.200 |
| ICML 2025 position | 0.429±0.426 | 0.429±0.313 | 0.464±0.304 | 0.536±0.466 | 0.286±0.366 | 0.321±0.345 | 0.357±0.453 | 0.607±0.497 |
| IJCAI 2023 main | 0.407±0.464 | 0.259±0.370 | 0.241±0.381 | 0.500±0.426 | 0.426±0.447 | 0.352±0.394 | 0.500±0.402 | 0.370±0.327 |
| IJCAI 2023 survey | 0.188±0.222 | 0.094±0.129 | 0.125±0.189 | 0.188±0.291 | 0.219±0.160 | 0.219±0.160 | 0.156±0.186 | 0.156±0.186 |
| KDD 2025 research | 0.222±0.172 | 0.222±0.202 | 0.250±0.175 | 0.222±0.202 | 0.306±0.267 | 0.250±0.204 | 0.361±0.371 | 0.389±0.390 |
| NeurIPS 2025 datasets and benchmarks | 0.200±0.219 | 0.200±0.219 | 0.400±0.358 | 0.200±0.126 | 0.333±0.393 | 0.367±0.367 | 0.300±0.374 | 0.433±0.463 |
| NeurIPS 2025 main | 0.182±0.303 | 0.191±0.192 | 0.191±0.192 | 0.118±0.133 | 0.136±0.191 | 0.173±0.110 | 0.327±0.361 | 0.309±0.383 |
| SIGIR 2025 full paper | 0.250±0.158 | 0.167±0.129 | 0.167±0.303 | 0.375±0.345 | 0.208±0.246 | 0.208±0.188 | 0.208±0.102 | 0.208±0.188 |
| TheWebConf 2025 research | 0.111±0.086 | 0.125±0.147 | 0.306±0.375 | 0.125±0.156 | 0.153±0.186 | 0.111±0.114 | 0.125±0.165 | 0.111±0.125 |
| WACV 2026 main | 0.120±0.110 | 0.080±0.110 | 0.120±0.110 | 0.400±0.424 | 0.080±0.179 | 0.160±0.089 | 0.320±0.390 | 0.160±0.167 |
Understanding NMAE
Interpreting the numbers
- Lower NMAE = better alignment. Top-performing models stay below 0.2.
- The ± values show standard deviation, reflecting model stability across papers.
How we calculate it
$$\mathrm{NMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| r_i - \hat{r}_i \right|}{R_{\max} - R_{\min}}$$

where:
- $N$: the total number of benchmark papers included in the evaluation. This number increases as the benchmark dataset expands and undergoes continuous calibration.
- $r_i$: the ground-truth overall rating of the $i$-th benchmark paper, representing the consensus quantitative evaluation from official or expert evaluations.
- $\hat{r}_i$: the predicted overall rating produced by our CSPR agent for the same paper.
- $R_{\max}$ and $R_{\min}$: the venue-specific upper and lower bounds of the rating scale, respectively (e.g., 5 and 1 for a 1–5 rating system).
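As an illustration only, a minimal Python sketch of this computation could look like the following; the function name `nmae`, the plain-list inputs, and the example ratings are hypothetical and not part of the benchmark pipeline:

```python
import numpy as np

def nmae(true_ratings, predicted_ratings, r_max, r_min):
    """Normalized Mean Absolute Error over one venue's benchmark papers.

    true_ratings, predicted_ratings: per-paper overall ratings
    r_max, r_min: upper and lower bounds of the venue's rating scale
    """
    r = np.asarray(true_ratings, dtype=float)
    r_hat = np.asarray(predicted_ratings, dtype=float)
    # Per-paper absolute error, normalized by the width of the rating scale
    errors = np.abs(r - r_hat) / (r_max - r_min)
    return errors.mean(), errors.std()

# Illustrative ratings on a 1-10 scale (not benchmark data)
mean_err, std_err = nmae([6, 8, 3, 5], [7, 8, 4, 5], r_max=10, r_min=1)
print(f"{mean_err:.3f}±{std_err:.3f}")
```

The reported ± values in the table above correspond to the per-paper standard deviation of these normalized errors within each venue.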
Spearman Rank Correlation (SRC) measures how well the predicted ordering of paper ratings aligns with the true ordering. Unlike error-based metrics (e.g., NMAE), Spearman focuses on the relative ranking between papers, making it robust to scale differences and suitable for assessing whether the reviewer model correctly prioritizes stronger papers over weaker ones.
Benchmark in Progress
We are currently updating our dataset and evaluating new models. Please check back soon for the latest benchmark results.
Understanding SRC
What the values mean
A higher Spearman correlation indicates better agreement with the ground-truth ranking:
- SRC = 1: perfect rank alignment
- SRC = 0: no rank correlation
- SRC = -1: perfectly reversed ranking
The formula
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n \left( n^2 - 1 \right)}$$

where:
- $n$: the number of benchmark papers
- $d_i$: the difference between the ground-truth rank and the predicted rank of the $i$-th paper
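For reference, the same quantity can be obtained with an off-the-shelf implementation; this short sketch uses `scipy.stats.spearmanr` on made-up ratings and is not drawn from our evaluation code:

```python
from scipy.stats import spearmanr

# Hypothetical ground-truth and predicted overall ratings for one venue
true_ratings = [6, 8, 3, 5, 7]
predicted_ratings = [5, 8, 4, 5, 6]

# spearmanr ranks both sequences internally (handling ties) and
# returns the rank correlation coefficient in [-1, 1]
rho, p_value = spearmanr(true_ratings, predicted_ratings)
print(f"SRC = {rho:.3f}")
```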
Why ranking matters
As a rank-based statistic, SRC complements NMAE by capturing ordering accuracy. Even when absolute score predictions vary across venues or models, a high Spearman correlation indicates strong reviewer consistency in distinguishing higher- from lower-quality submissions.
Average Word Count (AWC) reflects the level of verbosity and elaboration in the written reviews. While it does not measure correctness or ranking fidelity directly, word count serves as an important proxy for review completeness, depth of reasoning, and overall writing richness.
| Conference | GPT-5.1 | gemini-2.5-flash | GPT-4.1 | GPT-5 | gemini-2.5-pro | o4-mini | o3 | gemini-3-pro-preview |
|---|---|---|---|---|---|---|---|---|
| AAAI 2022 main technical | 15220 | 14689 | 10706 | 10237 | 14065 | 8629 | 8568 | 5866 |
| AAAI 2024 safe and robust AI | 14989 | 15162 | 9957 | 9401 | 11383 | 6631 | 6666 | 8251 |
| AAAI 2026 main technical | 14575 | 14946 | 12741 | 12503 | 11182 | 12629 | 12993 | 7140 |
| AAMAS 2024 main technical | 18094 | 17820 | 12841 | 14321 | 12581 | 9791 | 7171 | 6580 |
| ACL 2025 main | 19290 | 15780 | 13081 | 14841 | 15263 | 11176 | 12423 | 8057 |
| AISTATS 2025 main | 25039 | 19746 | 13724 | 13670 | 11976 | 12945 | 14673 | 8529 |
| CVPR 2025 main | 16659 | 10806 | 10996 | 11839 | 12440 | 11567 | 10106 | 8389 |
| EMNLP 2025 main | 21176 | 16803 | 14588 | 14238 | 15146 | 5962 | 6358 | 8376 |
| ICASSP 2026 regular paper | 24030 | 20529 | 15627 | 12750 | 13436 | 12360 | 10960 | 9688 |
| ICC 2025 technical symposia | 16534 | 13065 | 11537 | 11070 | 9700 | 6429 | 5828 | 8087 |
| ICLR 2025 main | 15860 | 13561 | 9590 | 9312 | 10638 | 10873 | 4748 | 6651 |
| ICLR 2026 main | 23070 | 14750 | 15739 | 15966 | 11841 | 11384 | 12565 | 7739 |
| ICME 2025 regular | 17639 | 14381 | 14813 | 12279 | 11609 | 6221 | 6520 | 9897 |
| ICML 2025 main | 21814 | 23642 | 17171 | 17317 | 18820 | 8232 | 7405 | 11197 |
| ICML 2025 position | 16074 | 15865 | 13546 | 15015 | 12234 | 3748 | 6228 | 6947 |
| IJCAI 2023 main | 19720 | 18341 | 16542 | 13469 | 13817 | 11915 | 9623 | 9677 |
| IJCAI 2023 survey | 20504 | 10144 | 11254 | 11454 | 10101 | 5736 | 5247 | 5142 |
| KDD 2025 research | 18396 | 12863 | 12067 | 10819 | 11069 | 4714 | 4808 | 7740 |
| NeurIPS 2025 datasets and benchmarks | 19056 | 13947 | 14495 | 12409 | 14505 | 4376 | 5292 | 8957 |
| NeurIPS 2025 main | 21185 | 20061 | 12420 | 12315 | 13745 | 5657 | 5685 | 7582 |
| SIGIR 2025 full paper | 19632 | 17434 | 13771 | 13862 | 14463 | 12300 | 13576 | 7729 |
| TheWebConf 2025 research | 18278 | 12790 | 12569 | 11795 | 11795 | 12562 | 12241 | 6256 |
| WACV 2026 main | 16231 | 13344 | 13592 | 14037 | 12206 | 12771 | 5403 | 10185 |
Understanding AWC
What it tells us
A higher average word count often corresponds to reviews that include more justification, detailed analysis, or structured reasoning. However, excessively long reviews may introduce noise or redundancy, whereas overly short ones may lack sufficient explanation.
How we calculate it
$$\mathrm{AWC} = \frac{1}{N} \sum_{i=1}^{N} \left| \mathrm{Words}(T_i) \right|$$

where:
- $N$: the total number of benchmark papers included in the evaluation. This number increases as the benchmark dataset expands and undergoes continuous calibration.
- $T_i$: the full textual review generated for the $i$-th paper
- $\left| \mathrm{Words}(\cdot) \right|$: the total number of tokenized words in the review
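A minimal sketch of this average, assuming simple whitespace tokenization (the benchmark's actual tokenizer may differ) and a hypothetical `average_word_count` helper:

```python
def average_word_count(reviews):
    """Average number of words across generated reviews.

    reviews: list of full review texts, one string per benchmark paper.
    Whitespace splitting is a simplification of word tokenization.
    """
    counts = [len(review.split()) for review in reviews]
    return sum(counts) / len(counts)

# Illustrative reviews (not benchmark output)
print(average_word_count([
    "Strong paper with a clear method and thorough ablations.",
    "Weak baselines; the evaluation section needs more detail.",
]))
```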
Putting it together
The average word count complements accuracy-oriented metrics such as NMAE and SRC by capturing stylistic characteristics of the generated reviews, revealing how concisely or thoroughly different models articulate their reasoning.
Want to learn more?
For further technical details, see our short paper accepted at INLG 2025 (18th International Conference on Natural Language Generation).
Get high-quality reviews powered by the best AI models
We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.
Start your free review