CSPaper Benchmark: AI Models for CS Paper Reviews
We rigorously test leading AI models to ensure you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing the latest LLMs from OpenAI (GPT and o-series) and Google (Gemini) across top-tier computer science conferences.
How we benchmark
Normalized Mean Absolute Error (NMAE) quantifies how close the predicted paper ratings are to the established ground-truth ratings, after accounting for the venue-specific rating scales. It provides an interpretable, scale-agnostic measure of overall review accuracy, where lower NMAE values indicate better alignment between predicted and true review ratings.
| Conference | gemini-2.5-pro | gemini-3-flash-preview | gemini-3-pro-preview | GPT-4.1 | GPT-5 | o3 | GPT-5.1 | gemini-2.5-flash | GPT-5.2 |
|---|---|---|---|---|---|---|---|---|---|
| AAAI main technical | 0.102±0.125 | 0.134±0.108 | 0.112±0.111 | 0.112±0.108 | 0.114±0.114 | 0.115±0.117 | 0.148±0.132 | 0.105±0.104 | 0.200±0.172 |
| AAAI safe and robust AI | 0.246±0.282 | 0.190±0.312 | 0.262±0.294 | 0.175±0.209 | 0.167±0.228 | 0.270±0.258 | 0.270±0.269 | 0.167±0.235 | 0.325±0.234 |
| AAMAS main technical | 0.162±0.269 | 0.155±0.236 | 0.172±0.260 | 0.197±0.210 | 0.259±0.255 | 0.252±0.252 | 0.266±0.283 | 0.190±0.271 | 0.328±0.303 |
| ACL main | 0.153±0.135 | 0.204±0.153 | 0.181±0.132 | 0.162±0.115 | 0.153±0.159 | 0.130±0.105 | 0.218±0.226 | 0.194±0.210 | 0.208±0.214 |
| AISTATS main | 0.241±0.215 | 0.208±0.211 | 0.182±0.204 | 0.252±0.229 | 0.261±0.218 | 0.251±0.207 | 0.261±0.218 | 0.295±0.237 | 0.390±0.320 |
| CVPR main | 0.214±0.248 | 0.205±0.232 | 0.185±0.243 | 0.209±0.233 | 0.167±0.223 | 0.228±0.249 | 0.272±0.260 | 0.276±0.257 | 0.271±0.258 |
| EMNLP main | 0.150±0.127 | 0.158±0.153 | 0.133±0.100 | 0.150±0.118 | 0.108±0.104 | 0.200±0.199 | 0.183±0.114 | 0.167±0.168 | 0.250±0.295 |
| ICASSP regular paper | 0.261±0.200 | 0.232±0.186 | 0.203±0.207 | 0.217±0.204 | 0.203±0.219 | 0.246±0.251 | 0.290±0.252 | 0.232±0.223 | 0.290±0.252 |
| ICC technical symposia | 0.313±0.272 | 0.380±0.310 | 0.273±0.249 | 0.393±0.260 | 0.313±0.261 | 0.367±0.269 | 0.300±0.267 | 0.313±0.292 | 0.313±0.277 |
| ICLR main | 0.108±0.107 | 0.148±0.130 | 0.135±0.123 | 0.108±0.091 | 0.119±0.108 | 0.140±0.107 | 0.104±0.093 | 0.135±0.113 | 0.133±0.124 |
| ICME regular | 0.276±0.199 | 0.120±0.155 | 0.141±0.182 | 0.141±0.113 | 0.297±0.293 | 0.287±0.254 | 0.182±0.167 | 0.234±0.200 | 0.339±0.323 |
| ICML main | 0.203±0.138 | 0.270±0.212 | 0.194±0.147 | 0.134±0.095 | 0.174±0.170 | 0.185±0.175 | 0.154±0.115 | 0.219±0.181 | 0.179±0.133 |
| ICML position | 0.192±0.163 | 0.215±0.153 | 0.243±0.157 | 0.163±0.121 | 0.179±0.171 | 0.251±0.163 | 0.235±0.238 | 0.200±0.148 | 0.350±0.266 |
| IJCAI main | 0.135±0.117 | 0.079±0.119 | 0.103±0.119 | 0.095±0.184 | 0.182±0.142 | 0.230±0.216 | 0.222±0.169 | 0.198±0.125 | 0.222±0.144 |
| IJCAI survey | 0.094±0.129 | 0.188±0.222 | 0.125±0.189 | 0.219±0.160 | 0.188±0.222 | 0.156±0.186 | 0.188±0.291 | 0.219±0.160 | 0.125±0.189 |
| KDD research | 0.167±0.180 | 0.179±0.173 | 0.205±0.139 | 0.205±0.169 | 0.218±0.208 | 0.180±0.220 | 0.256±0.251 | 0.244±0.370 | 0.321±0.250 |
| NeurIPS datasets and benchmarks | 0.200±0.230 | 0.181±0.229 | 0.204±0.223 | 0.183±0.217 | 0.183±0.252 | 0.316±0.326 | 0.275±0.214 | 0.295±0.308 | 0.565±0.349 |
| NeurIPS main | 0.189±0.184 | 0.205±0.177 | 0.232±0.191 | 0.183±0.150 | 0.256±0.293 | 0.285±0.303 | 0.218±0.199 | 0.245±0.215 | 0.255±0.239 |
| SIGIR full paper | 0.025±0.079 | 0.175±0.237 | 0.150±0.242 | 0.175±0.237 | 0.125±0.177 | 0.175±0.206 | 0.250±0.333 | 0.125±0.177 | 0.325±0.472 |
| TheWebConf research | 0.139±0.138 | 0.143±0.101 | 0.127±0.089 | 0.113±0.087 | 0.135±0.115 | 0.106±0.066 | 0.122±0.073 | 0.156±0.113 | 0.205±0.220 |
| WACV main | 0.255±0.221 | 0.182±0.140 | 0.164±0.121 | 0.164±0.175 | 0.218±0.140 | 0.182±0.166 | 0.291±0.207 | 0.309±0.226 | 0.364±0.175 |
Understanding NMAE
Interpreting the numbers
- Lower NMAE = better alignment. Top-performing models typically stay below 0.2 on most venues.
- The ± values show one standard deviation, reflecting model stability across papers.
How we calculate it
$$\mathrm{NMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert \hat{y}_i - y_i \rvert}{R_{\max} - R_{\min}}$$

- $N$: the total number of benchmark papers included in the evaluation. This number increases as the benchmark dataset expands and undergoes continuous calibration.
- $y_i$: the ground-truth overall rating of the i-th benchmark paper, representing the consensus quantitative evaluation from official or expert evaluations.
- $\hat{y}_i$: the predicted overall rating produced by our CSPR agent for the same paper.
- $R_{\max}$ and $R_{\min}$: the venue-specific upper and lower bounds of the rating scale, respectively (e.g., 5 and 1 for a 1–5 rating system).
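For illustration, here is a minimal Python sketch of the NMAE computation. The ratings and the 1–10 scale below are hypothetical examples, not values from the benchmark itself.

```python
import numpy as np

def nmae(y_true, y_pred, r_min, r_max):
    """Normalized Mean Absolute Error over a venue's rating scale."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Mean absolute rating error, normalized by the width of the rating scale
    return np.mean(np.abs(y_pred - y_true)) / (r_max - r_min)

# Illustrative only: three papers rated on a hypothetical 1-10 scale
ground_truth = [6, 3, 8]
predicted    = [5, 4, 8]
print(nmae(ground_truth, predicted, r_min=1, r_max=10))  # ~0.074
```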
Spearman Rank Correlation (SRC) measures how well the predicted ordering of paper ratings aligns with the true ordering. Unlike error-based metrics (e.g., NMAE), Spearman focuses on the relative ranking between papers, making it robust to scale differences and suitable for assessing whether the reviewer model correctly prioritizes stronger papers over weaker ones.
Baseline SRC scores from the ICLR 2025 dataset: AgentReview (SRC = 0.12), AI Scientist (SRC = 0.32), CycleReviewer (SRC = 0.28), DeepReviewer (SRC = 0.40), Human Reviewer (SRC = 0.41). Because these baselines were measured on the ICLR 2025 dataset only, they are not directly comparable to our per-venue SRC values, but they provide a useful point of reference.
| Conference | gemini-3-pro-preview | gemini-2.5-pro | GPT-5 | o3 | gemini-3-flash-preview | GPT-4.1 | gemini-2.5-flash | GPT-5.2 | GPT-5.1 |
|---|---|---|---|---|---|---|---|---|---|
| AAAI main technical | 0.779 | 0.865 | 0.762 | 0.698 | 0.803 | 0.711 | 0.806 | 0.696 | 0.690 |
| AAAI safe and robust AI | 0.250 | 0.436 | 0.641 | 0.573 | 0.273 | 0.591 | 0.583 | 0.704 | 0.524 |
| AAMAS main technical | 0.635 | 0.731 | 0.703 | 0.675 | 0.724 | 0.716 | 0.725 | 0.601 | 0.644 |
| ACL main | 0.403 | 0.404 | 0.254 | 0.454 | 0.291 | 0.351 | 0.247 | 0.343 | 0.087 |
| AISTATS main | 0.713 | 0.410 | 0.132 | 0.397 | 0.554 | 0.483 | 0.229 | 0.149 | 0.224 |
| CVPR main | 0.711 | 0.531 | 0.669 | 0.355 | 0.638 | 0.452 | 0.344 | 0.659 | 0.312 |
| EMNLP main | 0.830 | 0.716 | 0.597 | 0.033 | 0.577 | 0.434 | 0.607 | 0.555 | 0.258 |
| ICASSP regular paper | 0.415 | 0.226 | 0.513 | 0.149 | 0.278 | 0.387 | 0.276 | -0.024 | 0.066 |
| ICC technical symposia | 0.653 | 0.572 | 0.566 | 0.308 | 0.411 | 0.292 | 0.463 | 0.618 | 0.615 |
| ICLR main | 0.713 | 0.773 | 0.712 | 0.540 | 0.708 | 0.788 | 0.610 | 0.601 | 0.718 |
| ICME regular | 0.570 | -0.274 | 0.222 | 0.542 | 0.324 | 0.009 | 0.208 | 0.553 | 0.257 |
| ICML main | 0.727 | 0.638 | 0.764 | 0.736 | 0.712 | 0.812 | 0.593 | 0.709 | 0.766 |
| ICML position | 0.496 | 0.427 | 0.505 | 0.232 | 0.440 | 0.479 | 0.369 | 0.091 | 0.469 |
| IJCAI main | 0.681 | 0.553 | 0.518 | 0.292 | 0.687 | 0.507 | 0.275 | 0.683 | 0.465 |
| IJCAI survey | 0.882 | 0.949 | 0.546 | 0.770 | 0.882 | 0.629 | 0.629 | 0.862 | 0.624 |
| KDD research | 0.726 | 0.536 | 0.631 | 0.834 | 0.609 | 0.510 | 0.108 | 0.549 | 0.549 |
| NeurIPS datasets and benchmarks | 0.430 | 0.310 | 0.494 | -0.021 | 0.397 | 0.458 | 0.343 | 0.225 | 0.169 |
| NeurIPS main | 0.645 | 0.608 | 0.556 | 0.371 | 0.675 | 0.672 | 0.415 | 0.484 | 0.465 |
| SIGIR full paper | 0.718 | 0.981 | 0.806 | 0.744 | 0.771 | 0.832 | 0.793 | 0.841 | 0.786 |
| TheWebConf research | 0.177 | 0.254 | 0.039 | 0.226 | 0.240 | 0.062 | 0.271 | 0.257 | 0.038 |
| WACV main | 0.764 | 0.261 | 0.613 | 0.783 | 0.765 | 0.530 | 0.470 | 0.707 | 0.461 |
Understanding SRC
What the values mean
A higher Spearman correlation indicates better agreement with the ground-truth ranking:
- SRC = 1: perfect rank alignment
- SRC = 0: no rank correlation
- SRC = -1: perfectly reversed ranking
The formula
$$\mathrm{SRC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$$

- $N$: number of benchmark papers
- $d_i$: difference between ground-truth rank and predicted rank for the i-th paper
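Below is a minimal Python sketch of the same computation, once via `scipy.stats.spearmanr` and once via the closed-form rank-difference formula above. The ratings are hypothetical, and the closed-form version assumes no tied ranks (the benchmark's actual handling of ties may differ).

```python
import numpy as np
from scipy.stats import spearmanr

def spearman_src(y_true, y_pred):
    """Spearman rank correlation between true and predicted ratings."""
    rho, _ = spearmanr(y_true, y_pred)
    return rho

def spearman_from_ranks(y_true, y_pred):
    """Closed-form rank-difference formula; valid when there are no ties."""
    n = len(y_true)
    rank_true = np.argsort(np.argsort(y_true))
    rank_pred = np.argsort(np.argsort(y_pred))
    d = rank_true - rank_pred
    return 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

# Illustrative only: four papers with hypothetical ratings
ground_truth = [6, 3, 8, 5]
predicted    = [5, 4, 8, 6]
print(spearman_src(ground_truth, predicted))        # 0.8
print(spearman_from_ranks(ground_truth, predicted)) # 0.8
```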
Why ranking matters
As a rank-based statistic, SRC complements NMAE by capturing ordering accuracy. Even when absolute score predictions vary across venues or models, a high Spearman correlation indicates strong reviewer consistency in distinguishing higher- from lower-quality submissions.
Average Word Count (AWC) reflects the level of verbosity and elaboration in the written reviews. While it does not measure correctness or ranking fidelity directly, word count serves as an important proxy for review completeness, depth of reasoning, and overall writing richness.
| Conference | GPT-5.1 | gemini-2.5-flash | GPT-4.1 | GPT-5 | gemini-2.5-pro | GPT-5.2 | gemini-3-flash-preview | gemini-3-pro-preview | o3 |
|---|---|---|---|---|---|---|---|---|---|
| AAAI main technical | 20815 | 14922 | 13527 | 13230 | 11408 | 11351 | 8152 | 7360 | 6116 |
| AAAI safe and robust AI | 15492 | 14207 | 12287 | 12723 | 11950 | 7824 | 8739 | 8065 | 4940 |
| AAMAS main technical | 13088 | 11892 | 11156 | 10300 | 11279 | 6942 | 8501 | 6468 | 5700 |
| ACL main | 24698 | 18135 | 15871 | 15924 | 15323 | 14821 | 11245 | 9920 | 8603 |
| AISTATS main | 24427 | 14031 | 11192 | 12610 | 10760 | 10322 | 7704 | 7693 | 8131 |
| CVPR main | 18495 | 12736 | 11652 | 12646 | 11513 | 11750 | 8585 | 8992 | 6348 |
| EMNLP main | 23275 | 16097 | 14541 | 14883 | 15230 | 11584 | 10157 | 8960 | 6410 |
| ICASSP regular paper | 25231 | 21402 | 15939 | 13870 | 18372 | 14725 | 9570 | 10538 | 8573 |
| ICC technical symposia | 14676 | 12047 | 10429 | 9317 | 9814 | 9156 | 8422 | 7171 | 5238 |
| ICLR main | 21264 | 15014 | 13469 | 14838 | 11435 | 13245 | 8736 | 7585 | 6613 |
| ICME regular | 19151 | 15580 | 14878 | 12187 | 12406 | 9219 | 9838 | 10077 | 6326 |
| ICML main | 32594 | 25998 | 18605 | 17033 | 18900 | 18625 | 12798 | 12395 | 8086 |
| ICML position | 25153 | 16330 | 17473 | 14820 | 14098 | 13355 | 10946 | 8204 | 8095 |
| IJCAI main | 22570 | 17675 | 16075 | 15851 | 13913 | 13600 | 9832 | 9777 | 8069 |
| IJCAI survey | 20498 | 10085 | 11142 | 11464 | 10101 | 9535 | 7210 | 5149 | 5238 |
| KDD research | 18561 | 13719 | 13340 | 12681 | 12578 | 9870 | 8629 | 8107 | 6450 |
| NeurIPS datasets and benchmarks | 20990 | 17502 | 14050 | 14466 | 13300 | 8839 | 9713 | 9607 | 6580 |
| NeurIPS main | 21570 | 18202 | 15011 | 13016 | 14393 | 12958 | 9740 | 8772 | 5893 |
| SIGIR full paper | 19617 | 14712 | 14589 | 15868 | 14587 | 9295 | 8903 | 8715 | 6635 |
| TheWebConf research | 21985 | 16182 | 15271 | 14163 | 12466 | 12932 | 9050 | 7916 | 6618 |
| WACV main | 19254 | 12439 | 12446 | 12359 | 11525 | 12898 | 9063 | 9908 | 6805 |
Understanding AWC
What it tells us
A higher average word count often corresponds to reviews that include more justification, detailed analysis, or structured reasoning. However, excessively long reviews may introduce noise or redundancy, whereas overly short ones may lack sufficient explanation.
How we calculate it

$$\mathrm{AWC} = \frac{1}{N} \sum_{i=1}^{N} \lvert \mathrm{Words}(r_i) \rvert$$

- $N$: the total number of benchmark papers included in the evaluation. This number increases as the benchmark dataset expands and undergoes continuous calibration.
- $r_i$: the full textual review generated for the i-th paper
- $\lvert \mathrm{Words}(\cdot) \rvert$: counts the total number of tokenized words in the review
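A minimal Python sketch of this computation is shown below, assuming a simple whitespace tokenizer; the benchmark's actual tokenization may differ, and the example reviews are hypothetical.

```python
def average_word_count(reviews):
    """Average number of whitespace-tokenized words across generated reviews."""
    # Simple whitespace split; the benchmark may use a different tokenizer.
    return sum(len(review.split()) for review in reviews) / len(reviews)

# Illustrative only: two toy reviews
reviews = [
    "The method is sound but the evaluation lacks baselines.",
    "Strong results; however, the ablation study is incomplete.",
]
print(average_word_count(reviews))  # 8.5
```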
Putting it together
The average word count complements accuracy-oriented metrics such as NMAE and SRC by capturing stylistic characteristics of the generated reviews, revealing how concisely or thoroughly different models articulate their reasoning.
Want to learn more?
For further technical details, see our short paper accepted at INLG 2025 (18th International Conference on Natural Language Generation).
Get high-quality reviews powered by the best AI models
We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.
Start your free review