CSPaper Benchmark: AI Models for CS Paper Reviews

We rigorously test leading AI models so that you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing the latest LLMs from OpenAI (the GPT and o series) and Google (Gemini) across a range of top-tier computer science conferences.

Try it yourself: 3 free reviews/month

Privacy and security first

CSPaper ensures all papers remain confidential and legally protected. Content is processed securely and safeguarded under enterprise-level agreements with LLM providers — guaranteeing it is never used to train any model.

How we benchmark

Normalized Mean Absolute Error (NMAE) quantifies how close the predicted paper ratings are to the established ground-truth ratings, after accounting for the venue-specific rating scales. It provides an interpretable, scale-agnostic measure of overall review accuracy, where lower NMAE values indicate better alignment between predicted and true review ratings.

NMAE by conference (mean ± standard deviation; lower is better):

| Conference | gemini-2.5-pro | gemini-3-flash-preview | gemini-3-pro-preview | GPT-4.1 | GPT-5 | o3 | GPT-5.1 | gemini-2.5-flash | GPT-5.2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AAAI main technical | 0.102±0.125 | 0.134±0.108 | 0.112±0.111 | 0.112±0.108 | 0.114±0.114 | 0.115±0.117 | 0.148±0.132 | 0.105±0.104 | 0.200±0.172 |
| AAAI safe and robust AI | 0.246±0.282 | 0.190±0.312 | 0.262±0.294 | 0.175±0.209 | 0.167±0.228 | 0.270±0.258 | 0.270±0.269 | 0.167±0.235 | 0.325±0.234 |
| AAMAS main technical | 0.162±0.269 | 0.155±0.236 | 0.172±0.260 | 0.197±0.210 | 0.259±0.255 | 0.252±0.252 | 0.266±0.283 | 0.190±0.271 | 0.328±0.303 |
| ACL main | 0.153±0.135 | 0.204±0.153 | 0.181±0.132 | 0.162±0.115 | 0.153±0.159 | 0.130±0.105 | 0.218±0.226 | 0.194±0.210 | 0.208±0.214 |
| AISTATS main | 0.241±0.215 | 0.208±0.211 | 0.182±0.204 | 0.252±0.229 | 0.261±0.218 | 0.251±0.207 | 0.261±0.218 | 0.295±0.237 | 0.390±0.320 |
| CVPR main | 0.214±0.248 | 0.205±0.232 | 0.185±0.243 | 0.209±0.233 | 0.167±0.223 | 0.228±0.249 | 0.272±0.260 | 0.276±0.257 | 0.271±0.258 |
| EMNLP main | 0.150±0.127 | 0.158±0.153 | 0.133±0.100 | 0.150±0.118 | 0.108±0.104 | 0.200±0.199 | 0.183±0.114 | 0.167±0.168 | 0.250±0.295 |
| ICASSP regular paper | 0.261±0.200 | 0.232±0.186 | 0.203±0.207 | 0.217±0.204 | 0.203±0.219 | 0.246±0.251 | 0.290±0.252 | 0.232±0.223 | 0.290±0.252 |
| ICC technical symposia | 0.313±0.272 | 0.380±0.310 | 0.273±0.249 | 0.393±0.260 | 0.313±0.261 | 0.367±0.269 | 0.300±0.267 | 0.313±0.292 | 0.313±0.277 |
| ICLR main | 0.108±0.107 | 0.148±0.130 | 0.135±0.123 | 0.108±0.091 | 0.119±0.108 | 0.140±0.107 | 0.104±0.093 | 0.135±0.113 | 0.133±0.124 |
| ICME regular | 0.276±0.199 | 0.120±0.155 | 0.141±0.182 | 0.141±0.113 | 0.297±0.293 | 0.287±0.254 | 0.182±0.167 | 0.234±0.200 | 0.339±0.323 |
| ICML main | 0.203±0.138 | 0.270±0.212 | 0.194±0.147 | 0.134±0.095 | 0.174±0.170 | 0.185±0.175 | 0.154±0.115 | 0.219±0.181 | 0.179±0.133 |
| ICML position | 0.192±0.163 | 0.215±0.153 | 0.243±0.157 | 0.163±0.121 | 0.179±0.171 | 0.251±0.163 | 0.235±0.238 | 0.200±0.148 | 0.350±0.266 |
| IJCAI main | 0.135±0.117 | 0.079±0.119 | 0.103±0.119 | 0.095±0.184 | 0.182±0.142 | 0.230±0.216 | 0.222±0.169 | 0.198±0.125 | 0.222±0.144 |
| IJCAI survey | 0.094±0.129 | 0.188±0.222 | 0.125±0.189 | 0.219±0.160 | 0.188±0.222 | 0.156±0.186 | 0.188±0.291 | 0.219±0.160 | 0.125±0.189 |
| KDD research | 0.167±0.180 | 0.179±0.173 | 0.205±0.139 | 0.205±0.169 | 0.218±0.208 | 0.180±0.220 | 0.256±0.251 | 0.244±0.370 | 0.321±0.250 |
| NeurIPS datasets and benchmarks | 0.200±0.230 | 0.181±0.229 | 0.204±0.223 | 0.183±0.217 | 0.183±0.252 | 0.316±0.326 | 0.275±0.214 | 0.295±0.308 | 0.565±0.349 |
| NeurIPS main | 0.189±0.184 | 0.205±0.177 | 0.232±0.191 | 0.183±0.150 | 0.256±0.293 | 0.285±0.303 | 0.218±0.199 | 0.245±0.215 | 0.255±0.239 |
| SIGIR full paper | 0.025±0.079 | 0.175±0.237 | 0.150±0.242 | 0.175±0.237 | 0.125±0.177 | 0.175±0.206 | 0.250±0.333 | 0.125±0.177 | 0.325±0.472 |
| TheWebConf research | 0.139±0.138 | 0.143±0.101 | 0.127±0.089 | 0.113±0.087 | 0.135±0.115 | 0.106±0.066 | 0.122±0.073 | 0.156±0.113 | 0.205±0.220 |
| WACV main | 0.255±0.221 | 0.182±0.140 | 0.164±0.121 | 0.164±0.175 | 0.218±0.140 | 0.182±0.166 | 0.291±0.207 | 0.309±0.226 | 0.364±0.175 |

Understanding NMAE

Interpreting the numbers

  • Lower NMAE = better alignment. Top-performing models stay below 0.2.
  • The ± values show standard deviation, reflecting model stability across papers.

How we calculate it

\text{NMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|S_{\text{gt}}^{(i)} - S_{\text{pred}}^{(i)}|}{S_{\text{max}} - S_{\text{min}}}

N: the total number of benchmark papers included in the evaluation. This number increases as the benchmark dataset expands and undergoes continuous calibration.

S_{\text{gt}}^{(i)}: the ground-truth overall rating of the i-th benchmark paper, representing the consensus quantitative score from official or expert reviews.

S_{\text{pred}}^{(i)}: the predicted overall rating produced by our CSPR agent for the same paper.

S_{\text{max}} and S_{\text{min}}: the venue-specific upper and lower bounds of the rating scale, respectively (e.g., 5 and 1 for a 1–5 rating system).
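
To make the computation concrete, here is a minimal Python sketch of NMAE as defined above; the function name and the example ratings on a hypothetical 1–10 scale are illustrative only, not benchmark data.

```python
def nmae(gt_ratings, pred_ratings, s_min, s_max):
    """Normalized Mean Absolute Error between ground-truth and predicted ratings."""
    assert len(gt_ratings) == len(pred_ratings) and s_max > s_min
    scale = s_max - s_min  # venue-specific rating range, S_max - S_min
    errors = [abs(gt - pred) / scale for gt, pred in zip(gt_ratings, pred_ratings)]
    return sum(errors) / len(errors)

# Hypothetical example on a 1-10 scale:
# |6-5| + |3-4| + |8-8| = 2, normalized by 9 and averaged over 3 papers -> ~0.074
print(nmae([6, 3, 8], [5, 4, 8], s_min=1, s_max=10))
```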

Spearman Rank Correlation (SRC) measures how well the predicted ordering of paper ratings aligns with the true ordering. Unlike error-based metrics (e.g., NMAE), Spearman focuses on the relative ranking between papers, making it robust to scale differences and suitable for assessing whether the reviewer model correctly prioritizes stronger papers over weaker ones.

Baseline SRC scores reported on the ICLR 2025 dataset: AgentReview (SRC = 0.12), AI Scientist (SRC = 0.32), CycleReviewer (SRC = 0.28), DeepReviewer (SRC = 0.40), Human Reviewer (SRC = 0.41). Because these baselines were computed on the ICLR 2025 dataset only, they are not directly comparable to our per-venue SRC values, but they offer a useful point of reference.

SRC by conference (higher is better):

| Conference | gemini-3-pro-preview | gemini-2.5-pro | GPT-5 | o3 | gemini-3-flash-preview | GPT-4.1 | gemini-2.5-flash | GPT-5.2 | GPT-5.1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AAAI main technical | 0.779 | 0.865 | 0.762 | 0.698 | 0.803 | 0.711 | 0.806 | 0.696 | 0.690 |
| AAAI safe and robust AI | 0.250 | 0.436 | 0.641 | 0.573 | 0.273 | 0.591 | 0.583 | 0.704 | 0.524 |
| AAMAS main technical | 0.635 | 0.731 | 0.703 | 0.675 | 0.724 | 0.716 | 0.725 | 0.601 | 0.644 |
| ACL main | 0.403 | 0.404 | 0.254 | 0.454 | 0.291 | 0.351 | 0.247 | 0.343 | 0.087 |
| AISTATS main | 0.713 | 0.410 | 0.132 | 0.397 | 0.554 | 0.483 | 0.229 | 0.149 | 0.224 |
| CVPR main | 0.711 | 0.531 | 0.669 | 0.355 | 0.638 | 0.452 | 0.344 | 0.659 | 0.312 |
| EMNLP main | 0.830 | 0.716 | 0.597 | 0.033 | 0.577 | 0.434 | 0.607 | 0.555 | 0.258 |
| ICASSP regular paper | 0.415 | 0.226 | 0.513 | 0.149 | 0.278 | 0.387 | 0.276 | -0.024 | 0.066 |
| ICC technical symposia | 0.653 | 0.572 | 0.566 | 0.308 | 0.411 | 0.292 | 0.463 | 0.618 | 0.615 |
| ICLR main | 0.713 | 0.773 | 0.712 | 0.540 | 0.708 | 0.788 | 0.610 | 0.601 | 0.718 |
| ICME regular | 0.570 | -0.274 | 0.222 | 0.542 | 0.324 | 0.009 | 0.208 | 0.553 | 0.257 |
| ICML main | 0.727 | 0.638 | 0.764 | 0.736 | 0.712 | 0.812 | 0.593 | 0.709 | 0.766 |
| ICML position | 0.496 | 0.427 | 0.505 | 0.232 | 0.440 | 0.479 | 0.369 | 0.091 | 0.469 |
| IJCAI main | 0.681 | 0.553 | 0.518 | 0.292 | 0.687 | 0.507 | 0.275 | 0.683 | 0.465 |
| IJCAI survey | 0.882 | 0.949 | 0.546 | 0.770 | 0.882 | 0.629 | 0.629 | 0.862 | 0.624 |
| KDD research | 0.726 | 0.536 | 0.631 | 0.834 | 0.609 | 0.510 | 0.108 | 0.549 | 0.549 |
| NeurIPS datasets and benchmarks | 0.430 | 0.310 | 0.494 | -0.021 | 0.397 | 0.458 | 0.343 | 0.225 | 0.169 |
| NeurIPS main | 0.645 | 0.608 | 0.556 | 0.371 | 0.675 | 0.672 | 0.415 | 0.484 | 0.465 |
| SIGIR full paper | 0.718 | 0.981 | 0.806 | 0.744 | 0.771 | 0.832 | 0.793 | 0.841 | 0.786 |
| TheWebConf research | 0.177 | 0.254 | 0.039 | 0.226 | 0.240 | 0.062 | 0.271 | 0.257 | 0.038 |
| WACV main | 0.764 | 0.261 | 0.613 | 0.783 | 0.765 | 0.530 | 0.470 | 0.707 | 0.461 |

Understanding SRC

What the values mean

A higher Spearman correlation indicates better agreement with the ground-truth ranking:

  • SRC = 1: perfect rank alignment
  • SRC = 0: no rank correlation
  • SRC = -1: perfectly reversed ranking

The formula

\text{SRC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}

N: number of benchmark papers

d_i: difference between ground-truth rank and predicted rank for the i-th paper
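
The same rank-difference formula can be computed directly in a few lines of Python; this sketch assumes no tied ratings (the simple formula above applies only in that case), and the sample ratings are illustrative only.

```python
def src(gt_ratings, pred_ratings):
    """Spearman rank correlation via the rank-difference formula (assumes no tied ratings)."""
    n = len(gt_ratings)

    def to_ranks(values):
        # Rank 1 = highest rating; any consistent direction works for the formula.
        order = sorted(range(n), key=lambda i: values[i], reverse=True)
        ranks = [0] * n
        for rank, idx in enumerate(order, start=1):
            ranks[idx] = rank
        return ranks

    gt_rank, pred_rank = to_ranks(gt_ratings), to_ranks(pred_ratings)
    d_squared = sum((g - p) ** 2 for g, p in zip(gt_rank, pred_rank))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical ratings for four papers; only the middle two papers swap ranks.
print(src([8, 5, 6, 3], [7, 6, 5, 2]))  # 1 - 6*2 / (4*15) = 0.8
```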

Why ranking matters

As a rank-based statistic, SRC complements NMAE by capturing ordering accuracy. Even when absolute score predictions vary across venues or models, a high Spearman correlation indicates strong reviewer consistency in distinguishing higher- from lower-quality submissions.

Average Word Count (AWC) reflects the level of verbosity and elaboration in the written reviews. While it does not measure correctness or ranking fidelity directly, word count serves as an important proxy for review completeness, depth of reasoning, and overall writing richness.

Average word count per review, by conference:

| Conference | GPT-5.1 | gemini-2.5-flash | GPT-4.1 | GPT-5 | gemini-2.5-pro | GPT-5.2 | gemini-3-flash-preview | gemini-3-pro-preview | o3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AAAI main technical | 20815 | 14922 | 13527 | 13230 | 11408 | 11351 | 8152 | 7360 | 6116 |
| AAAI safe and robust AI | 15492 | 14207 | 12287 | 12723 | 11950 | 7824 | 8739 | 8065 | 4940 |
| AAMAS main technical | 13088 | 11892 | 11156 | 10300 | 11279 | 6942 | 8501 | 6468 | 5700 |
| ACL main | 24698 | 18135 | 15871 | 15924 | 15323 | 14821 | 11245 | 9920 | 8603 |
| AISTATS main | 24427 | 14031 | 11192 | 12610 | 10760 | 10322 | 7704 | 7693 | 8131 |
| CVPR main | 18495 | 12736 | 11652 | 12646 | 11513 | 11750 | 8585 | 8992 | 6348 |
| EMNLP main | 23275 | 16097 | 14541 | 14883 | 15230 | 11584 | 10157 | 8960 | 6410 |
| ICASSP regular paper | 25231 | 21402 | 15939 | 13870 | 18372 | 14725 | 9570 | 10538 | 8573 |
| ICC technical symposia | 14676 | 12047 | 10429 | 9317 | 9814 | 9156 | 8422 | 7171 | 5238 |
| ICLR main | 21264 | 15014 | 13469 | 14838 | 11435 | 13245 | 8736 | 7585 | 6613 |
| ICME regular | 19151 | 15580 | 14878 | 12187 | 12406 | 9219 | 9838 | 10077 | 6326 |
| ICML main | 32594 | 25998 | 18605 | 17033 | 18900 | 18625 | 12798 | 12395 | 8086 |
| ICML position | 25153 | 16330 | 17473 | 14820 | 14098 | 13355 | 10946 | 8204 | 8095 |
| IJCAI main | 22570 | 17675 | 16075 | 15851 | 13913 | 13600 | 9832 | 9777 | 8069 |
| IJCAI survey | 20498 | 10085 | 11142 | 11464 | 10101 | 9535 | 7210 | 5149 | 5238 |
| KDD research | 18561 | 13719 | 13340 | 12681 | 12578 | 9870 | 8629 | 8107 | 6450 |
| NeurIPS datasets and benchmarks | 20990 | 17502 | 14050 | 14466 | 13300 | 8839 | 9713 | 9607 | 6580 |
| NeurIPS main | 21570 | 18202 | 15011 | 13016 | 14393 | 12958 | 9740 | 8772 | 5893 |
| SIGIR full paper | 19617 | 14712 | 14589 | 15868 | 14587 | 9295 | 8903 | 8715 | 6635 |
| TheWebConf research | 21985 | 16182 | 15271 | 14163 | 12466 | 12932 | 9050 | 7916 | 6618 |
| WACV main | 19254 | 12439 | 12446 | 12359 | 11525 | 12898 | 9063 | 9908 | 6805 |

Understanding AWC

What it tells us

A higher average word count often corresponds to reviews that include more justification, detailed analysis, or structured reasoning. However, excessively long reviews may introduce noise or redundancy, whereas overly short ones may lack sufficient explanation.

How we calculate it

\text{AWC} = \frac{1}{N} \sum_{i=1}^{N} |\text{Words}(R_{\text{cspr}}^{(i)})|

N: the total number of benchmark papers included in the evaluation. This number increases as the benchmark dataset expands and undergoes continuous calibration.

R_{\text{cspr}}^{(i)}: the full textual review generated for the i-th paper

|\text{Words}(\cdot)|: the total number of tokenized words in the review
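
A minimal Python sketch of the computation, assuming a simple whitespace tokenizer (the tokenization used in our pipeline may differ); the two toy reviews are illustrative only.

```python
import re

def awc(reviews):
    """Average Word Count over a list of generated review texts."""
    word_counts = [len(re.findall(r"\S+", review)) for review in reviews]
    return sum(word_counts) / len(word_counts)

# Two toy reviews (9 and 10 whitespace-delimited tokens) -> AWC = 9.5
print(awc([
    "The method is sound but the ablations are limited.",
    "Strong empirical results; clarity of Section 3 could be improved.",
]))
```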

Putting it together

The average word count complements accuracy-oriented metrics such as NMAE and SRC by capturing stylistic characteristics of the generated reviews, revealing how concisely or thoroughly different models articulate their reasoning.

Want to learn more?

For further technical details, see our short paper accepted at INLG 2025 (18th International Conference on Natural Language Generation).

Get high-quality reviews powered by the best AI models

We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.

Start your free review