Benchmark on LLMs and Venues

We rigorously test leading AI models to ensure you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing the latest LLMs from OpenAI (GPT) and Google (Gemini) across top-tier computer science conferences and journals.

Try it yourself: 3 free reviews/month

Privacy and security first

CSPaper ensures all papers remain confidential and legally protected. Content is processed securely and safeguarded under enterprise-level agreements with LLM providers, which guarantee it is never used to train any model.

Benchmark Metrics

Four metrics are benchmarked: Accuracy (NMAE), Bias (NME), Ranking (SRC), and Elaborateness (AWC). The results shown here are for NMAE.

Normalized Mean Absolute Error (NMAE) quantifies how close the predicted paper ratings are to the established ground-truth ratings. Lower NMAE values indicate better alignment between predicted and true review ratings.

Conference (28) | Journal (10)

Venue	Track	GPT-5.4	Gemini-3.1-pro	Gemini-2.5-pro	GPT-5	GPT-5.1	Gemini-3-flash
ESWA	main	0.200±0.215	0.156±0.145	0.263±0.221	0.200±0.215	0.125±0.115	0.188±0.189
FORGE	intelligence	0.111±0.143	0.116±0.117	0.041±0.055	0.066±0.139	0.066±0.092	0.129±0.119
IJMLC	regular	0.221±0.094	0.179±0.128	—	—	—	—
JBD	regular	0.214±0.157	0.214±0.187	0.143±0.183	0.250±0.191	0.214±0.213	0.214±0.213
MLJ	regular	0.242±0.108	0.193±0.123	0.168±0.134	0.217±0.165	0.217±0.152	0.177±0.172
Neurocomputing	main	0.275±0.124	0.190±0.138	0.185±0.150	0.200±0.142	0.300±0.206	0.180±0.169
PAA	regular	0.121±0.108	0.104±0.105	—	—	—	—
SciRep	regular	0.150±0.137	0.100±0.137	0.150±0.137	0.200±0.112	0.150±0.137	0.300±0.209
TMLR	regular	0.182±0.141	0.250±0.224	—	—	0.136±0.104	—
TVC	regular	0.208±0.204	0.125±0.158	—	—	—	—
AAAI	main technical	0.164±0.131	0.110±0.102	0.097±0.111	0.111±0.113	0.145±0.129	0.131±0.107
AAAI	safe and robust AI	0.214±0.227	0.230±0.245	0.214±0.249	0.159±0.207	0.238±0.241	0.175±0.274
AAMAS	main technical	0.134±0.105	0.099±0.105	0.079±0.098	0.210±0.168	0.183±0.196	0.106±0.086
ACL	main	0.154±0.144	0.213±0.147	0.162±0.131	0.150±0.150	0.205±0.186	0.189±0.129
AISTATS	main	0.266±0.261	0.176±0.185	0.212±0.171	0.231±0.178	0.231±0.178	0.179±0.160
CoG	technical and vision	0.100±0.124	0.137±0.174	0.163±0.227	0.155±0.134	0.107±0.128	0.211±0.220
CVPR	main	0.242±0.216	0.228±0.205	0.201±0.213	0.154±0.181	0.249±0.229	0.192±0.194
ECML-PKDD	research	0.102±0.180	0.188±0.191	0.141±0.046	0.169±0.158	0.141±0.162	0.231±0.149
EMNLP	main	0.117±0.167	0.117±0.088	0.150±0.127	0.108±0.104	0.183±0.114	0.158±0.153
FAccT	main	0.107±0.112	0.250±0.239	—	—	—	—
GFM	all	0.164±0.016	0.198±0.130	—	—	—	—
ICASSP	regular paper	0.275±0.239	0.246±0.240	0.261±0.200	0.203±0.219	0.290±0.252	0.232±0.186
ICC	technical symposia	0.233±0.241	0.287±0.272	0.260±0.247	0.273±0.249	0.260±0.253	0.353±0.256
ICLR	main	0.115±0.129	0.141±0.117	0.108±0.107	0.119±0.108	0.104±0.093	0.148±0.130
ICME	regular	0.266±0.181	0.162±0.162	0.276±0.199	0.297±0.293	0.182±0.167	0.120±0.155
ICML	main	0.135±0.115	0.186±0.135	0.196±0.137	0.190±0.150	0.170±0.123	0.266±0.183
ICML	position	0.169±0.136	0.149±0.108	—	—	—	—
IJCAI	main	0.190±0.141	0.174±0.143	0.159±0.121	0.198±0.139	0.238±0.162	0.095±0.130
IJCAI	survey	0.062±0.116	0.188±0.222	0.094±0.129	0.188±0.222	0.188±0.291	0.188±0.222
IROS	main	0.121±0.109	0.125±0.065	0.104±0.039	0.093±0.093	0.086±0.043	0.143±0.080
KDD	datasets and benchmarks	0.078±0.050	0.217±0.205	0.214±0.228	0.175±0.166	0.189±0.198	0.256±0.208
KDD	research	0.231±0.174	0.180±0.107	0.167±0.180	0.218±0.158	0.205±0.182	0.179±0.173
NeurIPS	datasets and benchmarks	0.249±0.196	0.161±0.197	0.193±0.203	0.170±0.210	0.268±0.187	0.174±0.202
NeurIPS	main	0.144±0.141	0.185±0.177	0.183±0.163	0.207±0.228	0.203±0.185	0.193±0.158
SIGIR	full paper	0.163±0.205	0.138±0.208	0.037±0.084	0.113±0.171	0.237±0.266	0.188±0.230
SIGIR	short paper	0.132±0.104	0.154±0.192	0.296±0.288	0.154±0.127	0.189±0.151	0.154±0.192
TheWebConf	research	0.149±0.154	0.106±0.086	0.141±0.137	0.126±0.114	0.116±0.075	0.145±0.099
WACV	main	0.255±0.129	0.127±0.135	0.236±0.216	0.200±0.126	0.273±0.205	0.164±0.121
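
To compare models on a single venue, the cells in the table above can be parsed and the lowest mean NMAE picked out. The following is a minimal Python sketch, not part of CSPaper's tooling: the helper names `parse_cell` and `lowest_nmae` are hypothetical, and the two example rows are copied from the ICLR main and TMLR regular rows. Because the ± spreads are often as large as the means, the lowest mean is only a rough guide to which model fits a venue best.

```python
MISSING = "—"  # the table's marker for a model that was not benchmarked


def parse_cell(cell):
    """Turn '0.115±0.129' into (0.115, 0.129); return None for a missing cell."""
    cell = cell.strip()
    if cell == MISSING:
        return None
    mean_text, std_text = cell.split("±")
    return float(mean_text), float(std_text)


def lowest_nmae(models, cells):
    """Return (model, mean, std) for the lowest mean NMAE, skipping missing cells."""
    scored = []
    for model, cell in zip(models, cells):
        parsed = parse_cell(cell)
        if parsed is not None:
            scored.append((model, *parsed))
    if not scored:
        raise ValueError("no benchmarked model in this row")
    # On ties, min() keeps the first model in column order.
    return min(scored, key=lambda item: item[1])


MODELS = ["GPT-5.4", "Gemini-3.1-pro", "Gemini-2.5-pro",
          "GPT-5", "GPT-5.1", "Gemini-3-flash"]
ICLR_MAIN = ["0.115±0.129", "0.141±0.117", "0.108±0.107",
             "0.119±0.108", "0.104±0.093", "0.148±0.130"]
TMLR_REGULAR = ["0.182±0.141", "0.250±0.224", "—", "—", "0.136±0.104", "—"]

print(lowest_nmae(MODELS, ICLR_MAIN))     # ('GPT-5.1', 0.104, 0.093)
print(lowest_nmae(MODELS, TMLR_REGULAR))  # ('GPT-5.1', 0.136, 0.104)
```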

Understanding NMAE

Interpreting the numbers

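Lower NMAE = better alignment. Because errors are normalized by the venue's rating range, an NMAE of 0.2 means the predicted rating misses the ground truth by 20% of that range on average. Top-performing models stay below 0.2.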
The ± values show standard deviation, reflecting model stability across papers.

How we calculate it

$$\text{NMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|S_{\text{gt}}^{(i)} - S_{\text{pred}}^{(i)}|}{S_{\text{max}} - S_{\text{min}}}$$

$N$ : the total number of benchmark papers included in the evaluation.

$S_{\text{gt}}^{(i)}$ : ground-truth overall rating of the $i$-th benchmark paper.

$S_{\text{pred}}^{(i)}$ : predicted overall rating of the $i$-th benchmark paper, produced by our CSPR agent.

$S_{\text{max}}$ and $S_{\text{min}}$ : venue-specific upper and lower bounds of the rating scale.
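
The definitions above translate into a few lines of Python. This is an illustrative sketch rather than CSPaper's actual implementation: the function names are hypothetical, the ratings and the 1 to 10 scale are made up for the example, and the population standard deviation is an assumption, since the page does not say which variant the ± values use.

```python
from statistics import mean, pstdev


def normalized_abs_errors(gt, pred, s_min, s_max):
    """Per-paper |S_gt - S_pred| / (S_max - S_min)."""
    if len(gt) != len(pred):
        raise ValueError("gt and pred must have the same length")
    if not gt:
        raise ValueError("at least one paper is required")
    span = s_max - s_min
    if span <= 0:
        raise ValueError("s_max must be greater than s_min")
    return [abs(g - p) / span for g, p in zip(gt, pred)]


def nmae(gt, pred, s_min, s_max):
    """Return (NMAE, spread): mean and standard deviation of the per-paper errors.

    Assumption: the spread is the population standard deviation (pstdev).
    """
    errors = normalized_abs_errors(gt, pred, s_min, s_max)
    return mean(errors), pstdev(errors)


# Made-up ratings on a hypothetical 1-10 scale (N = 4 papers).
ground_truth = [6, 8, 3, 5]
predicted = [5, 8, 5, 6]

score, spread = nmae(ground_truth, predicted, s_min=1, s_max=10)
print(f"{score:.3f}±{spread:.3f}")  # 0.111±0.079
```

Here the absolute errors are 1, 0, 2 and 1 rating points; dividing each by the range of 9 and averaging gives 4/36 ≈ 0.111.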

Want to learn more?

Explore the research and technical foundations behind CSPaper (a part of Scholar7):

Get high-quality reviews powered by the best AI models

We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.

Start your free review
