Benchmark on LLMs and Venues

We rigorously test leading AI models to ensure you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing the latest LLMs from OpenAI (GPT) and Google (Gemini) across top-tier computer science conferences and journals.

Try it yourself: 3 free reviews/month

Privacy and security first

CSPaper ensures all papers remain confidential and legally protected. Content is processed securely and safeguarded under enterprise-level agreements with LLM providers — guaranteeing it is never used to train any model.

Benchmark Metrics


Normalized Mean Absolute Error (NMAE) quantifies how close the predicted paper ratings are to the established ground-truth ratings. Lower NMAE values indicate better alignment between predicted and true review ratings.

NMAE (mean ± standard deviation) by venue, track, and model:

| Venue | Track | OpenAI GPT-5.4 | Gemini-3.1-pro | Gemini-2.5-pro | OpenAI GPT-5 | OpenAI GPT-5.1 | Gemini-3-flash |
|---|---|---|---|---|---|---|---|
| ESWA | main | 0.200±0.215 | 0.156±0.145 | 0.263±0.221 | 0.200±0.215 | 0.125±0.115 | 0.188±0.189 |
| FORGE | intelligence | 0.111±0.143 | 0.116±0.117 | 0.041±0.055 | 0.066±0.139 | 0.066±0.092 | 0.129±0.119 |
| JBD | regular | 0.214±0.157 | 0.214±0.187 | 0.143±0.183 | 0.250±0.191 | 0.214±0.213 | 0.214±0.213 |
| MLJ | regular | 0.242±0.108 | 0.193±0.123 | 0.168±0.134 | 0.217±0.165 | 0.217±0.152 | 0.177±0.172 |
| Neurocomputing | main | 0.275±0.124 | 0.190±0.138 | 0.185±0.150 | 0.200±0.142 | 0.300±0.206 | 0.180±0.169 |
| SciRep | regular | 0.150±0.137 | 0.100±0.137 | 0.150±0.137 | 0.200±0.112 | 0.150±0.137 | 0.300±0.209 |
| AAAI | main technical | 0.164±0.131 | 0.110±0.102 | 0.097±0.111 | 0.111±0.113 | 0.145±0.129 | 0.131±0.107 |
| AAAI | safe and robust AI | 0.214±0.227 | 0.230±0.245 | 0.214±0.249 | 0.159±0.207 | 0.238±0.241 | 0.175±0.274 |
| AAMAS | main technical | 0.134±0.105 | 0.099±0.105 | 0.079±0.098 | 0.210±0.168 | 0.183±0.196 | 0.106±0.086 |
| ACL | main | 0.154±0.144 | 0.213±0.147 | 0.162±0.131 | 0.150±0.150 | 0.205±0.186 | 0.189±0.129 |
| AISTATS | main | 0.266±0.261 | 0.176±0.185 | 0.212±0.171 | 0.231±0.178 | 0.231±0.178 | 0.179±0.160 |
| CoG | technical and vision | 0.100±0.124 | 0.137±0.174 | 0.163±0.227 | 0.155±0.134 | 0.107±0.128 | 0.211±0.220 |
| CVPR | main | 0.242±0.216 | 0.228±0.205 | 0.201±0.213 | 0.154±0.181 | 0.249±0.229 | 0.192±0.194 |
| ECML-PKDD | research | 0.102±0.180 | 0.188±0.191 | 0.141±0.046 | 0.169±0.158 | 0.141±0.162 | 0.231±0.149 |
| EMNLP | main | 0.117±0.167 | 0.117±0.088 | 0.150±0.127 | 0.108±0.104 | 0.183±0.114 | 0.158±0.153 |
| ICASSP | regular paper | 0.275±0.239 | 0.246±0.240 | 0.261±0.200 | 0.203±0.219 | 0.290±0.252 | 0.232±0.186 |
| ICC | technical symposia | 0.233±0.241 | 0.287±0.272 | 0.260±0.247 | 0.273±0.249 | 0.260±0.253 | 0.353±0.256 |
| ICLR | main | 0.115±0.129 | 0.141±0.117 | 0.108±0.107 | 0.119±0.108 | 0.104±0.093 | 0.148±0.130 |
| ICME | regular | 0.266±0.181 | 0.162±0.162 | 0.276±0.199 | 0.297±0.293 | 0.182±0.167 | 0.120±0.155 |
| ICML | main | 0.135±0.115 | 0.186±0.135 | 0.196±0.137 | 0.190±0.150 | 0.170±0.123 | 0.266±0.183 |
| IJCAI | main | 0.190±0.141 | 0.174±0.143 | 0.159±0.121 | 0.198±0.139 | 0.238±0.162 | 0.095±0.130 |
| IJCAI | survey | 0.062±0.116 | 0.188±0.222 | 0.094±0.129 | 0.188±0.222 | 0.188±0.291 | 0.188±0.222 |
| IROS | main | 0.121±0.109 | 0.125±0.065 | 0.104±0.039 | 0.093±0.093 | 0.086±0.043 | 0.143±0.080 |
| KDD | datasets and benchmarks | 0.078±0.050 | 0.217±0.205 | 0.214±0.228 | 0.175±0.166 | 0.189±0.198 | 0.256±0.208 |
| KDD | research | 0.231±0.174 | 0.180±0.107 | 0.167±0.180 | 0.218±0.158 | 0.205±0.182 | 0.179±0.173 |
| NeurIPS | datasets and benchmarks | 0.249±0.196 | 0.161±0.197 | 0.193±0.203 | 0.170±0.210 | 0.268±0.187 | 0.174±0.202 |
| NeurIPS | main | 0.144±0.141 | 0.185±0.177 | 0.183±0.163 | 0.207±0.228 | 0.203±0.185 | 0.193±0.158 |
| SIGIR | full paper | 0.163±0.205 | 0.138±0.208 | 0.037±0.084 | 0.113±0.171 | 0.237±0.266 | 0.188±0.230 |
| SIGIR | short paper | 0.132±0.104 | 0.154±0.192 | 0.296±0.288 | 0.154±0.127 | 0.189±0.151 | 0.154±0.192 |
| TheWebConf | research | 0.149±0.154 | 0.106±0.086 | 0.141±0.137 | 0.126±0.114 | 0.116±0.075 | 0.145±0.099 |
| WACV | main | 0.255±0.129 | 0.127±0.135 | 0.236±0.216 | 0.200±0.126 | 0.273±0.205 | 0.164±0.121 |

The following venues list results for only a subset of the models, and the listing does not show which models the values belong to. Values are given in their original order.

| Venue | Track | NMAE values as listed |
|---|---|---|
| IJMLC | regular | 0.221±0.094, 0.179±0.128 |
| PAA | regular | 0.121±0.108, 0.104±0.105 |
| TMLR | regular | 0.182±0.141, 0.250±0.224, 0.136±0.104 |
| TVC | regular | 0.208±0.204, 0.125±0.158 |
| FAccT | main | 0.107±0.112, 0.250±0.239 |
| GFM | all | 0.164±0.016, 0.198±0.130 |
| ICML | position | 0.169±0.136, 0.149±0.108 |

Understanding NMAE

Interpreting the numbers

  • Lower NMAE = better alignment. Top-performing models stay below 0.2.
  • The ± values show standard deviation, reflecting model stability across papers.

How we calculate it

$$\text{NMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|S_{\text{gt}}^{(i)} - S_{\text{pred}}^{(i)}|}{S_{\text{max}} - S_{\text{min}}}$$

$N$: the total number of benchmark papers included in the evaluation.

$S_{\text{gt}}^{(i)}$: ground-truth overall rating of the i-th benchmark paper.

$S_{\text{pred}}^{(i)}$: predicted overall rating produced by our CSPR agent.

$S_{\text{max}}$ and $S_{\text{min}}$: venue-specific upper and lower bounds of the rating scale.
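As a rough illustration of the formula above, the following Python sketch computes NMAE for a small set of ratings and reports it in the same mean±spread form used in the table. It is not CSPaper's actual evaluation code: the function names, the sample ratings, and the 1 to 10 rating scale are all hypothetical, and the use of the population standard deviation for the ± value is an assumption, since the page does not say which variant is used.

```python
from statistics import mean, pstdev


def normalized_abs_errors(gt, pred, s_min, s_max):
    """Per-paper |S_gt - S_pred| divided by the rating-scale width."""
    if len(gt) != len(pred) or not gt:
        raise ValueError("gt and pred must be non-empty and of equal length")
    if s_max <= s_min:
        raise ValueError("s_max must be greater than s_min")
    span = s_max - s_min
    return [abs(g - p) / span for g, p in zip(gt, pred)]


def nmae(gt, pred, s_min, s_max):
    """Return (NMAE, spread) over all benchmark papers.

    The spread is the population standard deviation of the per-paper
    normalized errors (an assumption, not confirmed by the source).
    """
    errors = normalized_abs_errors(gt, pred, s_min, s_max)
    return mean(errors), pstdev(errors)


# Hypothetical example: four papers rated on an assumed 1-10 scale.
ground_truth = [6, 8, 3, 5]
predicted = [7, 8, 5, 4]

score, spread = nmae(ground_truth, predicted, s_min=1, s_max=10)
print(f"{score:.3f}±{spread:.3f}")  # prints 0.111±0.079
```

Multiplying an NMAE value by the scale width converts it back to rating points: under the assumed 1 to 10 scale, 0.111 corresponds to an average gap of about one point between the predicted and ground-truth ratings.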

Want to learn more?

Explore the research and technical foundations behind CSPaper (a part of Scholar7).

Get high-quality reviews powered by the best AI models

We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.

Start your free review