Benchmark of LLMs across Venues

We rigorously test leading AI models to ensure you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing the latest LLMs from OpenAI (GPT) and Google (Gemini) across top-tier computer science venues, both conferences and journals.

Try it yourself: 3 free reviews/month

Privacy and security first

CSPaper ensures all papers remain confidential and legally protected. Content is processed securely and safeguarded under enterprise-level agreements with LLM providers — guaranteeing it is never used to train any model.

Benchmark Metrics

Normalized Mean Absolute Error (NMAE) quantifies how close the predicted paper ratings are to the established ground-truth ratings. Lower NMAE values indicate better alignment between predicted and true review ratings.
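
For illustration, on a hypothetical 1–10 rating scale, a predicted rating of 6 against a ground-truth rating of 7.5 contributes |7.5 − 6| / (10 − 1) ≈ 0.167 to the average.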

| Venue | Track | GPT-5.4 | Gemini-2.5-pro | Gemini-3.1-pro | GPT-5 | Gemini-3-flash | GPT-5.1 | GPT-5.2 | Gemini-2.5-flash | GPT-4.1 |
|---|---|---|---|---|---|---|---|---|---|---|
| ESWA | main | 0.200±0.215 | 0.263±0.221 | 0.156±0.145 | 0.200±0.215 | 0.188±0.189 | 0.125±0.115 | 0.200±0.215 | 0.219±0.197 | 0.188±0.133 |
| FORGE | intelligence | 0.111±0.143 | 0.041±0.055 | 0.116±0.117 | 0.066±0.139 | 0.129±0.119 | 0.066±0.092 | 0.086±0.143 | 0.341±0.305 | 0.074±0.087 |
| JBD | regular | 0.214±0.157 | 0.143±0.183 | 0.214±0.187 | 0.250±0.191 | 0.214±0.213 | 0.214±0.213 | 0.179±0.227 | 0.179±0.202 | 0.357±0.210 |
| MLJ | regular | 0.242±0.108 | 0.168±0.134 | 0.193±0.123 | 0.217±0.165 | 0.177±0.172 | 0.217±0.152 | 0.217±0.189 | 0.193±0.123 | 0.247±0.137 |
| Neurocomputing | main | 0.275±0.124 | 0.185±0.150 | 0.190±0.138 | 0.200±0.142 | 0.180±0.169 | 0.300±0.206 | 0.260±0.219 | 0.185±0.172 | 0.155±0.155 |
| SciRep | regular | 0.150±0.137 | 0.150±0.137 | 0.100±0.137 | 0.200±0.112 | 0.300±0.209 | 0.150±0.137 | 0.100±0.137 | 0.150±0.137 | 0.400±0.335 |
| TMLR | regular | 0.182±0.141 | 0.250±0.224 | 0.136±0.104 | | | | | | |
| AAAI | main technical | 0.164±0.131 | 0.097±0.111 | 0.110±0.102 | 0.111±0.113 | 0.131±0.107 | 0.145±0.129 | 0.186±0.154 | 0.105±0.104 | 0.112±0.108 |
| AAAI | safe and robust AI | 0.214±0.227 | 0.214±0.249 | 0.230±0.245 | 0.159±0.207 | 0.175±0.274 | 0.238±0.241 | 0.286±0.211 | 0.151±0.204 | 0.167±0.185 |
| AAMAS | main technical | 0.134±0.105 | 0.079±0.098 | 0.099±0.105 | 0.210±0.168 | 0.106±0.086 | 0.183±0.196 | 0.252±0.230 | 0.085±0.092 | 0.127±0.106 |
| ACL | main | 0.154±0.144 | 0.162±0.131 | 0.213±0.147 | 0.150±0.150 | 0.189±0.129 | 0.205±0.186 | 0.201±0.184 | 0.193±0.167 | 0.158±0.111 |
| AISTATS | main | 0.266±0.261 | 0.212±0.171 | 0.176±0.185 | 0.231±0.178 | 0.179±0.160 | 0.231±0.178 | 0.302±0.256 | 0.251±0.176 | 0.223±0.189 |
| CoG | technical and vision | 0.100±0.124 | 0.163±0.227 | 0.137±0.174 | 0.155±0.134 | 0.211±0.220 | 0.107±0.128 | 0.211±0.244 | 0.230±0.247 | 0.200±0.227 |
| CVPR | main | 0.242±0.216 | 0.201±0.213 | 0.228±0.205 | 0.154±0.181 | 0.192±0.194 | 0.249±0.229 | 0.239±0.202 | 0.244±0.207 | 0.196±0.195 |
| ECML-PKDD | research | 0.102±0.180 | 0.141±0.046 | 0.188±0.191 | 0.169±0.158 | 0.231±0.149 | 0.141±0.162 | 0.164±0.193 | 0.212±0.148 | 0.164±0.200 |
| EMNLP | main | 0.117±0.167 | 0.150±0.127 | 0.117±0.088 | 0.108±0.104 | 0.158±0.153 | 0.183±0.114 | 0.200±0.210 | 0.150±0.127 | 0.150±0.118 |
| FAccT | main | 0.107±0.112 | 0.250±0.239 | | | | | | | |
| GFM | all | 0.164±0.016 | 0.198±0.130 | | | | | | | |
| ICASSP | regular paper | 0.275±0.239 | 0.261±0.200 | 0.246±0.240 | 0.203±0.219 | 0.232±0.186 | 0.290±0.252 | 0.290±0.252 | 0.232±0.223 | 0.217±0.204 |
| ICC | technical symposia | 0.233±0.241 | 0.260±0.247 | 0.287±0.272 | 0.273±0.249 | 0.353±0.256 | 0.260±0.253 | 0.273±0.266 | 0.300±0.262 | 0.367±0.209 |
| ICLR | main | 0.103±0.109 | 0.108±0.107 | 0.141±0.117 | 0.119±0.108 | 0.148±0.130 | 0.104±0.093 | 0.133±0.124 | 0.135±0.113 | 0.108±0.091 |
| ICME | regular | 0.266±0.181 | 0.276±0.199 | 0.162±0.162 | 0.297±0.293 | 0.120±0.155 | 0.182±0.167 | 0.339±0.323 | 0.234±0.200 | 0.141±0.113 |
| ICML | main | 0.135±0.115 | 0.196±0.137 | 0.186±0.135 | 0.190±0.150 | 0.266±0.183 | 0.170±0.123 | 0.167±0.138 | 0.222±0.166 | 0.153±0.102 |
| ICML | position | 0.169±0.136 | 0.149±0.108 | | | | | | | |
| IJCAI | main | 0.190±0.141 | 0.159±0.121 | 0.174±0.143 | 0.198±0.139 | 0.095±0.130 | 0.238±0.162 | 0.182±0.142 | 0.214±0.141 | 0.119±0.192 |
| IJCAI | survey | 0.062±0.116 | 0.094±0.129 | 0.188±0.222 | 0.188±0.222 | 0.188±0.222 | 0.188±0.291 | 0.125±0.189 | 0.219±0.160 | 0.219±0.160 |
| IROS | main | 0.121±0.109 | 0.104±0.039 | 0.125±0.065 | 0.093±0.093 | 0.143±0.080 | 0.086±0.043 | 0.057±0.049 | 0.107±0.035 | 0.211±0.188 |
| KDD | datasets and benchmarks | 0.078±0.050 | 0.214±0.228 | 0.217±0.205 | 0.175±0.166 | 0.256±0.208 | 0.189±0.198 | 0.231±0.188 | 0.186±0.209 | 0.200±0.174 |
| KDD | research | 0.231±0.174 | 0.167±0.180 | 0.180±0.107 | 0.218±0.158 | 0.179±0.173 | 0.205±0.182 | 0.218±0.172 | 0.218±0.322 | 0.205±0.169 |
| NeurIPS | datasets and benchmarks | 0.249±0.196 | 0.193±0.203 | 0.161±0.197 | 0.170±0.210 | 0.174±0.202 | 0.268±0.187 | 0.471±0.265 | 0.268±0.253 | 0.176±0.187 |
| NeurIPS | main | 0.144±0.141 | 0.183±0.163 | 0.185±0.177 | 0.207±0.228 | 0.193±0.158 | 0.203±0.185 | 0.213±0.188 | 0.210±0.177 | 0.175±0.125 |
| SIGIR | full paper | 0.163±0.205 | 0.037±0.084 | 0.138±0.208 | 0.113±0.171 | 0.188±0.230 | 0.237±0.266 | 0.138±0.208 | 0.138±0.171 | 0.188±0.230 |
| SIGIR | short paper | 0.132±0.104 | 0.296±0.288 | 0.154±0.192 | 0.154±0.127 | 0.154±0.192 | 0.189±0.151 | 0.275±0.126 | 0.168±0.200 | 0.225±0.162 |
| TheWebConf | research | 0.149±0.154 | 0.141±0.137 | 0.106±0.086 | 0.126±0.114 | 0.145±0.099 | 0.116±0.075 | 0.188±0.177 | 0.140±0.104 | 0.112±0.087 |
| WACV | main | 0.255±0.129 | 0.236±0.216 | 0.127±0.135 | 0.200±0.126 | 0.164±0.121 | 0.273±0.205 | 0.345±0.181 | 0.273±0.185 | 0.164±0.175 |

Understanding NMAE

Interpreting the numbers

  • Lower NMAE = better alignment. Top-performing models stay below 0.2.
  • The ± values show the standard deviation of the per-paper errors, reflecting model stability across papers.
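
For example, the 0.103±0.109 cell for ICLR (main track, GPT-5.4) means that model's predicted ratings deviate from the ground truth by about 10% of the rating scale on average, with a per-paper spread of roughly the same magnitude.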

How we calculate it

$$
\text{NMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| S_{\text{gt}}^{(i)} - S_{\text{pred}}^{(i)} \right|}{S_{\text{max}} - S_{\text{min}}}
$$

$N$: the total number of benchmark papers included in the evaluation.

$S_{\text{gt}}^{(i)}$: ground-truth overall rating of the $i$-th benchmark paper.

$S_{\text{pred}}^{(i)}$: predicted overall rating produced by our CSPR agent.

$S_{\text{max}}$ and $S_{\text{min}}$: venue-specific upper and lower bounds of the rating scale.
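
To make this concrete, here is a minimal Python sketch (not CSPaper's production pipeline; the ratings and scale bounds below are hypothetical). Assuming the ± spread in the table is the sample standard deviation of the per-paper normalized errors, one function can report both numbers:

```python
from statistics import mean, stdev

def nmae(ground_truth, predicted, s_min, s_max):
    """Normalized Mean Absolute Error over one venue's benchmark papers.

    ground_truth, predicted: per-paper overall ratings (same length).
    s_min, s_max: the venue-specific bounds of the rating scale.
    Returns (mean normalized error, std of per-paper errors), i.e. the
    two numbers behind each "NMAE±std" cell in the table above.
    """
    scale = s_max - s_min
    errors = [abs(gt - pred) / scale
              for gt, pred in zip(ground_truth, predicted)]
    return mean(errors), stdev(errors)

# Hypothetical example on a 1-10 rating scale.
gt = [7.5, 4.0, 6.0, 8.0]
pred = [6.0, 4.5, 6.5, 6.5]
score, spread = nmae(gt, pred, s_min=1, s_max=10)
print(f"{score:.3f}±{spread:.3f}")  # prints 0.111±0.064
```

The venue-specific bounds matter: the same one-point rating miss counts as 0.25 on a 1–5 scale but only about 0.11 on a 1–10 scale, which is what makes NMAE comparable across venues in a way raw MAE is not.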

Want to learn more?

Explore the research and technical foundations behind CSPaper (a part of Scholar7).

Get high-quality reviews powered by the best AI models

We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.

Start your free review