Benchmark on LLMs and Venues

We rigorously test leading AI models to ensure you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing the latest LLMs from OpenAI (GPT) and Google (Gemini) across various top-tier computer science conferences.

Try it yourself: 3 free reviews/month

Privacy and security first

CSPaper ensures all papers remain confidential and legally protected. Content is processed securely and safeguarded under enterprise-level agreements with LLM providers — guaranteeing it is never used to train any model.

Benchmark Metrics


Normalized Mean Absolute Error (NMAE) quantifies how close the predicted paper ratings are to the established ground-truth ratings. Lower NMAE values indicate better alignment between predicted and true review ratings.

| Venue | Track | Gemini-2.5-pro | GPT-5.4 | Gemini-3.1-pro | GPT-5 | Gemini-3-flash | GPT-5.1 | Gemini-2.5-flash | GPT-4.1 | GPT-5.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| ESWA | main | 0.263±0.221 | 0.200±0.215 | 0.156±0.145 | 0.200±0.215 | 0.188±0.189 | 0.125±0.115 | 0.219±0.197 | 0.188±0.133 | 0.200±0.215 |
| FORGE | intelligence | 0.041±0.055 | 0.111±0.143 | 0.116±0.117 | 0.066±0.139 | 0.129±0.119 | 0.066±0.092 | 0.341±0.305 | 0.074±0.087 | 0.086±0.143 |
| MLJ | regular | 0.168±0.134 | 0.242±0.108 | 0.193±0.123 | 0.217±0.165 | 0.177±0.172 | 0.217±0.152 | 0.193±0.123 | 0.247±0.137 | 0.217±0.189 |
| Neurocomputing | main | 0.185±0.150 | 0.275±0.124 | 0.190±0.138 | 0.200±0.142 | 0.180±0.169 | 0.300±0.206 | 0.185±0.172 | 0.155±0.155 | 0.260±0.219 |
| TMLR | regular | 0.168±0.134 | 0.242±0.108 | 0.193±0.123 | 0.217±0.165 | 0.177±0.172 | 0.217±0.152 | 0.193±0.123 | 0.247±0.137 | 0.217±0.189 |
| AAAI | main technical | 0.097±0.111 | 0.164±0.131 | 0.110±0.102 | 0.111±0.113 | 0.131±0.107 | 0.145±0.129 | 0.105±0.104 | 0.112±0.108 | 0.186±0.154 |
| AAAI | safe and robust AI | 0.214±0.249 | 0.214±0.227 | 0.230±0.245 | 0.159±0.207 | 0.175±0.274 | 0.238±0.241 | 0.151±0.204 | 0.167±0.185 | 0.286±0.211 |
| AAMAS | main technical | 0.079±0.098 | 0.134±0.105 | 0.099±0.105 | 0.210±0.168 | 0.106±0.086 | 0.183±0.196 | 0.085±0.092 | 0.127±0.106 | 0.252±0.230 |
| ACL | main | 0.153±0.135 | 0.153±0.145 | 0.208±0.154 | 0.153±0.159 | 0.194±0.132 | 0.208±0.195 | 0.185±0.176 | 0.162±0.115 | 0.199±0.188 |
| AISTATS | main | 0.212±0.171 | 0.266±0.261 | 0.176±0.185 | 0.231±0.178 | 0.179±0.160 | 0.231±0.178 | 0.251±0.176 | 0.223±0.189 | 0.302±0.256 |
| CoG | technical and vision | 0.163±0.227 | 0.100±0.124 | 0.137±0.174 | 0.155±0.134 | 0.211±0.220 | 0.107±0.128 | 0.230±0.247 | 0.200±0.227 | 0.211±0.244 |
| CVPR | main | 0.201±0.213 | 0.242±0.216 | 0.228±0.205 | 0.154±0.181 | 0.192±0.194 | 0.249±0.229 | 0.244±0.207 | 0.196±0.195 | 0.239±0.202 |
| EMNLP | main | 0.150±0.127 | 0.117±0.167 | 0.117±0.088 | 0.108±0.104 | 0.158±0.153 | 0.183±0.114 | 0.150±0.127 | 0.150±0.118 | 0.200±0.210 |
| ICASSP | regular paper | 0.261±0.200 | 0.275±0.239 | 0.246±0.240 | 0.203±0.219 | 0.232±0.186 | 0.290±0.252 | 0.232±0.223 | 0.217±0.204 | 0.290±0.252 |
| ICC | technical symposia | 0.260±0.247 | 0.233±0.241 | 0.287±0.272 | 0.273±0.249 | 0.353±0.256 | 0.260±0.253 | 0.300±0.262 | 0.367±0.209 | 0.273±0.266 |
| ICLR | main | 0.108±0.107 | 0.115±0.129 | 0.141±0.117 | 0.119±0.108 | 0.148±0.130 | 0.104±0.093 | 0.135±0.113 | 0.108±0.091 | 0.133±0.124 |
| ICME | regular | 0.276±0.199 | 0.266±0.181 | 0.162±0.162 | 0.297±0.293 | 0.120±0.155 | 0.182±0.167 | 0.234±0.200 | 0.141±0.113 | 0.339±0.323 |
| ICML | main | 0.196±0.137 | 0.135±0.115 | 0.186±0.135 | 0.190±0.150 | 0.266±0.183 | 0.170±0.123 | 0.222±0.166 | 0.153±0.102 | 0.167±0.138 |
| ICML | position | 0.192±0.163 | 0.187±0.142 | 0.149±0.110 | 0.154±0.109 | 0.215±0.153 | 0.193±0.166 | 0.200±0.148 | 0.163±0.121 | 0.300±0.200 |
| IJCAI | main | 0.159±0.121 | 0.190±0.141 | 0.174±0.143 | 0.198±0.139 | 0.095±0.130 | 0.238±0.162 | 0.214±0.141 | 0.119±0.192 | 0.182±0.142 |
| IJCAI | survey | 0.094±0.129 | 0.062±0.116 | 0.188±0.222 | 0.188±0.222 | 0.188±0.222 | 0.188±0.291 | 0.219±0.160 | 0.219±0.160 | 0.125±0.189 |
| KDD | datasets and benchmarks | 0.214±0.228 | 0.078±0.050 | 0.217±0.205 | 0.175±0.166 | 0.256±0.208 | 0.189±0.198 | 0.186±0.209 | 0.200±0.174 | 0.231±0.188 |
| KDD | research | 0.167±0.180 | 0.231±0.174 | 0.180±0.107 | 0.218±0.158 | 0.179±0.173 | 0.205±0.182 | 0.218±0.322 | 0.205±0.169 | 0.218±0.172 |
| NeurIPS | datasets and benchmarks | 0.193±0.203 | 0.249±0.196 | 0.161±0.197 | 0.170±0.210 | 0.174±0.202 | 0.268±0.187 | 0.268±0.253 | 0.176±0.187 | 0.471±0.265 |
| NeurIPS | main | 0.183±0.163 | 0.144±0.141 | 0.185±0.177 | 0.207±0.228 | 0.193±0.158 | 0.203±0.185 | 0.210±0.177 | 0.175±0.125 | 0.213±0.188 |
| SIGIR | full paper | 0.037±0.084 | 0.163±0.205 | 0.138±0.208 | 0.113±0.171 | 0.188±0.230 | 0.237±0.266 | 0.138±0.171 | 0.188±0.230 | 0.138±0.208 |
| SIGIR | short paper | 0.296±0.288 | 0.132±0.104 | 0.154±0.192 | 0.154±0.127 | 0.154±0.192 | 0.189±0.151 | 0.168±0.200 | 0.225±0.162 | 0.275±0.126 |
| TheWebConf | research | 0.141±0.137 | 0.149±0.154 | 0.106±0.086 | 0.126±0.114 | 0.145±0.099 | 0.116±0.075 | 0.140±0.104 | 0.112±0.087 | 0.188±0.177 |
| WACV | main | 0.236±0.216 | 0.255±0.129 | 0.127±0.135 | 0.200±0.126 | 0.164±0.121 | 0.273±0.205 | 0.273±0.185 | 0.164±0.175 | 0.345±0.181 |

Understanding NMAE

Interpreting the numbers

  • Lower NMAE = better alignment. Top-performing models stay below 0.2.
  • The ± values show standard deviation, reflecting model stability across papers.

How we calculate it

$$\text{NMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|S_{\text{gt}}^{(i)} - S_{\text{pred}}^{(i)}|}{S_{\text{max}} - S_{\text{min}}}$$

  • $N$: the total number of benchmark papers included in the evaluation.
  • $S_{\text{gt}}^{(i)}$: ground-truth overall rating of the $i$-th benchmark paper.
  • $S_{\text{pred}}^{(i)}$: predicted overall rating produced by our CSPR agent.
  • $S_{\text{max}}$ and $S_{\text{min}}$: venue-specific upper and lower bounds of the rating scale.
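The definition above can be sketched in a few lines of Python. The function below is a minimal illustration of the NMAE formula; the ratings and the 1–10 scale in the example are hypothetical and not drawn from the benchmark data.

```python
def nmae(ground_truth, predicted, s_min, s_max):
    """Normalized Mean Absolute Error over paired overall ratings.

    Each absolute error is normalized by the venue's rating-scale span
    (s_max - s_min), then averaged over the N benchmark papers.
    """
    if len(ground_truth) != len(predicted):
        raise ValueError("rating lists must have equal length")
    span = s_max - s_min
    n = len(ground_truth)
    return sum(abs(g - p) for g, p in zip(ground_truth, predicted)) / (n * span)

# Hypothetical example: four papers at a venue with a 1-10 rating scale.
gt = [6, 8, 3, 5]
pred = [5, 8, 4, 7]
print(round(nmae(gt, pred, s_min=1, s_max=10), 3))  # total error 4 over 4 papers * span 9 -> 0.111
```

Because each error is divided by the venue's own scale span, NMAE values are comparable across venues that use different rating scales.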

Want to learn more?

Explore the research and technical foundations behind CSPaper (a part of Scholar7).

Get high-quality reviews powered by the best AI models

We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.

Start your free review