Benchmarks Across LLMs and Venues
We rigorously test leading AI models to ensure you get the most accurate, helpful reviews for your research papers. Below you'll find detailed performance metrics comparing the latest LLMs from OpenAI (GPT) and Google (Gemini) across top-tier computer science conferences and journals.
Benchmark Metrics
Normalized Mean Absolute Error (NMAE) quantifies how close the predicted paper ratings are to the established ground-truth ratings. Lower NMAE values indicate better alignment between predicted and true review ratings.
| Venue (track) | Gemini-2.5-pro | GPT-5.4 | Gemini-3.1-pro | GPT-5 | Gemini-3-flash | GPT-5.1 | Gemini-2.5-flash | GPT-4.1 | GPT-5.2 |
|---|---|---|---|---|---|---|---|---|---|
| ESWA (main) | 0.263±0.221 | 0.200±0.215 | 0.156±0.145 | 0.200±0.215 | 0.188±0.189 | 0.125±0.115 | 0.219±0.197 | 0.188±0.133 | 0.200±0.215 |
| FORGE (intelligence) | 0.041±0.055 | 0.111±0.143 | 0.116±0.117 | 0.066±0.139 | 0.129±0.119 | 0.066±0.092 | 0.341±0.305 | 0.074±0.087 | 0.086±0.143 |
| MLJ (regular) | 0.168±0.134 | 0.242±0.108 | 0.193±0.123 | 0.217±0.165 | 0.177±0.172 | 0.217±0.152 | 0.193±0.123 | 0.247±0.137 | 0.217±0.189 |
| Neurocomputing (main) | 0.185±0.150 | 0.275±0.124 | 0.190±0.138 | 0.200±0.142 | 0.180±0.169 | 0.300±0.206 | 0.185±0.172 | 0.155±0.155 | 0.260±0.219 |
| TMLR (regular) | 0.168±0.134 | 0.242±0.108 | 0.193±0.123 | 0.217±0.165 | 0.177±0.172 | 0.217±0.152 | 0.193±0.123 | 0.247±0.137 | 0.217±0.189 |
| AAAI (main technical) | 0.097±0.111 | 0.164±0.131 | 0.110±0.102 | 0.111±0.113 | 0.131±0.107 | 0.145±0.129 | 0.105±0.104 | 0.112±0.108 | 0.186±0.154 |
| AAAI (safe and robust AI) | 0.214±0.249 | 0.214±0.227 | 0.230±0.245 | 0.159±0.207 | 0.175±0.274 | 0.238±0.241 | 0.151±0.204 | 0.167±0.185 | 0.286±0.211 |
| AAMAS (main technical) | 0.079±0.098 | 0.134±0.105 | 0.099±0.105 | 0.210±0.168 | 0.106±0.086 | 0.183±0.196 | 0.085±0.092 | 0.127±0.106 | 0.252±0.230 |
| ACL (main) | 0.153±0.135 | 0.153±0.145 | 0.208±0.154 | 0.153±0.159 | 0.194±0.132 | 0.208±0.195 | 0.185±0.176 | 0.162±0.115 | 0.199±0.188 |
| AISTATS (main) | 0.212±0.171 | 0.266±0.261 | 0.176±0.185 | 0.231±0.178 | 0.179±0.160 | 0.231±0.178 | 0.251±0.176 | 0.223±0.189 | 0.302±0.256 |
| CoG (technical and vision) | 0.163±0.227 | 0.100±0.124 | 0.137±0.174 | 0.155±0.134 | 0.211±0.220 | 0.107±0.128 | 0.230±0.247 | 0.200±0.227 | 0.211±0.244 |
| CVPR (main) | 0.201±0.213 | 0.242±0.216 | 0.228±0.205 | 0.154±0.181 | 0.192±0.194 | 0.249±0.229 | 0.244±0.207 | 0.196±0.195 | 0.239±0.202 |
| EMNLP (main) | 0.150±0.127 | 0.117±0.167 | 0.117±0.088 | 0.108±0.104 | 0.158±0.153 | 0.183±0.114 | 0.150±0.127 | 0.150±0.118 | 0.200±0.210 |
| ICASSP (regular paper) | 0.261±0.200 | 0.275±0.239 | 0.246±0.240 | 0.203±0.219 | 0.232±0.186 | 0.290±0.252 | 0.232±0.223 | 0.217±0.204 | 0.290±0.252 |
| ICC (technical symposia) | 0.260±0.247 | 0.233±0.241 | 0.287±0.272 | 0.273±0.249 | 0.353±0.256 | 0.260±0.253 | 0.300±0.262 | 0.367±0.209 | 0.273±0.266 |
| ICLR (main) | 0.108±0.107 | 0.115±0.129 | 0.141±0.117 | 0.119±0.108 | 0.148±0.130 | 0.104±0.093 | 0.135±0.113 | 0.108±0.091 | 0.133±0.124 |
| ICME (regular) | 0.276±0.199 | 0.266±0.181 | 0.162±0.162 | 0.297±0.293 | 0.120±0.155 | 0.182±0.167 | 0.234±0.200 | 0.141±0.113 | 0.339±0.323 |
| ICML (main) | 0.196±0.137 | 0.135±0.115 | 0.186±0.135 | 0.190±0.150 | 0.266±0.183 | 0.170±0.123 | 0.222±0.166 | 0.153±0.102 | 0.167±0.138 |
| ICML (position) | 0.192±0.163 | 0.187±0.142 | 0.149±0.110 | 0.154±0.109 | 0.215±0.153 | 0.193±0.166 | 0.200±0.148 | 0.163±0.121 | 0.300±0.200 |
| IJCAI (main) | 0.159±0.121 | 0.190±0.141 | 0.174±0.143 | 0.198±0.139 | 0.095±0.130 | 0.238±0.162 | 0.214±0.141 | 0.119±0.192 | 0.182±0.142 |
| IJCAI (survey) | 0.094±0.129 | 0.062±0.116 | 0.188±0.222 | 0.188±0.222 | 0.188±0.222 | 0.188±0.291 | 0.219±0.160 | 0.219±0.160 | 0.125±0.189 |
| KDD (datasets and benchmarks) | 0.214±0.228 | 0.078±0.050 | 0.217±0.205 | 0.175±0.166 | 0.256±0.208 | 0.189±0.198 | 0.186±0.209 | 0.200±0.174 | 0.231±0.188 |
| KDD (research) | 0.167±0.180 | 0.231±0.174 | 0.180±0.107 | 0.218±0.158 | 0.179±0.173 | 0.205±0.182 | 0.218±0.322 | 0.205±0.169 | 0.218±0.172 |
| NeurIPS (datasets and benchmarks) | 0.193±0.203 | 0.249±0.196 | 0.161±0.197 | 0.170±0.210 | 0.174±0.202 | 0.268±0.187 | 0.268±0.253 | 0.176±0.187 | 0.471±0.265 |
| NeurIPS (main) | 0.183±0.163 | 0.144±0.141 | 0.185±0.177 | 0.207±0.228 | 0.193±0.158 | 0.203±0.185 | 0.210±0.177 | 0.175±0.125 | 0.213±0.188 |
| SIGIR (full paper) | 0.037±0.084 | 0.163±0.205 | 0.138±0.208 | 0.113±0.171 | 0.188±0.230 | 0.237±0.266 | 0.138±0.171 | 0.188±0.230 | 0.138±0.208 |
| SIGIR (short paper) | 0.296±0.288 | 0.132±0.104 | 0.154±0.192 | 0.154±0.127 | 0.154±0.192 | 0.189±0.151 | 0.168±0.200 | 0.225±0.162 | 0.275±0.126 |
| TheWebConf (research) | 0.141±0.137 | 0.149±0.154 | 0.106±0.086 | 0.126±0.114 | 0.145±0.099 | 0.116±0.075 | 0.140±0.104 | 0.112±0.087 | 0.188±0.177 |
| WACV (main) | 0.236±0.216 | 0.255±0.129 | 0.127±0.135 | 0.200±0.126 | 0.164±0.121 | 0.273±0.205 | 0.273±0.185 | 0.164±0.175 | 0.345±0.181 |
Understanding NMAE
Interpreting the numbers
- Lower NMAE = better alignment; top-performing models stay below 0.2 on most venues (the snippet after this list shows how to compare models within a single row).
- The ± values show the standard deviation across papers, reflecting how stable each model's ratings are.
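To make the first point concrete, here is a tiny Python sketch of how you might read the best model off one row of the table above. The `scores` values are copied from the ICLR (main) row; the parsing helper is ours and simply assumes the `mean±std` cell format used throughout the table.

```python
# Pick the lowest-mean-NMAE model for one venue, given "mean±std" table cells.
scores = {  # illustrative subset of the ICLR (main) row above
    "Gemini-2.5-pro": "0.108±0.107",
    "GPT-5.1": "0.104±0.093",
    "GPT-4.1": "0.108±0.091",
}

def mean_of(cell: str) -> float:
    """Extract the mean from a 'mean±std' cell, dropping the spread."""
    return float(cell.split("±")[0])

best = min(scores, key=lambda model: mean_of(scores[model]))
print(best, scores[best])  # -> GPT-5.1 0.104±0.093
```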
How we calculate it
$$\mathrm{NMAE} \;=\; \frac{1}{N} \sum_{i=1}^{N} \frac{\left|\hat{y}_i - y_i\right|}{r_{\max} - r_{\min}}$$

where:
- $N$: the total number of benchmark papers included in the evaluation.
- $y_i$: the ground-truth overall rating of the $i$-th benchmark paper.
- $\hat{y}_i$: the predicted overall rating produced by our CSPR agent.
- $r_{\max}$, $r_{\min}$: the venue-specific upper and lower bounds of the rating scale.
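For reference, the computation can be sketched in a few lines of Python. This is a minimal illustration under two assumptions the page does not spell out: each paper's absolute error is normalized by the venue's rating range before averaging, and the ± spread is the population standard deviation (ddof=0) of those per-paper errors.

```python
import numpy as np

def nmae(y_true, y_pred, r_min, r_max):
    """Normalized Mean Absolute Error over a venue's benchmark papers.

    y_true: ground-truth overall ratings, one per paper
    y_pred: predicted overall ratings from the review agent
    r_min, r_max: venue-specific bounds of the rating scale
    Returns (mean, std) of the per-paper normalized absolute errors.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = np.abs(y_pred - y_true) / (r_max - r_min)  # per-paper normalized error
    return errors.mean(), errors.std()  # mean is the NMAE; std is the ± spread

# Toy example on a hypothetical 1-10 rating scale (not benchmark data):
mean, std = nmae(y_true=[6, 4, 8, 5], y_pred=[5, 4, 7, 6], r_min=1, r_max=10)
print(f"{mean:.3f}±{std:.3f}")  # -> 0.083±0.048
```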
Want to learn more?
Explore the research and technical foundations behind CSPaper (a part of Scholar7):
- Preventing the Collapse of Peer Review Requires Verification-First AI
- CSPaper Review: Fast, Rubric-Faithful Conference Feedback (INLG 2025)
- Adopt Machine-Human Collaboration Peer-Review through Computational Research Assessment
- Epistemic Throughput: Fundamental Limits of Attention-Constrained Inference
Get high-quality reviews powered by the best AI models
We continuously evaluate and select the best-performing models to ensure you receive accurate, actionable feedback for your research.
Start your free review