CSPaper: peer review sidekick

🔍 GPT-5 for Paper Review? CSPaper Benchmark Results May Surprise You! 🤖📊

    root wrote (last edited by root):

    [Image: gpt5-cspaper-review-benchmarking.jpg]

    With the release of GPT-5, the AI research community has been buzzing: some praise its brilliance, others lament its inconsistency. Just glance at the mixed feedback across channels, such as this sharing and discussion or this discussion on prompt engineering. Is GPT-5 really better, or does it just need the right prompt? 🤔

    At CSPaper Review, we let data do the talking.

    🧪 Benchmarking GPT-5: How Does It Perform?

    We evaluated GPT-5 on our (for now) small-scale benchmark of 100 research papers from top-tier CS conferences. Each paper has ground-truth overall review scores. Our comparison metric is the Mean Absolute Error (MAE) between each model's predicted score and the human review scores. Lower is better.
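For readers who want to see the metric spelled out, here is a minimal Python sketch of MAE. The function name and the three sample scores are illustrative assumptions, not our pipeline code or benchmark data.

```python
from statistics import mean


def mean_absolute_error(predicted, human):
    """Average absolute gap between predicted and human overall scores."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    if not predicted:
        raise ValueError("at least one paper is required")
    return mean(abs(p - h) for p, h in zip(predicted, human))


# Hypothetical scores for three papers (illustrative only, not benchmark data).
human_scores = [6.0, 4.5, 7.0]
model_scores = [5.0, 5.5, 7.0]

mae = mean_absolute_error(model_scores, human_scores)
print(f"MAE = {mae:.3f}")  # absolute gaps are 1.0, 1.0, 0.0
```

A model whose predictions sit one point away from the human score on two papers and match on the third gets an MAE of about 0.667, so a lower value means closer agreement with human reviewers.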

    Here's the result:

    [Image: results table, Screenshot 2025-08-11 at 14.47.24.png]

    Note: “DS3” refers to DeepSeek-v3. Bold = best, underlined = second-best.

    🔍 Key Observations:

    • GPT-5 achieves second-best performance in 7 out of 8 conferences, but is never the best.
    • GPT-4.1 consistently ranks among the top, being best in 6 out of 8 conferences.
    • Lightweight versions like o4-mini lag behind in most cases.
    • LLaMA3 and DeepSeek generally trail behind the GPT-family models.
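Counts like "best in 6 out of 8" come from ranking the models by MAE within each conference. A rough sketch of that tally follows; the conference names, model names, and MAE values are hypothetical placeholders, not the numbers from the results table.

```python
from collections import Counter

# Hypothetical MAE values (lower is better). NOT the benchmark's numbers.
mae_by_conference = {
    "conf_A": {"model_x": 1.10, "model_y": 1.25, "model_z": 1.60},
    "conf_B": {"model_x": 0.95, "model_y": 0.90, "model_z": 1.40},
    "conf_C": {"model_x": 1.30, "model_y": 1.45, "model_z": 1.35},
}


def tally_ranks(table):
    """Count how often each model is best or second-best by MAE."""
    best, second = Counter(), Counter()
    for scores in table.values():
        ranked = sorted(scores, key=scores.get)  # ascending MAE
        best[ranked[0]] += 1
        second[ranked[1]] += 1
    return best, second


best, second = tally_ranks(mae_by_conference)
for model in ("model_x", "model_y", "model_z"):
    print(f"{model}: best in {best[model]}, second-best in {second[model]}")
```

This sketch breaks ties by dictionary order, which is a simplification; a real tally would need an explicit tie rule.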

    So despite all the hype, GPT-5 is solid but not clearly superior, at least not without further tuning or prompting.


    But Is GPT-5 More "Verbose"?

    We also examined each model’s average output length (the number of words in the generated reviews) as a proxy for verbosity and elaboration in review generation:

    Model                  GPT-5     GPT-4.1   o3       o4-mini
    Mean length (words)    12,821    11,111    8,184    8,137

    💬 Commentary:

    • GPT-5 generates the longest reviews, by a margin of ~1,700 words over GPT-4.1 (about 15% longer), and its reviews are roughly 57% longer than those of o3 and o4-mini.
    • This suggests GPT-5 might elaborate more, which is potentially helpful in nuanced cases but also possibly more prone to redundancy or hallucination.
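The gaps can be checked directly from the mean lengths reported above. This is a small sketch of the arithmetic; the helper name `length_gap` is ours for illustration.

```python
# Mean review length in words, as reported in the table above.
mean_length = {"GPT-5": 12821, "GPT-4.1": 11111, "o3": 8184, "o4-mini": 8137}


def length_gap(longer, shorter):
    """Return (extra words, percent longer) of one model relative to another."""
    extra = mean_length[longer] - mean_length[shorter]
    return extra, 100 * extra / mean_length[shorter]


for other in ("GPT-4.1", "o3", "o4-mini"):
    extra, pct = length_gap("GPT-5", other)
    print(f"GPT-5 vs {other}: +{extra} words ({pct:.1f}% longer)")
```

The difference to GPT-4.1 is 1,710 words, and the relative gap to o3 and o4-mini comes out between 56% and 58%.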

    ✅ Conclusion: Not Switching... Yet

    Despite its capabilities, GPT-5 does not yet justify replacing the best-performing models in our pipeline. It performs reasonably well but does not significantly outperform the current setup in either accuracy or efficiency.

    That said, we’re not closing the door. If future prompting strategies or fine-tuning tricks can unlock GPT-5’s full potential on peer review tasks, we might revisit.

    Stay tuned for more benchmarks and improvements at CSPaper Review. 🧠📄

    We plan to share these results and more of our learnings in a short research paper soon. If you are interested, please stay tuned.

    © 2025 CSPaper.org Sidekick of Peer Reviews
    Debating the highs and lows of peer review in computer science.