Can We Trust Peer Reviews? A Look at Substantiation in AI/ML Conferences
Hi everyone,
I recently read a thought-provoking paper from EMNLP 2023 titled "Automatic Analysis of Substantiation in Scientific Peer Reviews" by Guo et al., and I think it's worth bringing into our community for discussion.
The Problem: Peer Review Quality is Declining
If you've submitted to AI/ML conferences lately, you might have received reviews that feel vague, generic, or just unhelpful. You're not alone. The paper highlights a concerning trend: the level of substantiation in reviews (how well claims are supported by evidence) has been declining in major NLP conferences over the past few years.
This is likely due to the exploding number of submissions and a shortage of expert reviewers. Combine that with tight deadlines and unclear review guidelines, and you get a perfect storm for poor reviewing practices.
"The proportion of supported claims in reviews dropped steadily from CoNLL 2016 to ARR 2022."
(Guo et al., 2023, EMNLP Findings)
The Proposed Solution: Argument Mining for Review Analysis
To address this, the authors developed a novel argument mining system that automatically extracts claim-evidence pairs from peer reviews. They even created a dataset called SubstanReview, with 550 annotated reviews, and introduced a metric called SubstanScore, a quantifiable way to measure review quality based on substantiation.
Highlights:
Defines a new NLP task: claim-evidence pair extraction in peer reviews
Introduces SubstanScore: percentage of claims backed by evidence × review length
Benchmarks fine-tuned transformers like RoBERTa and SpanBERT
Shows ChatGPT underperforms on this task even with detailed prompts
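To make the metric concrete, here is a minimal sketch of how SubstanScore could be computed from the description above ("percentage of claims backed by evidence × review length"). The function name, inputs, and length convention are my assumptions for illustration, not the authors' actual implementation:

```python
def substan_score(claims_supported: int, claims_total: int, review_length: int) -> float:
    """Hypothetical sketch of SubstanScore.

    Multiplies the fraction of claims backed by evidence by the
    review's length (here assumed to be a word count). This follows
    the paper's one-line description, not the released code.
    """
    if claims_total == 0:
        # A review with no claims gets no credit for substantiation.
        return 0.0
    return (claims_supported / claims_total) * review_length

# Example: a 120-word review in which 3 of 4 claims are supported.
print(substan_score(3, 4, 120))  # 0.75 * 120 = 90.0
```

Note the length term: under this reading, a long review with the same support ratio scores higher than a short one, so the metric rewards substantive, detailed reviews rather than terse ones.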
This is not just a cool NLP task; it has real implications for how our scientific community maintains quality and trust.
Questions for the Community
- Should conference chairs integrate automated quality checks (like substantiation analysis) into review processes?
- Would you support a "review scorecard" that flags unsubstantiated or low-quality reviews?
- How can we balance automation with fairness, given that substantiation doesn't capture other dimensions like factuality or expertise?
- Is it time we start training reviewers explicitly, using tools like SubstanReview as part of reviewer onboarding?
Why This Matters
AI/ML conferences are the heartbeat of our field, and if the gatekeeping mechanism (i.e., peer review) starts to wobble, everything else follows. This paper brings both diagnosis and prescription, and while it's not a silver bullet, it may just be the kind of infrastructure we need to restore faith in the system.
Paper link: Automatic Analysis of Substantiation in Scientific Peer Reviews (EMNLP 2023)
Dataset & Code: SubstanReview GitHub Repo