Can We Trust Peer Reviews? A Look at Substantiation in AI/ML Conferences
Hi everyone,
I recently read a thought-provoking paper from EMNLP 2023 titled "Automatic Analysis of Substantiation in Scientific Peer Reviews" by Guo et al., and I think it's worth bringing into our community for discussion.
The Problem: Peer Review Quality is Declining
If you've submitted to AI/ML conferences lately, you might have received reviews that feel vague, generic, or just unhelpful. You're not alone. The paper highlights a concerning trend: the level of substantiation in reviews (how well claims are supported by evidence) has been declining in major NLP conferences over the past few years.
This is likely due to the exploding number of submissions and a shortage of expert reviewers. Combine that with tight deadlines and unclear review guidelines, and you get a perfect storm for poor reviewing practices.
"The proportion of supported claims in reviews dropped steadily from CoNLL 2016 to ARR 2022."
(Guo et al., 2023, EMNLP Findings)
The Proposed Solution: Argument Mining for Review Analysis
To address this, the authors developed a novel argument mining system that automatically extracts claim-evidence pairs from peer reviews. They even created a dataset called SubstanReview, with 550 annotated reviews, and introduced a metric called SubstanScore, a quantifiable way to measure review quality based on substantiation.
Highlights:
Defines a new NLP task: claim-evidence pair extraction in peer reviews
Introduces SubstanScore: percentage of claims backed by evidence × review length
Benchmarks fine-tuned transformers like RoBERTa and SpanBERT
Shows ChatGPT underperforms on this task even with detailed prompts
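To make the metric concrete, here is a minimal sketch of how SubstanScore could be computed from the description above ("percentage of claims backed by evidence × review length"). The function name, inputs, and length convention are my assumptions for illustration, not the authors' actual implementation:

```python
def substan_score(claims_supported: int, claims_total: int, review_length: int) -> float:
    """Hypothetical sketch of SubstanScore.

    Multiplies the fraction of claims backed by evidence by the
    review's length (here assumed to be a word count). This follows
    the paper's one-line description, not the released code.
    """
    if claims_total == 0:
        # A review with no claims gets no credit for substantiation.
        return 0.0
    return (claims_supported / claims_total) * review_length

# Example: a 120-word review in which 3 of 4 claims are supported.
print(substan_score(3, 4, 120))  # 0.75 * 120 = 90.0
```

Note the length term: under this reading, a long review with the same support ratio scores higher than a short one, so the metric rewards substantive, detailed reviews rather than terse ones.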
This is not just a cool NLP task; it has real implications for how our scientific community maintains quality and trust.
Questions for the Community
- Should conference chairs integrate automated quality checks (like substantiation analysis) into review processes?
- Would you support a "review scorecard" that flags unsubstantiated or low-quality reviews?
- How can we balance automation with fairness, given that substantiation doesn't capture other dimensions like factuality or expertise?
- Is it time we start training reviewers explicitly, using tools like SubstanReview as part of reviewer onboarding?
Why This Matters
AI/ML conferences are the heartbeat of our field, and if the gatekeeping mechanism (i.e., peer review) starts to wobble, everything else follows. This paper brings both diagnosis and prescription, and while it's not a silver bullet, it may just be the kind of infrastructure we need to restore faith in the system.
Paper link: Automatic Analysis of Substantiation in Scientific Peer Reviews (EMNLP 2023)
Dataset & Code: SubstanReview GitHub Repo