ICML 2025 Review – Most Outstanding Issues
Sources are labeled where applicable.
1. 🧾 Incomplete / Low-Quality Reviews
Several submissions received no reviews at all (Zhihu).
Some papers received only a single review despite the multi-review policy.
Some reviewers appeared to skim or misunderstand the paper.
Accusations that reviews were LLM-generated: generic, hallucinated, or overly verbose (Reddit).
2. Unjustified Low Scores
Reviews lacked substantive critique yet assigned scores of 1 or 2 without explanation.
Cases where positive commentary was followed by a low score (e.g., "Good paper" followed by a score of 2).
Reviewers pushing personal biases (e.g., “you didn’t cite my 5 papers”).
3. 🧠 Domain Mismatch
Theoretical reviewers assigned empirical papers and vice versa (Zhihu).
Reviewers struggled with areas outside their expertise, leading to factually incorrect comments.
4. Rebuttal System Frustrations
The 5,000-character per-reviewer rebuttal limit was too short to address all concerns.
Markdown formatting restrictions (e.g., no multiple boxes, limited links).
Reviewers acknowledged the rebuttal but did not adjust their scores.
Authors felt rebuttal phase was performative rather than impactful.
5. 🪵 Bureaucratic Review Process
Reviewers forced to fill out many structured fields: "claims & evidence", "broader impact", etc.
Complaint: “Too much form-filling, not enough science” (Zhihu).
6. Noisy and Arbitrary Scoring
Extreme score variance within a single paper (e.g., 1/3/5).
Scores did not align with the contents of the reviews or with outcomes of comparable papers.
Unclear thresholds and lack of transparency in AC decision-making.
7. Suspected LLM Reviews (Reddit-specific)
Reviewers suspected of using LLMs to generate long, vague reviews.
Multiple users ran reviews through tools like GPTZero / DeepSeek and got LLM flags.
8. Burnout and Overload
Reviewers were overloaded with five papers each, many outside their comfort zone.
No option to reduce load, leading to surface-level reviews.
Authors and reviewers both expressed mental exhaustion.
9. Review Mismatch with Paper Goals
Reviewers requested experiments outside the paper's scope or compute budget (e.g., running LLM baselines).
Demands for comparisons against outdated or irrelevant benchmarks.
10. Lack of Accountability / Transparency
Authors wished for reviewer identities to be disclosed after the discussion phase to encourage accountability.
Inconsistent handling of rebuttal responses across different ACs and tracks.