ICML 2025 Review Controversies Spark Academic Debate
-
wrote on 2 May 2025, 20:03
The ICML 2025 acceptance results have just been announced, marking a historic high of 12,107 valid submissions, of which 3,260 papers were accepted, an acceptance rate of 26.9%. Despite the impressive volume, numerous serious issues in the review process have emerged, sparking extensive discussion within the academic community.
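For anyone who wants to double-check the headline numbers, here is a minimal sanity check in plain Python (the two figures are the ones quoted from the announcement above):

```python
# Figures as reported in the ICML 2025 results announcement.
valid_submissions = 12_107
accepted_papers = 3_260

acceptance_rate = accepted_papers / valid_submissions
print(f"Acceptance rate: {acceptance_rate:.1%}")  # prints "Acceptance rate: 26.9%"
```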
Highlighted Issues

- Inconsistency between review scores and acceptance outcomes
Haifeng Xu, Professor at the University of Chicago, observed that review scores at ICML 2025 were oddly disconnected from acceptance outcomes. Of his four submissions, the paper with the lowest average score (2.75) was accepted as a poster, while the three papers with higher scores (3.0) were rejected.
- Positive reviews yet inexplicable rejection
A researcher from KAUST reported that his submission received uniformly positive reviews, clearly affirming its theoretical and empirical contributions, yet it was rejected without any negative feedback or explanation.
- Errors in review-score documentation
Zhiqiang Shen, Assistant Professor at MBZUAI, highlighted significant recording errors. One paper, clearly rated with two "4" scores, was mistakenly recorded in the meta-review as having "three 3's and one 4". Another paper was rejected on the basis of outdated reviewer comments, ignoring the scores that reviewers had updated during the rebuttal period.
- Unjustified rejection by Area Chair
Mengmi Zhang, Assistant Professor at NTU, experienced a perplexing case where her paper was rejected by the Area Chair despite unanimous approval from all reviewers, with no rationale provided.
- Incomplete review submissions
A doctoral student from York University reported that incomplete reviews were submitted for his paper, and the Area Chair then cited these incomplete reviews as justification for rejection.
- Zero-sum game and unfair review criteria
A reviewer from UT publicly criticized the reviewing criteria, lamenting that reviews had been overly lenient in the past. He highlighted a troubling trend: submissions not employing at least 30 trillion tokens to train 671B MoE models risk rejection regardless of their theoretical strength.
Additionally, several researchers noted reviews that appeared AI-generated or carelessly copy-pasted, resulting in contradictory feedback.
Notable Achievements
Despite these controversies, several research groups, among others, achieved remarkable outcomes:
- Duke University (Prof. Yiran Chen’s team): 5 papers accepted, including 1 spotlight poster.
- Peking University (Prof. Ming Zhang’s team): 4 papers accepted for the second consecutive year.
- UC Berkeley (Dr. Xuandong Zhao): 3 papers accepted.
Open Discussion
Given these significant reviewing issues—including reviewer negligence, procedural chaos, and immature AI-assisted review systems—how should top-tier academic conferences reform their processes to ensure fairness and enhance review quality?
We invite everyone to share their thoughts, experiences, and constructive suggestions!
-
It seems that reviewers do not have permission to view the ACs' meta-reviews or the PCs' final decisions this year. As a reviewer, I cannot see the results of the submissions I reviewed.
-
My colleague is serving as a Program Committee (PC) member for this year's ICML. According to her, some individuals were selected as reviewers solely based on having co-authored a previous ICML paper. Upon investigating the backgrounds of certain reviewers who appeared to submit problematic reviews, she discovered that many of them lacked even a bachelor's degree; for instance, some were first-year undergraduate students.
-
@cqsyf Perhaps we should prepare ourselves mentally for this to become the norm. AFAIK, NeurIPS'25 already has PhD students as ACs, and undergraduates are even more common as reviewers. This is really terrible.
-
With submissions increasing at this pace year over year, I cannot see how this manual review effort can continue to work well!
-
wrote on 8 May 2025, 21:51
This thread vividly highlights what seems to be an ironic paradox in the academic community: the more papers we submit, the less time we have left to properly review them!
Think about it: researchers are now spending countless hours crafting submissions to reach record-breaking numbers at conferences like ICML 2025. Yet this surge in submissions may be directly correlated with declining review quality. It's like we're baking thousands of cakes and then complaining that no one has time to taste them properly.
Perhaps we’re witnessing a "submission-reviewer paradox": the energy invested in authoring more papers inevitably leaves us with fewer resources for thorough and careful reviewing.
Could the solution be smarter automation, stricter reviewer qualifications, or maybe even rethinking how conferences handle volume altogether?
-
wrote on 9 May 2025, 16:56
Seriously, "first-year undergraduate students" as reviewers?!
-
wrote on 9 May 2025, 17:04
EMNLP submissions could skyrocket past 10,000 this year. The speed of this growth is astonishing and reflects just how rapidly the field is expanding. These top-tier conferences attract the best authors and are privileged to draw on the most capable reviewers. Hopefully, this won't discourage authors.
-
I put together a structured overview of all 120 oral papers accepted at ICML 2025, categorized by research topic. The summary is aimed at the CS research review community and highlights trends, innovations, and open questions in the field.
1. Foundation Models, LLMs, & Multimodal AI
- Layer by Layer: Uncovering Hidden Representations in Language Models
  Explores the structure and semantics of intermediate representations in large language models.
- Learning Dynamics in Continual Pre-Training for Large Language Models
  Studies how continual pre-training affects the learning dynamics and knowledge retention of LLMs.
- Emergent Misalignment: Narrow Finetuning can Produce Broadly Misaligned LLMs
  Shows that targeted finetuning may create unexpected, broad misalignment issues in LLMs.
- CollabLLM: From Passive Responders to Active Collaborators
  Proposes techniques for LLMs to act as proactive, context-aware collaborators.
- AffectGPT: A New Dataset, Model, and Benchmark for Emotion Understanding with Multimodal Large Language Models
  Introduces a dataset and evaluation suite for emotion understanding in multimodal LLMs.
- On Path to Multimodal Generalist: General-Level and General-Bench
  Presents a unified framework and benchmarks for developing generalist multimodal AI models.
- EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
  Proposes a suite for benchmarking multimodal models in embodied agent tasks.
- SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
  A scalable synthetic data pipeline for visual question answering with multimodal LLMs.
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
  Benchmarks multimodal reasoning skills in LLMs.
- VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
  Uses synthetic reasoning data to learn reward models for multi-domain processes.
- Sundial: A Family of Highly Capable Time Series Foundation Models
  Introduces a family of foundation models for time series data.
- Retrieval-Augmented Perception: High-resolution Image Perception Meets Visual RAG
  Combines retrieval-augmented generation with high-resolution visual perception.
- What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities
  Introduces a broad benchmark for virtual agents' key capabilities.
2. Representation Learning & Theory
- An analytic theory of creativity in convolutional diffusion models
  Develops a mechanistic, interpretable theory of creativity in diffusion models.
- Strategy Coopetition Explains the Emergence and Transience of In-Context Learning
  Theorizes about the mechanisms behind in-context learning in large models.
- Scaling Collapse Reveals Universal Dynamics in Compute-Optimally Trained Neural Networks
  Discovers universal scaling laws in optimally trained neural networks.
- Transformative or Conservative? Conservation laws for ResNets and Transformers
  Connects conservation laws to deep architectures.
- Emergence in non-neural models: grokking modular arithmetic via average gradient outer product
  Explores "grokking" in non-neural computational models.
- Learning with Expected Signatures: Theory and Applications
  Presents a new mathematical framework for sequential data representations.
- General framework for online-to-nonconvex conversion: Schedule-free SGD is also effective for nonconvex optimization
  Theoretical advances in stochastic optimization.
- Equivalence is All: A Unified View for Self-supervised Graph Learning
  Unifies self-supervised objectives in graph learning under an equivalence framework.
- Blink of an eye: a simple theory for feature localization in generative models
  Theoretical work on feature localization.
- Expected Variational Inequalities
  Introduces variational inequalities in expectation as a new analytical tool.
- Partition First, Embed Later: Laplacian-Based Feature Partitioning for Refined Embedding and Visualization of High-Dimensional Data
  Laplacian-based methods for dimensionality reduction.
- Learning Time-Varying Multi-Region Brain Communications via Scalable Markovian Gaussian Processes
  New models for dynamic brain network analysis.
3. Diffusion, Generative Models & Creativity
- VideoRoPE: What Makes for Good Video Rotary Position Embedding?
  Advances rotary position embeddings for video modeling.
- ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
  Shows how diffusion transformers develop interpretable features.
- MGD³: Mode-Guided Dataset Distillation using Diffusion Models
  Applies diffusion models to dataset distillation.
- DeFoG: Discrete Flow Matching for Graph Generation
  Flow-matching approaches for graph generative models.
- Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions
  Explores token ordering effects in diffusion-based text generation.
- Normalizing Flows are Capable Generative Models
  Revisits normalizing flows for scalable generative modeling.
- Score Matching with Missing Data
  Score-based generative models for incomplete data.
4. Optimization, Theory & Algorithms
- Algorithm Development in Neural Networks: Insights from the Streaming Parity Task
  Theoretical analysis of algorithmic problem-solving in neural nets.
- An Online Adaptive Sampling Algorithm for Stochastic Difference-of-convex Optimization with Time-varying Distributions
  Novel optimization algorithms for dynamic settings.
- Nonlinearly Preconditioned Gradient Methods under Generalized Smoothness
  Advances in preconditioned optimization.
- Fundamental Bias in Inverting Random Sampling Matrices with Application to Sub-sampled Newton
  Analyzes the bias in random matrix inversion.
- One-Step Generalization Ratio Guided Optimization for Domain Generalization
  Introduces a new optimization criterion for domain generalization.
- Polynomial-Delay MAG Listing with Novel Locally Complete Orientation Rules
  Graph-theoretic algorithms.
- An Improved Clique-Picking Algorithm for Counting Markov Equivalent DAGs via Super Cliques Transfer
  Faster algorithms for counting Markov equivalence classes.
- Near-Optimal Decision Trees in a SPLIT Second
  Develops new algorithms for fast, near-optimal decision tree learning.
- A Generalization Result for Convergence in Learning-to-Optimize
  Generalization bounds in meta-optimization.
- LoRA Training Provably Converges to a Low-Rank Global Minimum Or It Fails Loudly (But it Probably Won't Fail)
  Theoretical convergence results for LoRA.
- LoRA-One: One-Step Full Gradient Could Suffice for Fine-Tuning Large Language Models, Provably and Efficiently
  Single-step gradient fine-tuning for LLMs.
- Implicit Regularization for Tubal Tensor Factorizations via Gradient Descent
  Implicit regularization in tensor methods.
5. Reinforcement Learning, Agents & Decision Making
- Multi-agent Architecture Search via Agentic Supernet
  Automated design of multi-agent systems.
- Training a Generally Curious Agent
  Advances in curiosity-driven exploration.
- Controlling Underestimation Bias in Constrained Reinforcement Learning for Safe Exploration
  Methods for safer RL via bias correction.
- Temporal Difference Flows
  Temporal difference learning with flow-based methods.
- Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning
  Sparse networks for scalable RL.
- Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination
  Transferable cooperation in multi-agent RL.
- VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data
  Synthetic reasoning for complex RL reward modeling.
- High-Dimensional Prediction for Sequential Decision Making
  Learning for high-dimensional decision-making tasks.
6. Robustness, Safety, Privacy & Security
- Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection
  Detects AI-generated images using robust subspace methods.
- Position: Certified Robustness Does Not (Yet) Imply Model Security
  Argues there is a gap between robustness guarantees and practical security.
- Adversarial Inception Backdoor Attacks against Reinforcement Learning
  Examines vulnerabilities of RL to backdoor attacks.
- AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses
  Benchmarks for adversarial example defenses.
- Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Deployment Settings
  Provides new tools for deployment-oriented classifier evaluation.
- Auditing f-differential privacy in one run
  Practical privacy auditing for learning algorithms.
- On Differential Privacy for Adaptively Solving Search Problems via Sketching
  Differential privacy in adaptive search.
- Going Deeper into Locally Differentially Private Graph Neural Networks
  Privacy-preserving learning on graphs.
7. Causality, Generalization & Explainability
- Position: Not All Explanations for Deep Learning Phenomena Are Equally Valuable
  Calls for careful evaluation of explanation quality.
- Sanity Checking Causal Representation Learning on a Simple Real-World System
  Evaluates causal representation learning with real data.
- Statistical Test for Feature Selection Pipelines by Selective Inference
  Selective inference in feature selection.
- A Generalization Theory for Zero-Shot Prediction
  New theory for zero-shot generalization.
- Statistical Collusion by Collectives on Learning Platforms
  Examines collective manipulation in ML platforms.
- Navigating Semantic Drift in Task-Agnostic Class-Incremental Learning
  Tackles drift in continual learning scenarios.
- Generalization Result for Convergence in Learning-to-Optimize
  Generalization in meta-learning.
8. Scientific Discovery, Mathematics & Symbolic Reasoning
- LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
  Benchmarks scientific equation discovery with LLMs.
- Neural Discovery in Mathematics: Do Machines Dream of Colored Planes?
  ML for conjecturing in mathematics.
- Machine Learning meets Algebraic Combinatorics: A Suite of Datasets Capturing Research-level Conjecturing Ability in Pure Mathematics
  Datasets for symbolic mathematical discovery.
- From Weight-Based to State-Based Fine-Tuning: Further Memory Reduction on LoRA with Parallel Control
  Memory-efficient fine-tuning methods for LLMs.
9. Vision, Video, Perception & Multimodal
- ReferSplat: Referring Segmentation in 3D Gaussian Splatting
  Novel approach for referring segmentation in 3D.
- VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
  Integrates appearance and motion for video generation.
- VideoRoPE: What Makes for Good Video Rotary Position Embedding?
  (Also listed in Section 3; retained here for its relevance to video modeling.)
10. Data, Scaling Laws & Evaluation
- Improving the Scaling Laws of Synthetic Data with Deliberate Practice
  Deliberate practice for synthetic data generation.
- Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection
  Multi-model approaches for subset selection.
- Mixture of Lookup Experts
  Scalable expert mixture models.
- Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models
  Analyzes and identifies detrimental training samples.
- Inductive Moment Matching
  Moment matching for robust model learning.
- Prices, Bids, Values: One ML-Powered Combinatorial Auction to Rule Them All
  ML for combinatorial auctions.
11. Policy, Society, and Position Papers
- Position: The AI Conference Peer Review Crisis Demands Author Feedback and Reviewer Rewards
  Calls for reforms in peer review.
- Position: Probabilistic Modelling is Sufficient for Causal Inference
  Argues for the adequacy of probabilistic modeling for causal inference.
- Position: Generative AI Regulation Can Learn from Social Media Regulation
  Draws parallels between AI and social media regulation.
- Position: Current Model Licensing Practices are Dragging Us into a Quagmire of Legal Noncompliance
  Highlights legal risks in model licensing.
- Position: AI Agents Need Authenticated Delegation
  Argues for delegation mechanisms in AI agents.
- Position: AI Safety should prioritize the Future of Work
  Suggests work-focused priorities for AI safety.
- Position: Medical Large Language Model Benchmarks Should Prioritize Construct Validity
  Pushes for rigorous benchmarking in medical AI.
- Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation
  Empirical evaluation via competitions.
- Position: Principles of Animal Cognition to Improve LLM Evaluations
  Inspiration from animal cognition for evaluation.
- Position: Political Neutrality in AI Is Impossible — But Here Is How to Approximate It
  Discusses challenges and solutions for political neutrality in AI.
12. Miscellaneous: Specialized Models & Systems
- Rényi Neural Processes
  Probabilistic neural processes with Rényi divergences.
- The dark side of the forces: assessing non-conservative force models for atomistic machine learning
  Physics-inspired ML models.
- AutoGFM: Automated Graph Foundation Model with Adaptive Architecture Customization
  Foundation models for graphs.
- ITBench: Evaluating AI Agents across Diverse Real-World IT Automation Tasks
  IT automation evaluation suite.
- STAIR: Improving Safety Alignment with Introspective Reasoning
  Safety alignment via introspective reasoning.
- Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Deployment Settings
  Deployment-oriented classifier evaluation.
This summary omits author details for brevity and focuses solely on research content and topics.
-
wrote 5 days ago
Here's a glimpse of some truly remarkable work recognized this year:
Outstanding Papers:
- Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
  Vaishnavh Nagarajan, Chen Wu, Charles Ding, Aditi Raghunathan
- The Value of Prediction in Identifying the Worst-Off
  Unai Fischer Abaigar, Christoph Kern, Juan Perdomo
- Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions
  Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham Kakade, Sitan Chen
- Score Matching with Missing Data
  Josh Givens, Song Liu, Henry Reeve
- CollabLLM: From Passive Responders to Active Collaborators
  Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, Jianfeng Gao
- Conformal Prediction as Bayesian Quadrature
  Jake Snell, Thomas Griffiths
Outstanding Position Papers:
- AI Safety should prioritize the Future of Work
  Sanchaita Hazra, Bodhisattwa Prasad Majumder, Tuhin Chakrabarty
- The AI Conference Peer Review Crisis Demands Author Feedback and Reviewer Rewards
  Jaeho Kim, Yunseok Lee, Seulki Lee