CSPaper: sidekick of peer review

river

I made a summary of data points from KDD 2025 1st round results:

Novelty Scores	Technical Quality Scores	Confidence Scores	Rebuttal Outcome	Final Decision	Notes
3 3 3 3 3 3	4 3 3 2 3 2	–	Addressed issues	Accepted	"The rebuttal was a rollercoaster; far too hard"
2 2 3 2 2	3 3 2 2 3	3 3 3 3 3	Submitted	Rejected	"Can I just cut my losses and run?"
4 3 3 1	4 4 2 2	–	Explained issues	Rejected	"Large variance across reviewers; no score changes post-rebuttal"
3 3 3	3 3 2	–	Unsure	🟡 Unknown	"Still considering rebuttal; not sure if it's worth the effort"
3 3 3 3 3 3	3 3 3 3 3 2	–	Minor clarifications	Accepted	"Final scores unchanged but accepted after positive AC decision"
3 4 3 3 3 3	2 2 3 2 2 3	–	Clarified results	Rejected	"Novelty OK, but TQ too weak; didn't convince reviewers"
3 3 3 3 3 3	3 3 3 3 3 3	3 3 3 3 3	Submitted	Accepted	"Strong consensus; one of the smoother cases"
3 3 3	3 3 2	–	No rebuttal	Rejected	"No rebuttal submitted; borderline scores"
3 3 2 2	3 3 2 2	–	Rebuttal sent	Rejected	"Reviewers did not change their opinion"
3 3 3 3 3 3	3 3 3 3 2 2	–	Rebuttal helped	Accepted	"Accepted despite one weaker reviewer"
3 3 3 3	3 3 3 3	3 3 3 3	Rebuttal sent	🟡 Unknown	"In limbo; waiting for final decision"
3 3 3 3	2 2 2 2	–	Not convincing	Rejected	"Work deemed not ‘KDD-level’ despite rebuttal"
3 3 3 3 3 3	3 3 3 3 3 3	3 3 3 3 3	Submitted	Accepted	"Perfectly consistent reviewers; smooth acceptance"
3 3 3 2	3 3 2 2	–	Rebuttal failed	Rejected	"Low technical quality and variance led to rejection"

Note: Data sourced from community discussions on Zhihu, Reddit, and OpenReview threads. Subject to sample bias.
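
To put rough numbers on the table, here is a minimal Python sketch that averages the novelty and technical quality scores within each decision group. The rows are copied from the table above (the two "Unknown" rows are left out); the names `ROWS` and `summarize` are only for illustration, and with a sample this small and self-selected the output is no more than a rough indication.

```python
from statistics import mean

# (novelty scores, technical quality scores, final decision), copied from the table.
# Rows whose final decision is unknown are omitted.
ROWS = [
    ([3, 3, 3, 3, 3, 3], [4, 3, 3, 2, 3, 2], "Accepted"),
    ([2, 2, 3, 2, 2], [3, 3, 2, 2, 3], "Rejected"),
    ([4, 3, 3, 1], [4, 4, 2, 2], "Rejected"),
    ([3, 3, 3, 3, 3, 3], [3, 3, 3, 3, 3, 2], "Accepted"),
    ([3, 4, 3, 3, 3, 3], [2, 2, 3, 2, 2, 3], "Rejected"),
    ([3, 3, 3, 3, 3, 3], [3, 3, 3, 3, 3, 3], "Accepted"),
    ([3, 3, 3], [3, 3, 2], "Rejected"),
    ([3, 3, 2, 2], [3, 3, 2, 2], "Rejected"),
    ([3, 3, 3, 3, 3, 3], [3, 3, 3, 3, 2, 2], "Accepted"),
    ([3, 3, 3, 3], [2, 2, 2, 2], "Rejected"),
    ([3, 3, 3, 3, 3, 3], [3, 3, 3, 3, 3, 3], "Accepted"),
    ([3, 3, 3, 2], [3, 3, 2, 2], "Rejected"),
]


def summarize(rows):
    """Average the per-paper mean scores within each decision group."""
    groups = {}
    for novelty, quality, decision in rows:
        groups.setdefault(decision, []).append((mean(novelty), mean(quality)))
    return {
        decision: {
            "papers": len(papers),
            "novelty": round(mean(n for n, _ in papers), 2),
            "quality": round(mean(q for _, q in papers), 2),
        }
        for decision, papers in groups.items()
    }


if __name__ == "__main__":
    for decision, stats in summarize(ROWS).items():
        print(decision, stats)
```

In this sample the accepted group averages higher on technical quality than the rejected group, which fits the notes in the table, but individual rows overlap, so no clean score threshold separates the two.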

river

@lelecao I feel you, been there too!

river

KDD community stats

https://papercopilot.com/statistics/kdd-statistics/kdd-2025-statistics/

Screenshot 2025-04-04 at 11.11.52.png

river

The Verdict: ACL 2025 Review Scores Decoded

This year’s Overall Assessment (OA) descriptions reveal a brutal hierarchy:

5.0 "Award-worthy (top 2.5%)"
4.0 "ACL-worthy"
3.5 "Borderline Conference"
3.0 "Findings-tier" (Translation: "We’ll take it… but hide it in the appendix")
1.0 "Do not resubmit" (a.k.a. "Burn this and start over")

Pro tip: A 3.5+ OA avg likely means main conference; 3.0+ scraps into Findings. Meta-reviewers now hold life-or-death power—one 4.0 can save a 3.0 from oblivion.
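
The pro tip boils down to two cut-offs on the average OA, which can be written as a tiny Python helper. This is only the post's rule of thumb, not an official ACL policy; the function name `likely_track` and the return labels are made up for illustration, and the meta-reviewer can still override the averages.

```python
def likely_track(oa_scores):
    """Map a paper's reviewer OA scores to the post's rule-of-thumb outcome."""
    if not oa_scores:
        raise ValueError("need at least one OA score")
    average = sum(oa_scores) / len(oa_scores)
    if average >= 3.5:
        return "main conference (likely)"
    if average >= 3.0:
        return "Findings (likely)"
    return "below the Findings rule of thumb"


if __name__ == "__main__":
    for scores in ([4.0, 3.5, 3.0], [3.0, 3.0, 3.5], [2.5, 3.0, 2.0]):
        print(scores, "->", likely_track(scores))
```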

Nightmare Fuel: The 6-Reviewer Special

"Some papers got 6 reviewers—likely because emergency reviewers were drafted last-minute. Imagine rebutting 6 conflicting opinions… while praying the meta-reviewer actually reads your response."

Rebuttal strategy:

2.0? "Give up." (Odds of salvation: ~0%)
2.5? "Worth a shot."
3.0? "Fight like hell."

The ARR Meat Grinder Just Got Worse

New changes to ARR (ACL Rolling Review, a.k.a. the "Academic Rebuttal Rumble"):

5 cycles/year now (April’s cycle vanished; June moved to May).
EMNLP’s deadline looms closer — less time to pivot after ACL rejections.
LLM stampede: "8,000+ submissions per ARR cycle!"

"Back in the day, ACL had 3,000 submissions. No Findings, no ARR, no LLM hype-train. Now it’s just a content farm with peer review."

How to Survive the Madness

Got a 3.0? Pray your meta-reviewer is merciful.
🤬 Toxic review? File an "issue" (but expect crickets).
ARR loophole: "Score low in Feb? Resubmit to May ARR and aim for EMNLP."

The Big Picture: NLP’s Broken Incentives

Reviewer fatigue: Emergency reviewers = rushed, clueless feedback.
LLM monoculture: 90% of papers are "We scaled it bigger" or "Here’s a new benchmark for our 0.2% SOTA."
Findings graveyard: Where "technically sound but unsexy" papers go to die.

Final thought: "If you’re not gaming the system, the system is gaming you."

Adapted from JOJO极智算法 (2025-03-28)

Share your ACL 2025 horror stories below! Did you rebut or run?

river

Originally posted by Zhihu user “灰瞳六分仪” (Huítóng Liùfēnyí) on: March 28, 2025, 16:47

I’d like to point out a paper that I found rather disturbing — I noticed that the ICLR 2025 paper Mixture of Attentions (MoA) is strikingly similar to the ICML 2024 paper GliDe. Let’s first drop the links to both papers:

MoA: https://openreview.net/pdf?id=Rz0kozh3LE

GliDe: https://openreview.net/pdf?id=mk8oRhox2I

To be honest, I initially had no intention of bringing this to Zhihu — after all, allegations of potential academic misconduct are very serious. So I first posted a public comment on OpenReview, politely asking the authors for clarification. But the author’s response and the excuse made in the camera-ready version really infuriated me. So now I’m going to break this down properly.

The Excuse That Made Me Mad

Let’s look at how the authors themselves explained the difference between MoA and GliDe in the camera-ready version:

Du et al. (2024) previously proposed to leverage the KV-cache of some layers of 𝓜_{large}. They do not justify why using the KV-cache instead of the output of each layer, nor how to exactly choose which layer to include as input of 𝓜_{small}. However, with our dynamical system point of view, we showed that the KV-cache of all the layers is part of the state. The introduction of LSA allows to exploit it in its whole with a limited number of layers, whereas Du et al. (2024) would need to have the same number of layers in 𝓜_{small} and 𝓜_{large} to fully capture it, resulting in a slow drafting speed.

This explanation claims a “lack of justification” from Du et al. (2024), but…

This Is Just Wrong.

GliDe only used the last layer’s KV cache — and explicitly ran ablation studies to justify this. Check Figure 7 of the GliDe paper if you don’t believe me.

Such a misleading statement raises the question: Were MoA’s authors intentionally trying to confuse readers and obscure the similarity between the two works?

A Side-by-Side Comparison of Core Methods

Let me help the MoA authors do a better comparison between GliDe and MoA. Below are the main method diagrams from both papers:

Screenshot 2025-03-28 at 19.14.32.png

So yeah, the core frameworks are nearly identical. If we’re nitpicking, the difference is:

GliDe uses only the last layer’s KV cache
MoA takes the average across all layers’ KV cache

I’ve summarized this comparison in the table below.

Method	GliDe (ICML 2024)	MoA (ICLR 2025)
Core Process	Pass input through the target model (large model)'s embedding layer, then generate query via self-attention, and finally compute cross-attention with the target model's KV cache	Same as GliDe: embedding → self-attention → compute cross-attention with KV cache from the target model
Use of Target Model's KV Cache	Only uses the last layer of the target model's KV cache	Uses the entire KV cache from all layers of the target model
Draft Model Layers	1 layer for 7B/13B models, 2 layers for 33B model	1 layer
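
To make the one claimed difference concrete, here is a toy Python sketch of a draft-model query attending over the target model's KV cache, with the cache chosen in the two ways described above. It is not code from either paper: the single-head attention, the function names, the "layer_mean" reading of MoA (taken from the list above) and the toy numbers are all assumptions for illustration.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head attention of one draft-model query over cached keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * value[d] for w, value in zip(weights, values))
            for d in range(len(values[0]))]


def layer_mean(matrices):
    """Element-wise mean of same-shaped [positions][dim] matrices."""
    count = len(matrices)
    return [[sum(m[p][d] for m in matrices) / count
             for d in range(len(matrices[0][0]))]
            for p in range(len(matrices[0]))]


def select_cache(kv_cache, mode):
    """kv_cache is a list over target-model layers of (keys, values) pairs."""
    if mode == "last_layer":   # GliDe, as described in the post
        return kv_cache[-1]
    if mode == "layer_mean":   # MoA, as described in the post
        keys = layer_mean([layer[0] for layer in kv_cache])
        values = layer_mean([layer[1] for layer in kv_cache])
        return keys, values
    raise ValueError(f"unknown mode: {mode}")


# Toy cache: 2 layers, 2 cached positions, hidden size 2 (made-up numbers).
toy_cache = [
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),
    ([[0.0, 1.0], [1.0, 0.0]], [[5.0, 6.0], [7.0, 8.0]]),
]
query = [1.0, 0.0]
for mode in ("last_layer", "layer_mean"):
    keys, values = select_cache(toy_cache, mode)
    print(mode, cross_attention(query, keys, values))
```

Everything except `select_cache` is shared between the two modes, which is the point the comparison table makes: under this reading, the two methods differ only in which part of the target model's cache feeds the cross-attention.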

Citation Issues in MoA: Avoiding GliDe on Purpose?

Despite the similarity, the initial draft of MoA didn’t cite GliDe at all! This likely caused the four reviewers and AC to miss the similarity in methodology — which probably explains why MoA scored so highly (6,6,8,8).

Academic ethics require properly citing related work, especially if the work is highly similar. I’m not going to declare this academic misconduct, but I do believe the authors benefited from this omission.

Side Note: Why Did GliDe Perform Worse Than EAGLE, But MoA Didn't?

I figured this out while working on LongSpec. In my view, GliDe actually has high potential, because the information in hidden states should also exist in the KV cache. I wrote in a previous reading note:

“This paper shares the same idea as EAGLE — both reuse the target model’s knowledge of the prompt. The only difference is: this paper uses the KV cache, while EAGLE uses hidden states. Logically, KV cache should be slightly better.”

According to experts at together.ai, their reproduction results also show that MoA performs slightly better.

So why did GliDe underperform in its paper?

The original GliDe code had quite a few bugs.
During MT-Bench evaluation, GliDe only evaluated the first turn — discarding subsequent prompts. The later turns, usually involving error correction, should have contributed significantly to accept rate.

So honestly, the main contribution of MoA is just getting GliDe to work properly.

Conclusion

Based on the above analysis, here’s my personal opinion:

The core methods of MoA and GliDe are highly similar — with little innovation — this is essentially a duplicate.
MoA’s authors misrepresented GliDe’s method in their camera-ready version to create the illusion of difference.
MoA’s draft did not cite GliDe, possibly intentionally — which violates academic norms.

Regrettably, I didn’t discover this paper during the discussion phase, so I missed the opportunity to raise it with reviewers. I later submitted a report to the ICLR program committee — but it went nowhere. In the end, MoA was still accepted to ICLR 2025.

What do you think — does this count as academic misconduct?

river

@lelecao yes, that's correct

river

Here is a crowd-sourced score distribution for this year:
https://papercopilot.com/statistics/icml-statistics/icml-2025-statistics/

And, you can also refer to the previous year's score distributions in relation to accept/reject:
https://papercopilot.com/statistics/icml-statistics/

Screenshot 2025-03-21 at 01.54.33.png

river

Recently, someone surfaced (again) a method to query the decision status of a paper submission before the official release for ICME 2025. By sending a request to a specific API endpoint in the CMT system (https://cmt3.research.microsoft.com/api/odata/ICME2025/Submissions(Your_paper_id)), one can read the submission status from the StatusId field, where 1 means pending, 2 indicates acceptance, and 3 indicates rejection.

This trick is not limited to ICME 2025. It appears that the same method can be applied to several other conferences, including IJCAI, ICASSP, IJCNN and ICMR.

However, it is important to emphasize that using this technique violates the fairness and integrity of the peer-review process. Exploiting such a loophole undermines the confidentiality and impartiality that are essential to academic evaluations. This is a potential breach of academic ethics, and an official fix is needed to prevent abuse.

Below is a simplified Python script that demonstrates how this status monitoring might work. Warning: This code is provided solely for educational purposes to illustrate the vulnerability. It should not be used to bypass proper review procedures.

import requests
import time
import smtplib
from email.mime.text import MIMEText
from email.header import Header
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler("submission_monitor.log"),
        logging.StreamHandler()
    ]
)

# List of submission URLs to monitor (replace 'Your_paper_id' accordingly)
SUBMISSION_URLS = [
    "https://cmt3.research.microsoft.com/api/odata/ICME2025/Submissions(Your_paper_id)",
    "https://cmt3.research.microsoft.com/api/odata/ICME2025/Submissions(Your_paper_id)"
]

# Email configuration (replace with your actual details)
EMAIL_CONFIG = {
    "smtp_server": "smtp.qq.com",
    "smtp_port": 587,
    "sender": "your_email@example.com",
    "password": "your_email_password",
    "receiver": "recipient@example.com"
}

def get_status(url):
    """
    Check the submission status from the provided URL.
    Returns the status ID and a success flag.
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0',
            'Accept': 'application/json',
            'Referer': 'https://cmt3.research.microsoft.com/ICME2025/',
            # Insert your cookie here after logging in to CMT
            'Cookie': 'your_full_cookie'
        }
        response = requests.get(url, headers=headers, timeout=30)
        if response.status_code == 200:
            data = response.json()
            status_id = data.get("StatusId")
            logging.info(f"URL: {url}, StatusId: {status_id}")
            return status_id, True
        else:
            logging.error(f"Failed request. Status code: {response.status_code} for URL: {url}")
            return None, False
    except Exception as e:
        logging.error(f"Error while checking status for URL: {url} - {e}")
        return None, False

def send_notification(subject, message):
    """
    Send an email notification with the provided subject and message.
    """
    try:
        msg = MIMEText(message, 'plain', 'utf-8')
        msg['Subject'] = Header(subject, 'utf-8')
        msg['From'] = EMAIL_CONFIG["sender"]
        msg['To'] = EMAIL_CONFIG["receiver"]

        server = smtplib.SMTP(EMAIL_CONFIG["smtp_server"], EMAIL_CONFIG["smtp_port"])
        server.starttls()
        server.login(EMAIL_CONFIG["sender"], EMAIL_CONFIG["password"])
        server.sendmail(EMAIL_CONFIG["sender"], [EMAIL_CONFIG["receiver"]], msg.as_string())
        server.quit()
        logging.info(f"Email sent successfully: {subject}")
        return True
    except Exception as e:
        logging.error(f"Failed to send email: {e}")
        return False

def monitor_submissions():
    """
    Monitor the status of submissions continuously.
    """
    notified = set()
    logging.info("Starting submission monitoring...")

    while True:
        for url in SUBMISSION_URLS:
            if url in notified:
                continue

            status, success = get_status(url)
            if success and status is not None and status != 1:
                email_subject = f"Submission Update: {url}"
                email_message = f"New StatusId: {status}"
                if send_notification(email_subject, email_message):
                    notified.add(url)
                    logging.info(f"Notification sent for URL: {url} with StatusId: {status}")

        if all(url in notified for url in SUBMISSION_URLS):
            logging.info("All submission statuses updated. Ending monitoring.")
            break

        time.sleep(60)  # Wait for 60 seconds before checking again

if __name__ == "__main__":
    monitor_submissions()

Parting thoughts

While the discovery of this loophole may seem like an ingenious workaround, it is fundamentally unethical and a clear violation of the fairness expected in academic peer review. Exploiting such vulnerabilities not only compromises the integrity of the review process but also undermines the trust in scholarly communications.

We recommend that the CMT system administrators implement an official fix to close this gap. The academic community should prioritize fairness and the preservation of rigorous, unbiased review standards over any short-term gains that might come from exploiting such flaws.

river

In addition to the CS-ReFT paper, Zochi has a second paper accepted to an ICLR 2025 workshop:

Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search

This paper presents an automated framework designed to detect and exploit security vulnerabilities in large language models (LLMs) through a sophisticated multi-turn approach based on tree search algorithms. Remarkably, the paper reports achieving a 100% jailbreak success rate on GPT-3.5-Turbo and a 97% success rate on GPT-4, raising serious questions about the robustness of existing safeguards implemented by leading AI models.

Screenshot 2025-03-19 at 11.21.28.png

Reviewers described Siege as an "effective and intuitive method", pointing out the need for the community to re-evaluate current AI defense strategies. The work sharpens a broader concern: if AI-driven methods can autonomously discover and exploit critical security flaws in widely used LLMs, how should the research community respond to such vulnerabilities?

Both the CS-ReFT and Siege papers highlight not just the capabilities of AI-driven research but also the ethical and practical dilemmas emerging from automated scientific exploration and discovery.

river

In a recent development, a research paper titled "Compositional Subspace Representation Fine-tuning for Adaptive Large Language Models" was entirely authored by AI and accepted at an ICLR 2025 workshop. Not only was the paper crafted within just one week by an autonomous AI scientist system, but it also garnered notably positive reviews from human reviewers (scores of 7/6/7).

The study presents CS-ReFT, a novel fine-tuning method addressing the notorious "cross-skill interference", an issue where improving one capability of a language model inadvertently reduces another. It introduces task-specific transformations within the hidden-state space rather than modifying model weights, dramatically enhancing performance. The paper claims that using fewer than 0.01% of parameters, this method allowed a 7B Llama-2 model to outperform GPT-3.5 Turbo significantly (93.94% vs. 86.30%) on the AlpacaEval benchmark.
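
The post does not give CS-ReFT's equations, so the following Python sketch only illustrates the general idea of editing a hidden state while the model weights stay frozen, plus the parameter-budget arithmetic behind the "fewer than 0.01%" claim. The gated low-rank form, the function names, the rank of 4 and the toy numbers are assumptions for illustration and not the paper's formulation; 4096 is Llama-2 7B's hidden size and 7 billion is its nominal parameter count.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def edit_hidden_state(hidden, skill_edits, gates):
    """Add gated low-rank edits to a hidden state; model weights are untouched.

    skill_edits: one (down, up) pair per skill, with down of shape [r][d]
    and up of shape [d][r]. gates: one mixing weight per skill.
    """
    edited = list(hidden)
    for (down, up), gate in zip(skill_edits, gates):
        low_rank = matvec(down, hidden)   # project d -> r
        delta = matvec(up, low_rank)      # project r -> d
        edited = [e + gate * step for e, step in zip(edited, delta)]
    return edited


def edit_parameters(hidden_size, rank):
    """Trainable parameters in one (down, up) pair."""
    return 2 * rank * hidden_size


# Budget arithmetic: 0.01% of a nominal 7B-parameter model.
budget = 7_000_000_000 // 10_000
per_edit = edit_parameters(hidden_size=4096, rank=4)
print(budget, per_edit, budget // per_edit)

# Toy example: hidden size 2, one rank-1 edit applied at half strength.
toy_edit = ([[1.0, 0.0]], [[0.0], [1.0]])
print(edit_hidden_state([1.0, 2.0], [toy_edit], [0.5]))
```

The arithmetic shows why edits in hidden-state space are cheap: a rank-4 edit on a 4096-dimensional hidden state costs tens of thousands of parameters, so a budget of 0.01% of a 7B model leaves room for only a handful of such edits. How CS-ReFT actually spends that budget is not stated in the post.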

Screenshot 2025-03-19 at 11.12.36.png

"A clever idea," noted the reviewers, emphasizing the method's efficacy and elegant simplicity.

However, this AI-driven "success" raises important questions for the peer review process and academia:

Accountability and verification: With AI authoring complete studies autonomously in under a week, how should we ensure rigorous verification and accountability in research?
Human role in research: Does the presence of AI as the primary "author" diminish the role and value of human creativity and critical insight in research?
Peer review challenges: As AI systems rapidly generate compelling and high-quality content, how will peer reviewers adapt to differentiate between innovative research and sophisticated algorithmic outputs?
Ethical boundaries: As AI increasingly participates in research, how do we delineate between productive assistance and ethical misuse, such as plagiarism or misrepresentation?

These concerns underline a crucial discussion: how do we maintain trust and integrity in scholarly work while benefiting from AI's remarkable efficiencies?

The papers were produced by Zochi, a system created by Ron and Andy: https://www.intology.ai/blog/zochi-tech-report

Screenshot 2025-03-19 at 11.11.14.png

So, how should AI be integrated responsibly into the future of scientific research?

Share your thoughts below!

river

CVPR 2025 has introduced new policies to address the issue of irresponsible reviewing. Under the new guidelines, reviewers who fail to submit timely and thorough reviews may have their own paper submissions desk-rejected at the discretion of the Program Chairs. This move aims to enhance the quality and fairness of the peer-review process.

In a recent announcement, Area Chairs (ACs) of CVPR 2025 identified a number of highly irresponsible reviewers, those who either abandoned the review process entirely or submitted egregiously low-quality reviews, including some generated by large language models (LLMs). Following a thorough investigation, the Program Chairs (PCs) decided to desk-reject 19 papers authored by confirmed highly irresponsible reviewers, which would have been accepted otherwise, in accordance with the previously communicated CVPR 2025 policies. The affected authors have been informed of this decision.

This action underscores CVPR's commitment to maintaining high standards in academic publishing. While some may view this collective accountability as controversial, many in the research community support these measures as essential for upholding the integrity of the conference.

These policies reflect a broader trend in the academic community toward holding reviewers accountable for their contributions to the peer-review process. By ensuring that reviewers provide timely and constructive feedback, CVPR aims to foster a more equitable and rigorous academic environment.

river

Paper: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Conference: ICLR 2024
Result: Reject

Mamba was an ambitious paper proposing a state-space model (SSM) architecture that scales linearly with sequence length and even claims to rival Transformers on language modeling. It generated a lot of buzz for its potential to handle long sequences efficiently. However, when it was submitted to ICLR 2024, the paper was rejected – a decision that surprised many in the research community. Let’s break down why the reviewers (and area chair) said "No", how the community reacted, and whether the critiques seem fair or overly harsh.

Official Criticisms from Reviewers and Area Chair

The ICLR reviewers and area chair raised several key criticisms of Mamba, mostly targeting its technical evaluation, clarity, and claimed novelty:

Baseline & Novelty Concerns: Reviewers felt that Mamba didn’t sufficiently compare against prior art. The paper follows a line of research on efficient sequence models (e.g. S4-diagonal, SGConv, MEGA, SPADE, and other near-linear transformers), so they expected comparisons to those existing models. In particular, one reviewer flagged the absence of a direct comparison to H3, a hybrid SSM/attention model that the authors cited as inspiration. Published results showed H3 achieving much better perplexities on The Pile benchmark, yet Mamba’s submission didn’t include H3 in its experiments. This omission made it hard for reviewers to judge if Mamba truly outperformed the best prior methods. In short, the novelty claim ("state-of-the-art across modalities") was undermined by the lack of head-to-head baseline results with all relevant models.
Experimental Evaluation (Technical Thoroughness): Several critiques focused on the evaluation metrics and benchmarks used. One glaring issue was the absence of Long Range Arena (LRA) results, which a reviewer noted is "the standard benchmark for long sequence understanding". Given Mamba’s focus on long sequences, not testing on LRA was seen as a major omission. Additionally, the paper’s results were largely centered on language modeling perplexity (how well the model predicts text), with relatively few downstream or task-specific evaluations. In fact, a reviewer remarked that the authors only provided zero-shot language modeling results, which was "rather limited" evidence of effectiveness – they specifically recommended adding a long-document task like document summarization (e.g. using arXiv papers with 8k+ tokens) to demonstrate Mamba’s utility on real long-context tasks. The heavy reliance on perplexity as the primary metric raised eyebrows, since it covers just one aspect of performance. In summary, the reviewers wanted to see Mamba prove itself on a broader range of benchmarks (both standard academic ones like LRA and more applied tasks) rather than only the chosen evaluations in the submission.
Clarity of Presentation: While not as dominant as the technical issues, there were also remarks about the paper’s clarity and organization. The "overarching narrative was slightly confused", as one observer put it. The flow of the paper and the motivation for each component didn’t land perfectly for everyone. One Reddit commenter (echoing what a reviewer might feel) said reading the paper "left me with a ton of questions along the lines of ‘What about performance on X task or Y benchmark?’". This suggests that the exposition may not have clearly anticipated or answered important questions, making the contribution harder to assess. In a conference setting, lack of clarity can amplify doubts – if reviewers aren’t 100% sure they understand the method and its implications, they tend to be more critical. So any confusion in the writing/narrative likely didn’t help Mamba’s case.
Other Technical Issues: One reviewer flagged Mamba’s efficiency claims, noting that its "Speed and Memory Benchmarks" only reported speed, not memory usage, which is a key factor for long-sequence models. Another concern was that Mamba still requires quadratic memory during training, similar to Transformers, despite its linear-time inference, potentially limiting its advantage. Reviewers also questioned its length generalization — whether a model trained on short sequences (e.g., 1k tokens) could effectively handle much longer ones (e.g., 10k tokens), something some Transformer variants achieve with relative position embeddings. These gaps raised doubts about Mamba’s scalability and real-world applicability.

In the official meta-review and scores, it seems one reviewer in particular gave a very low score ("3: reject, not good enough") and strongly argued the above points. The area chair ultimately agreed with these critiques. Essentially, the verdict was that Mamba’s paper, as submitted, fell short on technical evaluation and clarity, despite its interesting ideas. The authors did try to address these issues in a revision (they even added the missing H3 comparisons, where Mamba actually came out ahead once tested), but it wasn’t enough to reverse the decision. The official stance was that the paper needed more work to meet ICLR’s bar.

It’s worth noting that the reviewers did recognize some positives – for example, they acknowledged the importance of efficient long-sequence modeling and found the idea of input-dependent state-space parameters intriguing. One can infer that at least some reviewers were impressed by the 5× speedup and the novel "associative scan" trick for fast training. However, these strengths were mentioned only in passing compared to the critiques, and ultimately the criticisms carried more weight.

Community Reactions: Agreement and Dissent

The rejection of Mamba sparked extensive discussion on social media and forums. Reactions from the research community were mixed – some aligned with the reviewers’ reasoning, while many others sharply diverged from it.

On one side, a number of researchers actually agreed (at least in part) with the official criticisms. For instance, the original poster in one Reddit discussion admitted that after reading the paper, they were "not particularly surprised" it got rejected. They felt that beyond some interesting hardware-aware tweaks, Mamba seemed "like it was a simple adaptation of a previous paper" and that the experiments were "not as extensive" as expected. This perspective basically echoes the novelty and thoroughness concerns raised by reviewers. Such folks argued that just because Mamba had Twitter hype doesn’t automatically guarantee a free pass at a conference – the paper still needed to tick certain boxes. In their eyes, the ICLR committee’s skepticism was justified, given the unanswered questions about baselines and performance on other benchmarks.

However, a much larger and louder contingent of the community was taken aback by the rejection – and many felt it was the wrong call. On Twitter (X), numerous prominent researchers voiced surprise. Perhaps the most notable reaction came from Sasha Rush, who tweeted "Mamba apparently was rejected!? ... Honestly I don’t even understand. If this gets rejected, what chance do us [small labs] have?". This sentiment ("if this got rejected, what can ever get in?") was shared by others who viewed Mamba as an exciting advance. The general feeling among these folks was bewilderment: Mamba had shown strong results (e.g. matching a Transformer twice its size, according to the preprint) and tackled a timely problem, so rejecting it felt puzzling if not outright unfair.

On Reddit, many commenters pushed back against the reviewers’ rationale. Some argued that Mamba "should have gotten in" and blamed the outcome on bad luck with one stringent reviewer and an area chair who "just ran with it". There was a sense that the paper might have been accepted if a different, more sympathetic committee had handled it. Others specifically disagreed with the insistence on certain benchmarks: for example, one discussion highlighted that the reviewers’ "final insistence on long range arena evaluation" was odd because Mamba had already demonstrated performance on tasks with far longer sequences (millions of tokens), making LRA seem like a dated, "relatively facile benchmark" in comparison. In short, many in the community felt the reviewers were "nitpicking" – focusing on somewhat formulaic requirements (like running a legacy benchmark or adding one more baseline) while downplaying the paper’s real innovations. As one frustrated commenter quipped, "they achieved linear-time sequence modeling that outperforms Transformer++… a goal dozens of labs have chased for years. If that’s not enough for an ICLR paper, then I think I’ll remove my ICLR publication from the web ’cuz I’m not worthy either." That tongue-in-cheek remark captures the disbelief and concern that the bar was set extremely high for Mamba. Some even used clown emojis to mock the decision, reflecting a broader frustration with the peer review process.

In summary, the public opinion was split. A minority nodded along with the official reasons (lack of certain results, overhyped claims), but a majority seemed to think Mamba got a raw deal. This divergence between the official verdict and the community buzz made the case of Mamba’s rejection particularly noteworthy.

Were the Criticisms Reasonable or Overly Harsh?

Now comes the big question: looking at all these perspectives, were the reviewers being perfectly reasonable guardians of scientific rigor, or were they overly harsh on Mamba? The answer is a bit of both – let’s unpack it.

On one hand, the criticisms have merit in principle. It’s not outlandish for reviewers to expect a paper to include standard benchmarks and thorough comparisons. Missing an evaluation like LRA does leave a gap, since it’s a common yardstick to compare long-sequence models. Similarly, not including a baseline that is known to be strong (H3) or not testing an obvious use-case (long document summarization) are valid shortcomings. These are the kind of omissions that peer reviewers regularly point out, and normally authors address them in revisions. In Mamba’s case, the authors even did add some of those missing pieces in their rebuttal (they ran the H3 benchmark and showed improved results). So the content of the critiques wasn’t crazy: the reviewers were basically asking for more evidence to back up Mamba’s claims, which is a reasonable thing at a top conference.

On the other hand, many feel the bar was set unusually high for this paper, perhaps higher than necessary for a fair evaluation. The demands for certain benchmarks like LRA, for example, struck some as pedantic or dated. Mamba was tackling sequences far longer and more complex than those in LRA, so insisting on LRA (whose tasks use sequences of roughly 1K to 16K tokens and are relatively simple) might be applying a checkbox mentality rather than engaging with what the paper actually achieved. In that sense, the criticism could be seen as overly rigid: focusing on an older benchmark just because it’s conventional, rather than acknowledging that Mamba introduced its own, perhaps more relevant, evaluations. Likewise, the knock on using "only" perplexity could be viewed as a bit harsh – perplexity is a standard metric in language modeling, and Mamba did also report results on other modalities (like audio and genomics) in the paper, albeit less prominently. It’s arguable that the reviewers were technically correct in wanting more, but maybe didn’t fully appreciate what was already there. The clarity issues are hard to judge: if the writing truly confused multiple readers, that’s a fair reason to be cautious. But the core ideas weren’t unsound, and the presentation could presumably have been clarified in the camera-ready version.

Another aspect is how unforgiving the decision was. Many papers get conditional accepts or at least an encouragement to resubmit after adding experiments, whereas Mamba was flat-out rejected despite its revisions. This led some to feel the reviewers/AC were inflexible or perhaps "overly nitpicky." One Reddit commenter even speculated that Mamba "had really bad luck with the AC… with those reviews, it should have gotten in". This suggests that, in a different scenario, the same paper might have squeaked through. So, in hindsight, the criticisms were real and important, but maybe too heavy-handed in the final decision. The novel contributions (a new selective SSM mechanism, a custom CUDA-kernel-powered parallel scan for RNN training, impressive speedups) were substantial – arguably more so than many papers that did get accepted – so some feel the committee could have given Mamba the benefit of the doubt.

In the end, whether the rejection was "fair" is debatable. The critiques weren’t random – they pinpointed genuine areas for improvement. But given how excited a lot of researchers were about Mamba, there’s a sense that the paper was held to an exceptionally strict standard. Perhaps a middle ground view is that the reviewers were cautious but not crazy: they wanted a more polished and comprehensive submission. It’s just unfortunate (and a bit ironic) that a paper aiming to push the boundaries had its wings clipped by adherence to very conventional evaluation criteria.

It’s worth briefly acknowledging that Mamba’s strong points were noted too, even if they ultimately got overshadowed. Some reviewers and commenters were impressed by the paper’s innovations. For example, the idea of making state-space model parameters input-dependent (the "selective" part of Mamba) was seen as a clever way to enable content-based reasoning in an RNN-like model. The authors’ implementation of an associative scan algorithm (essentially a parallel prefix-sum trick) to train the model in linear time was praised as "fascinating and novel". The model demonstrated 5× faster inference throughput than Transformers and achieved lower perplexity than similarly-sized Transformers, which one commenter noted is a feat that "labs throughout the world have been tirelessly chasing for years". In evaluations, Mamba hit state-of-the-art results on certain long-context tasks, and the higher-scoring reviewers likely acknowledged these strengths. All told, the paper’s potential was clearly recognized.
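To make the "parallel prefix-sum trick" concrete, here is a minimal sketch of an associative scan applied to a scalar linear recurrence h_t = a_t * h_{t-1} + b_t with h_0 = 0. This is not the authors’ CUDA kernel: the scalar state, the function names, and the simple Hillis-Steele scan schedule are all assumptions made for illustration. The key observation is that each step is an affine map, and composing affine maps is associative, so the steps can be combined in a tree rather than strictly left to right. In a selective model, a_t and b_t would be computed from the input at step t; here they are just random numbers.

```python
import numpy as np


def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first, then `right`."""
    a_l, b_l = left
    a_r, b_r = right
    return a_l * a_r, a_r * b_l + b_r


def sequential_scan(a, b):
    """Reference: run the recurrence h_t = a_t * h_{t-1} + b_t step by step."""
    h = 0.0
    out = np.empty(len(a))
    for t in range(len(a)):
        h = a[t] * h + b[t]
        out[t] = h
    return out


def associative_scan(a, b):
    """Inclusive scan in O(log T) rounds (Hillis-Steele schedule).

    Each round is one vectorized operation, standing in for work that
    parallel hardware would do simultaneously across positions.
    """
    a = np.asarray(a, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    shift = 1
    while shift < len(a):
        # Position t absorbs the partial result ending at position t - shift.
        new_a, new_b = combine((a[:-shift], b[:-shift]), (a[shift:], b[shift:]))
        a[shift:], b[shift:] = new_a, new_b
        shift *= 2
    # With h_0 = 0, the offset term of each composed map is h_t itself.
    return b


# Hypothetical per-step coefficients (in a selective SSM these depend on the input).
rng = np.random.default_rng(0)
a_demo = rng.uniform(0.5, 1.0, size=16)
b_demo = rng.normal(size=16)
assert np.allclose(sequential_scan(a_demo, b_demo), associative_scan(a_demo, b_demo))
```

Note that this simple schedule does O(T log T) total work across its O(log T) rounds; work-efficient variants (such as the Blelloch scan) bring the total back to O(T), which is the kind of scaling the "linear time" claim refers to.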

Broader Implications

This whole Mamba saga has some interesting broader takeaways for sequence modeling research and the ML community at large. Firstly, it underscores just how high the bar is to challenge Transformers. Even when you have a model that, on paper, can outperform Transformers on some tasks with better scaling, the community (and conference reviewers) will demand very thorough proof. Mamba’s rejection sends a message: if you’re proposing a new architecture in this space, you’d better dot your i’s and cross your t’s in terms of evaluations. Established benchmarks (no matter how old) will likely need to be checked off to satisfy everyone. This might be frustrating (as the Mamba debate showed, some benchmarks might be outdated), but it reflects a cautious approach – extraordinary claims require solid evidence in all areas.

The incident also highlights a bit of a disconnect between hype and peer review. It’s a reminder that a paper can be widely discussed and admired online yet still get knocked back in formal review. In the long run, though, the community discussion can have positive effects. In Mamba’s case, the outcry and interest may well motivate the authors (and others in the field) to improve evaluation standards. We might see future sequence modeling papers include both new and old benchmarks to avoid a "Mamba situation." There’s also a chance that the collective criticism of the review process here (with researchers posting clown emojis and expressing confusion) could add to calls for improving how we review innovative papers – perhaps encouraging more open-mindedness or flexibility for unconventional but promising work.

For the subfield of state-space models and RNN variants, Mamba’s journey is not over. The authors are likely working on a revised version (they’ve already shown some fixes like the H3 comparison in the rebuttal). If Mamba (or a "Mamba v2") manages to address the concerns and get published, it could validate the approach and give a boost to non-transformer sequence models. As one blog noted, companies and researchers are probably already testing Mamba’s ideas at larger scales to see if the performance holds up (Mamba: The Easy Way). If it does, it could influence the design of future large language models, offering a path to more efficient long-context processing. In a broader sense, the rejection sparked such a big conversation that it actually drew more attention to the work. Inadvertently, the controversy put Mamba on more people’s radar, which might accelerate progress (the old "any publicity is good publicity" adage).

Finally, this incident might cause researchers to reflect on benchmarking and evaluation practices in sequence modeling. Should we continue relying on benchmarks like LRA? How do we balance demonstrating real-world impact (e.g. long document tasks) versus standardized tests? Mamba opened up those questions. Some feel that clinging to older, easier benchmarks can hold the field back, so perhaps the community will develop new, more relevant benchmarks for long-context models as a result.

In conclusion, the rejection of Mamba at ICLR 2024 was not just a single paper’s setback; it became a talking point with lessons from several perspectives. The official reasons boiled down to "promising idea, but show us more," focusing on technical rigor, clarity, and completeness. The public reaction ranged from "the reviewers had a point" to "the reviewers missed the point," highlighting a healthy tension in how we assess new research. Were the critiques fair? In part yes, though many argue they were a bit too conservative. The silver lining is that Mamba’s ideas are now out in the open and being refined, and the whole episode could inspire better evaluations and perhaps more balanced reviewing of bold research. In the fast-evolving field of sequence modeling, the Mamba story is a reminder that innovation often comes hand-in-hand with controversy, and that the path to acceptance (both social and academic) may require not just a great idea, but also the right evidence presented in the right way.