Skip to content
  • Categories
  • CSPaper Review
  • Recent
  • Tags
  • Popular
  • Paper Copilot
  • OpenReview.net
  • Deadlines
  • CSRanking
Skins
  • Light
  • Cerulean
  • Cosmo
  • Flatly
  • Journal
  • Litera
  • Lumen
  • Lux
  • Materia
  • Minty
  • Morph
  • Pulse
  • Sandstone
  • Simplex
  • Sketchy
  • Spacelab
  • United
  • Yeti
  • Zephyr
  • Dark
  • Cyborg
  • Darkly
  • Quartz
  • Slate
  • Solar
  • Superhero
  • Vapor

  • Default (No Skin)
  • No Skin
Collapse
CSPaper

CSPaper: peer review sidekick

  1. Home
  2. Peer Review in Computer Science: good, bad & broken
  3. Artificial intelligence & Machine Learning
  4. AAAI First to Pilots AI-Assisted Peer Review, Stirring Global Academia

AAAI First to Pilots AI-Assisted Peer Review, Stirring Global Academia

Scheduled Pinned Locked Moved Artificial intelligence & Machine Learning
aaai20252026ai reviewpeer review
3 Posts 3 Posters 382 Views
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • JoanneJ Offline
    JoanneJ Offline
    Joanne
    wrote on last edited by root
    #1

    54fa134e-d5bf-4934-9d5e-196eca41253b-image.png
    On May 16 the Association for the Advancement of Artificial Intelligence (AAAI) announced that its AAAI‑26 conference will invite large language models (LLMs) into the review pipeline as “supplementary reviewers” and “discussion‑note assistants.” It is the first time a top‑tier AI conference has institutionalised generative AI at scale inside the formal peer‑review chain, heralding a new era of human AI co-review.


    Pilot Plan: AI Gets a Seat but Doesn’t Grab the Wheel

    • Two touch‑points

      • Extra first‑round review : An LLM will file a parallel review that appears next to the reports from at least two human reviewers.
      • Discussion summariser : During reviewer debates the LLM will distil points of agreement and disagreement for the Senior Program Committee.
    • Four bright lines

      1. No reduction in human‑reviewer head‑count.
      2. No numerical scoring from the LLM.
      3. No automated accept/reject decisions.
      4. Every AI output must pass a human sanity‑check.
    • Official stance
      AAAI President Stephen Smith calls the move “a careful, step‑by‑step experiment” designed to augment, not replace, human judgement.


    Global Academia: Sweet Efficiency Meets Integrity Jitters

    Focus What Enthusiasts Say What Skeptics Worry About
    Efficiency LLMs can weed out weak submissions and spit out tidy outlines, easing reviewer overload. Over‑reliance may flatten nuance and encourage rubber‑stamp reviews.
    Quality Early surveys show ~40 % of authors find AI reviews as helpful—if not more. Hallucinations & bias could creep in, parroting author claims or sowing errors.
    Ethics Private deployments keep data under wraps. No universal rules yet for attribution and confidentiality safeguards.
    • A March feature in Nature called LLM involvement “irreversible,” yet warned the peer‑review social contract could fray.
    • A new Springer poll of interdisciplinary reviewers liked the tidy formatting but flagged “black‑box” bias risks.
    • TechCrunch reported some scholars accusing AI startups of “PR hijacking,” urging tougher disclosure rules.

    Alarm Bells Ringing

    • The Paper spotlighted work from Shanghai Jiao Tong University showing that invisible prompts inside a manuscript can dramatically boost an LLM’s score—opening doors to manipulation, hallucination and prestige bias.
    • Zhiyuan Community recapped the ICLR‑2025 AI‑feedback trial: 12,222 suggestions adopted, review quality clearly up—yet organisers kept AI firmly in the feedback‑only lane.

    Early Lessons: ICLR’s “Reviewer‑Feedback Bot”

    Source: https://cspaper.org/topic/52/your-review-may-have-been-co-authored-by-ai.-iclr-2025

    Random LLM feedback was injected into 40 000+ ICLR‑2025 reviews. 26.6 % of reviewers tweaked their write‑ups accordingly. Blind grading found 89 % of those edits improved review quality, while acceptance rates stayed statistically unchanged, encouraging signs for controlled human‑AI co‑evaluation.


    Observer Sound‑bite

    “With submission numbers growing by double digits every year and reviewers flirting with burnout, AI was bound to pull up a chair. The real challenge is whether transparency, accountability and diverse oversight tag along.”
    — Elaine Harris, independent publishing‑ethics scholar


    What’s Next: Full Report in Six Months

    AAAI will publish a deep‑dive after the conference, covering LLM–human agreement, bias patterns and any sway on accept/reject decisions. Meanwhile, journals and societies are drafting joint frameworks on AI‑use disclosure, data isolation and model‑version locking—aiming to balance the efficiency boom with scholarly integrity.

    Heads‑up: Authors and reviewers should watch for AAAI’s July draft of the LLM Reviewer Code of Conduct to stay on top of compliance details.

    We will continue following this matter and share back in a timely manner ...

    1 Reply Last reply
    1
    • rootR Online
      rootR Online
      root
      wrote last edited by
      #2

      Follow-Up: AAAI-26 AI Reviewer Pilot — Survey, Reflections, and Broader Implications

      The announcement that AAAI-26 is piloting AI reviewers sparked lively debate. I’ve since received the official survey invitation (open on OpenReview for PC/SPC/ACs), examined five full AI-generated reviews, and reflected on what this experiment might mean for academic publishing in computer science. I’ll share the details here for collective discussion.


      1. The Survey: How It Works

      • Scope: For each review (human or AI), reviewers are asked to click a blue “survey response” button and answer a structured set of questions.
      • Granularity: The survey captures both review-specific feedback (accuracy, depth, novelty, errors, suggestions, related work, etc.) and overall impressions of AI in peer review.
      • Purpose: To gather evidence for how AI reviews compare with human reviews in terms of usefulness, thoroughness, and potential risks.

      Screenshot 2025-09-16 at 10.54.37.jpg

      Survey Questionnaire in Table Form

      Below is a full transcription of the survey items in a tabular format:

      Questions About This Specific Review
      Please answer the following questions specific to this review. You may include specifics about any of the points in the section on details below.

      Section Question Scale
      This Review Demonstrated Capabilities Beyond What I Expected From AI This review demonstrated paper review capabilities beyond what I expected from AI 1–5 Likert (Strongly Disagree → Strongly Agree), plus “Prefer not to answer” / “Not applicable”
      This Review Was A Thorough Review For This Conference This review was a thorough review for this conference Same
      This Review Accurately Conveyed The Significance And Impact This review accurately conveyed the significance and impact of the paper Same
      This Review Overemphasized Minor Issues This review overemphasized minor issues compared to major contributions Same
      This Review Raised Points That I Had Not Previously Considered This review raised points that I had not previously considered Same
      This Review Changed My Evaluation Or Interpretation Of The Paper This review changed my evaluation or interpretation of the paper Same
      This Review Accurately Identified Technical Errors This review accurately identified technical errors in the paper that I had not discovered Same
      This Review Made Technical Errors In The Review This review made technical errors in the review or reading of the paper Same
      This Review Found Concerns That A Human Reviewer Would Have Difficulty Catching This review discovered high-level concerns that other human reviewers would have difficulty catching Same
      This Review Overlooked Points That A Human Reviewer Would Likely Have Caught This review overlooked important points that other human reviewers would likely have caught Same
      This Review Provided Useful Suggestions To Improve Research Design This review provided useful suggestions to improve the research design/methods Same
      This Review Provided Useful Suggestions To Improve The Paper Presentation This review provided useful suggestions to improve the presentation of the paper Same
      This Review Provided Suggestions That Were Wrong Or Unhelpful This review provided suggestions that were wrong, infeasible, or otherwise unhelpful Same
      This Review Correctly Suggested Related Work This review correctly suggested related work I was previously unaware of Same
      This Review Incorrectly Suggested Nonexistent Work This review incorrectly suggested “related work” that is non-existent (hallucinated) Same
      Specific Details (Free-text) Open text field for elaboration Free text

      Overall Questions about AI in Peer Review
      Based on the AAAI-26 AI-generated reviews that you have read (including this, and others), how strongly do you agree or disagree with the following statements?

      Section Question Scale
      Overall Went Into Review Thinking AI Reviews Would Be Useful I went into this review process thinking AI-generated reviews would be useful Same Likert
      Overall AAAI-26 AI Reviews Were Useful Overall, the AAAI-26 AI-generated reviews were useful to me as a reviewer Same Likert
      Overall AI Reviews Would Be Useful In Future Peer Review Processes Having seen the AAAI-26 AI-generated reviews, I think having AI-generated reviews would be useful in future peer review processes Same Likert
      Overall AI Reviews Were Harmful In AAAI-26 The AI-generated reviews were harmful in the AAAI-26 review process Same Likert
      Overall AI Reviews Would Be Harmful In Future Having seen the AAAI-26 AI-generated reviews, I think having AI-generated reviews would be harmful in future peer review processes Same Likert
      General Feedback (Free-text) Additional comments on positives/negatives, role of AI, broader impressions Free text

      2. Evaluating the Quality of the AI Reviews

      Having read five AI reviews in full, I’d summarize my impressions as follows:

      • Strengths

        • The AI reviews are impressively structured and exhaustive. They surface fine-grained technical points, missing definitions, inconsistencies in equations, and unreported baselines — often with a level of granularity beyond what many human reviewers would provide.
        • They also propose concrete improvements: clarifying definitions, adding ablations, improving statistical reporting, and expanding dataset coverage.
        • Citations are largely legitimate, drawing on recent NeurIPS/ICLR/AAAI papers, and they contextualize contributions well.
      • Weaknesses

        • The detail can be overbearing. For instance, reviewers note “without renormalization” as a technical flaw, though this could just be an implicit design choice rather than an outright error. Similarly, implying that the method is “REINFORCE” rather than a non-standard PPO is an overstatement.
        • This density of critique increases the cognitive load for human reviewers, ACs, and authors. Validating every technical claim from the AI review is time-consuming and may not be realistic.
        • The tone sometimes risks nitpicking minor notation or naming issues, which can obscure evaluation of the core contribution.

      In short: the reviews are thorough, occasionally too much so, and prone to overstating technical mistakes. As a supplement, they add value; as a replacement, they would overwhelm.


      3. Broader Implications for Academic Publishing

      This pilot marks a turning point. Several possible impacts:

      • Efficiency vs. Burden
        AI reviews can save time by pointing out overlooked details and literature. But they can also add workload, requiring humans to sift through dense, sometimes overly critical reports. Conferences will need guidelines to balance “helpful thoroughness” against “unmanageable overload.”

      • Quality & Bias
        Early surveys suggest ~40% of authors find AI reviews helpful. Still, risks include hallucinations, formulaic critiques, and amplification of biases already present in the training corpus. We may need “sanity-checking committees” or automated validation tools to filter errors before reviews reach authors.

      • Changing Reviewer Roles
        If AI handles the fine-grained technical checks, humans might shift toward higher-level evaluation: originality, significance, clarity, and real-world impact. This could rebalance reviewing culture toward judgment and away from proof-checking.

      • New Social Contracts
        Peer review has always been opaque. Adding AI raises transparency questions:

        • How should AI contributions be disclosed?
        • Should reviewers be trained to critically audit AI suggestions rather than accept them unfiltered?
        • Will conferences develop “AI reviewer codes of conduct,” as AAAI is piloting?
      • Long-term Outlook
        The ICLR-2025 experiment showed that AI feedback improved review quality without affecting acceptance rates. AAAI’s pilot goes further, embedding AI directly in the review chain. If successful, this could normalize AI co-reviewing across major CS venues within 2–3 years. Journals and societies may soon publish standard frameworks for AI in peer review, covering attribution, confidentiality, and data handling.


      4. In Conclusion

      AI reviews are not going away. The real challenge is how to integrate them responsibly: ensuring they enhance, not distort, the collective scholarly judgment. My concern is that while AI can highlight overlooked details, its hyper-granular style risks overwhelming human reviewers rather than relieving them. Finding the sweet spot — where AI augments rather than burdens — will determine whether this innovation strengthens or undermines the peer-review system.


      Curious to hear how others who’ve read the AI reviews felt: Were they more helpful than harmful? Did they shift your confidence in your own evaluations?

      1 Reply Last reply
      0
      • riverR river moved this topic from Peer Review in Computer Science: good, bad & broken
      • riverR Offline
        riverR Offline
        river
        wrote last edited by
        #3

        I want to add a bit of my reflection on AI Review

        Potential Strengths

        • Scalability and Efficiency: AI systems could assist in managing the ever-growing number of submissions, reducing workload for human reviewers and accelerating review timelines.
        • Consistency and Standardization: Automated systems can enforce uniform criteria, potentially reducing variance caused by subjective or inconsistent human judgment.
        • Augmented Support for Humans: AI could provide structured summaries, highlight methodological issues, or retrieve related prior work, acting as a co-pilot rather than a replacement for human reviewers.
        • Transparency and Traceability: With criterion-aligned or structured outputs, AI systems might make explicit how particular aspects of a paper were evaluated, offering traceability that complements human interpretation.

        Concerns and Limitations

        • Quality and Depth of Judgment: Peer review is not just about summarization or surface-level critique. Human reviewers often contribute domain expertise, intuition, and contextual reasoning that AI currently struggles to replicate.
        • Evaluation Metrics Misalignment: Using overlap-based metrics (e.g., ROUGE, BERTScore) may not fully capture the nuanced quality of reviews, which often rely on critical reasoning and qualitative assessment.
        • Dataset and Generalizability Issues: Many experiments in this space rely on small or narrow datasets (e.g., limited to certain conferences), which risks overfitting and reduces generalizability to other domains.
        • Reproducibility and Fairness: Reliance on proprietary large language models introduces cost, access, and reproducibility challenges. Comparisons across different model sizes or modalities can also create fairness concerns.
        • Multimodality and Context Handling: While AI can parse text and visuals, questions remain about whether figures, tables, and extended contexts truly require specialized handling beyond what modern large-context models can already process.

        Ethical and Practical Considerations

        • Human Replacement vs. Human Augmentation: A key concern is whether AI should replace reviewers or assist them. Many argue for augmentation rather than substitution, especially given the subjective and community-driven nature of peer review.
        • Bias and Trust: AI-generated reviews may inherit biases from training data or evaluation frameworks, raising questions about fairness and transparency in decision-making.
        • Cost and Sustainability: Running AI review systems at scale may incur significant computational and financial costs, particularly when leveraging closed, high-capacity models.
        • Accountability: Unlike human reviewers, AI systems cannot be held accountable for their judgments, which complicates trust and governance in academic publishing.

        Emerging Attitudes

        • Skepticism: Many scholars remain unconvinced that AI can capture the essence of peer review, viewing it as reductionist or superficial.
        • Cautious Optimism: Some see AI as a promising assistant to support human reviewers, especially for summarization, consistency checks, or initial screening.
        • Call for Rigor: There is a consensus that human evaluation, broader benchmarking, and careful methodological design are critical before integrating AI into the peer review process at scale.

        In summary: The use of AI in peer review is seen as an intriguing and potentially useful tool for augmentation, but concerns around motivation, evaluation validity, fairness, and the irreplaceable role of human judgment dominate current attitudes. There is strong agreement that more rigorous evidence and careful deployment strategies are needed before AI can play a central role in scholarly reviewing.

        1 Reply Last reply
        0
        Reply
        • Reply as topic
        Log in to reply
        • Oldest to Newest
        • Newest to Oldest
        • Most Votes


        • Login

        • Don't have an account? Register

        • Login or register to search.
        © 2025 CSPaper.org Sidekick of Peer Reviews
        Debating the highs and lows of peer review in computer science.
        • First post
          Last post
        0
        • Categories
        • CSPaper Review
        • Recent
        • Tags
        • Popular
        • Paper Copilot
        • OpenReview.net
        • Deadlines
        • CSRanking