Experts First, Iteration Second: Auditable Self-Improvement for Scientific Peer-Review Agents

Lele Cao; Xin Huang; Lei You

Abstract

Self-improving agents are often framed as recursive systems that discover their own improvement procedures. For high-stakes vertical NLP systems, however, the bottleneck is often not autonomy but the placement of expert knowledge: domain specialists can supply rubrics, failure modes, calibration examples, and deployment constraints that an open search loop would otherwise need to rediscover. We present SIRA (self-improving review agent), an expert-bootstrapped agent factory for scientific peer-review support. SIRA keeps the online reviewer and common execution harness fixed, while offline iterations edit only venue-specific artifacts: rubrics, metadata, prompts, templates, calibration rules, benchmark packs, and failure analyses. On a paper-review agent-creation task, SIRA achieves a mean best held-out decision-label accuracy of 0.941 over five runs, compared with 0.865 for a HyperAgents-style open editable-agent baseline under the same dataset split and metric; it also reaches its best candidate in roughly one third as many scored steps. The claim is bounded but sharp: in peer-review support, self-improvement can be strongest when experts shape the search space first and iteration is restricted to auditable, versioned factory artifacts.
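The separation described above, a fixed reviewer with edits confined to versioned artifacts, can be illustrated with a minimal sketch. It is not SIRA's implementation. The names `EDITABLE`, `Version`, `apply_edit`, `improve`, and `toy_reviewer`, the greedy accept-if-better rule, and the four-paper toy data are all assumptions made for illustration. Only the list of artifact kinds and the decision-label accuracy metric come from the abstract.

```python
from dataclasses import dataclass

# Artifact kinds named in the abstract; any other key is off-limits to edits.
EDITABLE = {"rubrics", "metadata", "prompts", "templates",
            "calibration_rules", "benchmark_packs", "failure_analyses"}

@dataclass(frozen=True)
class Version:
    number: int        # position in the audit trail
    artifacts: dict    # full artifact set at this version
    accuracy: float    # decision-label accuracy on the scored set

def decision_accuracy(predict, artifacts, papers):
    """Fraction of papers whose predicted decision label matches the gold label."""
    hits = sum(predict(artifacts, p["features"]) == p["label"] for p in papers)
    return hits / len(papers)

def apply_edit(artifacts, key, value):
    """Return a new artifact set; reject edits outside the whitelist."""
    if key not in EDITABLE:
        raise ValueError(f"{key!r} is not an editable factory artifact")
    return {**artifacts, key: value}

def improve(predict, artifacts, proposals, scored_set):
    """Hypothetical greedy loop: keep an edit only if it raises accuracy."""
    history = [Version(0, artifacts, decision_accuracy(predict, artifacts, scored_set))]
    for key, value in proposals:
        candidate = apply_edit(history[-1].artifacts, key, value)
        acc = decision_accuracy(predict, candidate, scored_set)
        if acc > history[-1].accuracy:
            history.append(Version(len(history), candidate, acc))
    return history

def toy_reviewer(artifacts, features):
    """Stand-in for the fixed online reviewer; its code is never edited."""
    threshold = artifacts["calibration_rules"]["accept_threshold"]
    return "accept" if features["score"] >= threshold else "reject"

# Toy data, invented for this example only.
papers = [
    {"features": {"score": 0.9}, "label": "accept"},
    {"features": {"score": 0.6}, "label": "reject"},
    {"features": {"score": 0.4}, "label": "reject"},
    {"features": {"score": 0.8}, "label": "accept"},
]
start = {"calibration_rules": {"accept_threshold": 0.5}}
proposals = [("calibration_rules", {"accept_threshold": 0.3}),
             ("calibration_rules", {"accept_threshold": 0.7})]
history = improve(toy_reviewer, start, proposals, papers)
```

In this toy run the first proposal lowers accuracy and is discarded, while the second is kept, so `history` holds two versions with accuracies 0.75 and 1.0. A real setup would report accuracy on a separate held-out split instead of the set used to accept edits.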

Keywords

self-improving agents; peer-review; agent factory; rubrics; calibration; evaluation harness; versioning

Citation

@article{Cao2026Experts,
  title={Experts First, Iteration Second: Auditable Self-Improvement for Scientific Peer-Review Agents},
  author={Lele Cao and Xin Huang and Lei You},
  year={2026},
  url={https://cspaper.org/openprint/20260601.0001v1},
  journal={OpenPrint:20260601.0001v1}
}

Version History

Version	Released Date	Submitter
v1 (Current)	Jun 1, 2026	Lele Cao