SciCoQA: Quality Assurance for Scientific Paper-Code Alignment

Tim Baumgärtner; Iryna Gurevych

Abstract

Discrepancies between scientific papers and their code undermine reproducibility, a concern that grows as automated research agents scale scientific output beyond human review capacity. Whether LLMs can reliably detect such discrepancies has not been systematically measured. To this end, we present SciCoQA, a dataset of 635 paper-code discrepancies (92 real, 543 synthetic) for this cross-modal verification task. Across 22 evaluated models, even the best-performing LLMs, Gemini 3.1 Pro and GPT-5 Mini, detect only 46.7% of real-world discrepancies, revealing a critical gap in automated scientific quality assurance. We construct SciCoQA from GitHub issues and reproducibility papers, and propose a synthetic generation pipeline to extend the dataset beyond AI to Physics, Quantitative Biology, and other computational sciences. We further introduce a taxonomy of discrepancy types and categories to characterize the mismatches that occur. Our analysis shows that models particularly struggle with omitted paper details, long-context inputs, and papers outside their pre-training corpus.

Keywords

SciCoQA; Scientific Reproducibility; Paper-Code Alignment; Research Quality Assurance; LLM Evaluation; Benchmark Datasets; Scientific Error Detection; Computational Science

External Source

This is an externally sourced paper. It was originally published independently.