Great points! OpenAI’s new PaperBench shows how tough reproducibility still is in ML. It asked AI agents to replicate 20 ICML 2024 papers from scratch. Even the best model only got 21%, while human PhDs reached 41.4%.
[image: 1743714483369-screenshot-2025-04-03-at-23.07.45-resized.png]
What stood out is how they worked with authors to define 8,000+ fine-grained tasks for scoring. It shows we need better structure, clearer standards, and possibly LLM-assisted tools (like their JudgeEval) to assess reproducibility at scale.
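The hierarchical rubric idea is worth dwelling on: thousands of small pass/fail checks rolled up into a single replication score. A minimal sketch of that kind of weighted rubric tree (all names and weights here are illustrative, not PaperBench's actual rubric format):

```python
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    """One node in a hierarchical grading rubric (illustrative, not PaperBench's schema)."""
    name: str
    weight: float = 1.0
    passed: bool = False                 # leaf verdict from a human or LLM judge
    children: list["RubricNode"] = field(default_factory=list)

    def score(self) -> float:
        # Leaf node: binary pass/fail.
        if not self.children:
            return 1.0 if self.passed else 0.0
        # Internal node: weighted average of child scores.
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total

# Toy rubric: two graded sub-goals under "code runs", plus a results check.
rubric = RubricNode("paper", children=[
    RubricNode("code-runs", weight=2, children=[
        RubricNode("training-script-executes", passed=True),
        RubricNode("eval-script-executes", passed=False),
    ]),
    RubricNode("results-match", weight=1, passed=False),
])
print(round(rubric.score(), 3))  # → 0.333
```

The appeal of this structure is that an LLM judge only ever grades narrow leaf questions, while the tree handles aggregation, which is roughly what makes automated grading at PaperBench's scale feasible.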
Maybe it’s time to build structured reproducibility checks into peer review; tools like PaperBench give us a concrete way forward.
Check out the GitHub repo: https://github.com/openai/preparedness