AI Agents Still Can't Do Science: Stargazer Benchmark Exposes Discovery Gap

A new benchmark testing AI agents on scientific hypothesis generation and physical law recovery finds they fail at the creative leaps required for genuine discovery.

Researcher @ZhijingJin published the Stargazer benchmark, which tests AI agents on scientific discovery tasks: generating hypotheses, designing experiments, and recovering physical laws from data. The results are sobering. Current frontier models can predict outcomes within known frameworks but consistently fail at the creative, abductive reasoning that characterizes genuine scientific breakthroughs.

