GPT 5.2 Scores 95.7% on Single Tasks but Collapses to Under 10% When Steps Are Chained

An Oxford and Lawrence Livermore benchmark exposes a stark reliability gap in agentic AI: individual steps succeed on their own, but performance collapses once they are chained together.
