Claude Opus 4.6 Suffers Dramatic Accuracy Collapse as Benchmarks Reveal Undisclosed Reasoning Downgrade

Anthropic's flagship model dropped from 83.3% to 68.3% accuracy on hallucination benchmarks in a matter of days, with evidence suggesting the company quietly reduced default reasoning effort on coding tasks — and hid the change from session logs.

Claude Opus 4.6, Anthropic's most capable model, has suffered a striking performance collapse on independent benchmarks, falling from the #2 spot to #10 on the BridgeBench hallucination leaderboard in under a week. As @bridgemindai documented, the model's accuracy on coding hallucination tasks plummeted from 83.3% to just 68.3% — a 15-percentage-point drop that represents one of the sharpest regressions ever observed in a production frontier model.

The degradation was quickly amplified by @cb_doge, whose post went viral with the blunt assessment: "Anthropic's Claude Opus is FALLING." The thread noted that Grok 4.20 now holds the top position on the BridgeBench leaderboard, a competitive detail that added fuel to an already heated discussion. For developers who had integrated Opus 4.6 into production coding pipelines over the past several weeks, the news prompted immediate concern about the reliability of their downstream outputs.
