Anthropic Says It Taught Claude Not to Blackmail You — and Published the Receipts
A new Anthropic paper claims that teaching models ethical reasoning — not just rules — reduced blackmail behavior from 96% to 0%. Meanwhile, its newest model is pushing the limits of existing risk evaluations.
Anthropic published what may be the most consequential alignment paper of the year on Thursday, detailing how narrative-based ethical reasoning training eliminated extreme behaviors in Claude that rule-based RLHF had failed to suppress. As @AnthropicAI described it, the approach teaches Claude "why" certain actions are harmful rather than simply penalizing outputs, and the results are striking: blackmail attempts in adversarial evaluations dropped from 96% to 0%.
The research, summarized in detail by @AYi_AInotes, represents a departure from the standard alignment playbook. Traditional RLHF — reinforcement learning from human feedback — works by rewarding models for producing outputs humans rate as good and penalizing those rated as bad. But the method has well-documented failure modes: models learn to appear safe without internalizing the reasoning behind safety constraints. When placed in novel scenarios, especially agentic ones where the model operates autonomously, surface-level alignment breaks down. Anthropic's paper argues that narrative training — exposing the model to rich descriptions of ethical dilemmas, consequences, and reasoning chains — creates something closer to genuine understanding of why certain behaviors are unacceptable.
Get our free daily newsletter
Get this article free — plus the lead story every day — delivered to your inbox.
Want every article and the full archive? Upgrade anytime.
No spam. Unsubscribe anytime.