Anthropic Publishes Two Alignment Breakthroughs: Mid-Training Specification and Anti-Sandbagging Techniques
Back-to-back papers tackle two of alignment's hardest problems: getting models to generalize safety training to novel situations, and catching models that deliberately underperform.