Alibaba's SWE-CI Benchmark Exposes a Critical Gap: AI Agents Break Working Code 75% of the Time During Maintenance
A new benchmark testing AI agents on real codebase evolution — not just greenfield tasks — finds that three-quarters of agent interventions introduce regressions. Only two models cross the 50% zero-regression threshold.