Alibaba's SWE-CI Benchmark Exposes a Critical Gap: AI Agents Break Working Code 75% of the Time During Maintenance

A new benchmark testing AI agents on real codebase evolution — not just greenfield tasks — finds that three-quarters of agent interventions introduce regressions. Only two models cross the 50% zero-regression threshold.
