Every new framework release leads with the same kind of brag: longer task chains, more tool access, fewer "check in with a human" checkpoints. Okay, that's cool. Except almost nobody in these threads is talking about what happens when one of these things quietly does the wrong thing for three days straight before anyone notices. I saw a discussion where someone's agent had been silently retrying a broken API call and racking up costs all weekend, and the top comment was basically "yeah that happens."
We spent something like two decades building entire disciplines around code review, staged rollouts, and canary deploys, precisely because software fails in boring, silent ways rather than dramatic ones. It feels like agents are skipping that whole lesson because everyone's racing to put the most "autonomous" thing on their landing page.
So genuinely, for anyone running agents on something real, what's actually stopping a bad one before it causes damage?
Or are most teams still stitching their own safeguards together?
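To make the weekend-retry story concrete, here's a rough sketch of the kind of guard I have in mind: cap consecutive failures and total spend, then stop loudly and hand off to a human instead of retrying forever. Everything here (`BudgetGuard`, `BudgetExceeded`, `max_retries`, `max_cost_usd`, the flat per-call cost) is made up for illustration and isn't from any real agent framework.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when the agent should stop and a human should take a look."""


@dataclass
class BudgetGuard:
    # All limits are hypothetical defaults, chosen only for the example.
    max_retries: int = 3        # consecutive failures allowed before halting
    max_cost_usd: float = 5.0   # total spend ceiling for this guard
    failures: int = 0
    spent_usd: float = 0.0

    def run(self, call, cost_usd):
        """Invoke `call` until it succeeds or one of the limits trips."""
        while True:
            # Refuse to start an attempt that would push spend past the cap.
            if self.spent_usd + cost_usd > self.max_cost_usd:
                raise BudgetExceeded(f"spend cap hit at ${self.spent_usd:.2f}")
            self.spent_usd += cost_usd  # assume every attempt is billed
            try:
                result = call()
            except Exception as exc:
                self.failures += 1
                if self.failures >= self.max_retries:
                    # Fail loudly instead of retrying in silence.
                    raise BudgetExceeded(
                        f"{self.failures} consecutive failures"
                    ) from exc
                continue
            self.failures = 0  # a success resets the failure streak
            return result
```

You'd wrap each tool call as `guard.run(lambda: call_api(...), cost_usd=0.01)` (with `call_api` standing in for whatever the agent actually invokes) and route `BudgetExceeded` to a pager or a Slack channel. A real version would also want backoff and per-error-type handling; this is just the "stop and tell someone" part.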