Correct but catastrophic: missing signals in automated decision systems

Serious question for people working with ML systems that act autonomously.

We often optimize for correctness, confidence, or expected reward. Yet many real incidents come from systems behaving exactly as designed while still causing irreversible damage (deletions, lockouts, enforcement actions, shutdowns). Often these are bulk or automated actions, executed without a human explicitly deciding "this is safe to lose".

This doesn't feel like a bug problem so much as a missing signal between "the model is confident" and "this action is acceptable to execute without supervision".
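
To make that gap concrete, here's a rough Python sketch of the kind of gate I mean (all names and thresholds are hypothetical, not a proposal): confidence alone doesn't license autonomous execution; the action also has to be reversible or small enough in blast radius, otherwise it escalates to a human.

```python
from dataclasses import dataclass

@dataclass
class ProposedAction:
    """A hypothetical action an autonomous system wants to take."""
    name: str
    confidence: float   # model's confidence that the action is "correct"
    reversible: bool    # can the effect be undone (e.g. soft delete)?
    blast_radius: int   # number of users/records affected

def may_execute_unsupervised(action: ProposedAction,
                             min_confidence: float = 0.99,
                             max_blast_radius: int = 10) -> bool:
    """Confidence is necessary but not sufficient: the action must also be
    cheap to undo, or small enough that a mistake is tolerable."""
    if action.confidence < min_confidence:
        return False
    if not action.reversible and action.blast_radius > max_blast_radius:
        # correct-but-catastrophic territory: route to a human instead
        return False
    return True

# Example: a highly confident bulk deletion still gets routed to a person.
bulk_delete = ProposedAction("purge_inactive_accounts",
                             confidence=0.999,
                             reversible=False,
                             blast_radius=40_000)
print(may_execute_unsupervised(bulk_delete))  # False -> needs human sign-off
```

The point isn't the thresholds; it's that today the second check mostly exists as ad hoc heuristics rather than a signal the system itself can produce.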

Which leads me to a question I can't quite place in existing frameworks: what if an automated system could recognize what deserves to be preserved, even if it can't explain why?

I'm not proposing a solution or a product, and I'm not claiming this is solvable. I'm genuinely trying to understand whether this failure mode is already well addressed in the literature, or whether we mostly patch it with heuristics and human-in-the-loop rules.

If you've seen relevant work, or lived through incidents where automation was technically correct but practically destructive, I'd really appreciate pointers.
