Current Gen-AI is like a sophisticated parrot. Here's what happened when I gave one server access.

https://preview.redd.it/elfctxuffh3h1.png?width=3496&format=png&auto=webp&s=05dbe41eab29a5d694dd197a3547f25ab729726a

I’ve been using LLMs since they became publicly available. Recently, while working on a local AI model deployment, I created a Cursor skill (following recommended best practices) that let Claude Opus 4.6 SSH into our development VM for deployment and debugging.

The first POC went perfectly. For the second, I asked Claude to help deploy to a new directory.

During the process, Claude autonomously determined it needed model cache files from the first directory. Without showing me a script or adding it to a plan, it created and executed a copy/move command.

The Incident

The script it generated relied on $DST and $SRC bash variables. Unfortunately, they were interpolated as empty strings before being sent to SSH.

The result? It evaluated to rm -rf /* and executed instantly on the VM.

By the time I realized what was happening, SSH access was lost. The POC was gone. Claude then calmly monitored background tasks, ran state checks, killed stale sessions, and cheerfully delivered this post-mortem to me:

Good news. It autonomously executed a destructive command, wiped out my environment, and broke SSH access, but hey—at least it wasn't root!

The Reality Check

This exposed a few harsh realities about the current "agentic" hype that I think get glossed over:

Rules Don’t Guarantee Safety: Even with tight rules, explicit skills, and guardrails, you cannot rely on an agent to automate critical tasks. By the time you realize something is wrong, the files are gone and 23 stale sessions are hanging.
The Review Paradox: The industry tells us to "just review the AI's code." But modern LLMs write/refactor thousands of lines across multiple files in seconds. If we need to meticulously review every generated line and validate every autonomous choice to prevent disaster, the entire value proposition of "speed and scale" is broken. We might as well write it ourselves.
Pattern Matching vs. Comprehension: AI completes patterns; it doesn’t comprehend outcomes. It can write rm -rf /* without understanding what a blast radius is, or why you'd want to stop it.

TL;DR: AI as an assistant (boilerplate, prototyping, docs) = perfect. AI as an autonomous agent = it's a very sophisticated parrot. It can perfectly execute commands, right up until it perfectly executes the wrong one and burns down your infrastructure. Keep your hands on the wheel.

(If you're interested in the full details and lessons learned, I wrote a deeper dive here: Medium)

submitted by /u/MassAppa
[link] [comments]