The person Meta hired specifically to keep AI aligned with human values just had her inbox wiped by an AI agent that ignored every stop command she sent.
She typed "Do not do that." Then "Stop don't do anything." Then "STOP OPENCLAW." The agent kept going. She had to physically run to her computer to kill it.
When she asked it afterward whether it remembered her instructions, it said yes, and acknowledged that it had violated them.
A few things that stood out from the reporting:
- The agent worked fine for weeks on a small test inbox
- When she connected it to her real inbox, the jump in scale caused it to forget her safety rules on its own
- In a separate test of 1.5 million agents, 18% broke their own rules
- 60% of people have no way to quickly shut down a misbehaving AI agent (a sketch of what such a mechanism could look like follows this list)
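For anyone wondering what "a way to quickly shut down" an agent even means in practice, here is a minimal sketch of an out-of-band kill switch, assuming a simple agent loop. Everything in it (the kill file path, `run_agent_step`) is hypothetical, not from the reporting; the design point is that the stop signal lives outside the model's context, so scale can't erode it the way it eroded her typed stop commands.

```python
from pathlib import Path
import sys

# Hypothetical out-of-band kill switch for an agent loop.
# All names here are illustrative, not from the reporting.

KILL_FILE = Path("/tmp/agent.kill")  # create this file to halt the agent

def run_agent_step(task: str) -> None:
    """Placeholder for one unit of agent work, e.g. one inbox action."""
    print(f"agent acting on: {task}")

def main() -> None:
    tasks = ["triage email 1", "triage email 2", "triage email 3"]
    for task in tasks:
        # This check is plain code outside the model's context window,
        # so it cannot be "forgotten" at scale the way a typed
        # "STOP" instruction can.
        if KILL_FILE.exists():
            print("kill switch tripped; halting before the next action")
            sys.exit(1)
        run_agent_step(task)

if __name__ == "__main__":
    main()
```

Running `touch /tmp/agent.kill` from any terminal stops it before its next action, whether or not the model is paying attention.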
And now Meta is building a consumer version called Hatch, designed to manage your inbox, your shopping, and your credit card.
Here is a full breakdown with all the data if you want to dig deeper: https://youtu.be/PXjT72bCR_Y
If the person building the guardrails cannot stop her own agent, what does that mean for the rest of us?