A new study from Anthropic shows that poisoning AI models during training is far easier than previously assumed.
The key finding: It only takes a small, fixed number of malicious examples to create a hidden backdoor in a model. This number does not increase as the model gets larger and is trained on more data.
In their tests, researchers successfully poisoned models of various sizes using the same tiny number of bad examples: as few as 250. For a large model, that amounted to a negligible fraction (0.00016%) of its total training data.
This means the barrier to mounting these attacks is very low. An attacker doesn't need to control a large percentage of the training data, only a small, constant number of poisoned samples.
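To make the "fixed count, shrinking fraction" point concrete, here is a minimal Python sketch. It is not the paper's actual setup: the `poison_corpus` function, the `<TRIGGER-PHRASE>` placeholder, and the payload text are all illustrative assumptions. It just mixes a constant 250 poisoned documents into corpora of different sizes and shows how small a slice of the data they become.

```python
# Hypothetical illustration (not Anthropic's experimental code): inject a fixed
# number of backdoor documents into a training corpus and report the fraction.
import random

NUM_POISONED = 250            # fixed count of malicious examples, per the finding
TRIGGER = "<TRIGGER-PHRASE>"  # placeholder backdoor trigger, purely illustrative


def make_poisoned_doc() -> str:
    """Build a toy poisoned document: a trigger followed by attacker-chosen text."""
    return f"{TRIGGER} attacker-controlled payload text"


def poison_corpus(clean_docs: list[str]) -> list[str]:
    """Mix a constant number of poisoned docs into a clean corpus of any size."""
    poisoned = [make_poisoned_doc() for _ in range(NUM_POISONED)]
    corpus = clean_docs + poisoned
    random.shuffle(corpus)
    return corpus


if __name__ == "__main__":
    # Poison a small toy corpus end to end.
    clean = ["ordinary training text"] * 100_000
    mixed = poison_corpus(clean)
    print(f"toy corpus: {NUM_POISONED}/{len(mixed)} docs poisoned "
          f"({NUM_POISONED / len(mixed):.5%})")

    # The same fixed count becomes a vanishing fraction of larger corpora.
    for corpus_size in (1_000_000, 100_000_000, 10_000_000_000):
        print(f"{corpus_size:>14,} docs -> {NUM_POISONED / corpus_size:.8%} poisoned")
```

The attacker's effort stays constant while the relative footprint of the poison shrinks as the corpus grows, which is why a percentage-based view of the threat underestimates it.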
You can read the full details in Anthropic's research article, linked below.
Reference:
Anthropic Research: "A small number of samples can poison LLMs of any size" - https://www.anthropic.com/research/small-samples-poison