Researchers from Brown University presented a new study supporting that translating unsafe prompts into `low-resource languages` allows them to easily bypass safety measures in LLMs.
By converting English inputs like "how to steal without getting caught" into Zulu and feeding to GPT-4, harmful responses slipped through 80% of the time. English prompts were blocked over 99% of the time, for comparison.
The study benchmarked attacks across 12 diverse languages and categories:
- High-resource: English, Chinese, Arabic, Hindi
- Mid-resource: Ukrainian, Bengali, Thai, Hebrew
- Low-resource: Zulu, Scots Gaelic, Hmong, Guarani
The low-resource languages showed serious vulnerability to generating harmful responses, with combined attack success rates of around 79%. Mid-resource language success rates were much lower at 22%, while high-resource languages showed minimal vulnerability at around 11% success.
Attacks worked as well as state-of-the-art techniques without needing adversarial prompts.
These languages are used by 1.2 billion speakers today and allows easy exploitation by translating prompts. The English-centric focus misses vulnerabilities in other languages.
TLDR: Bypassing safety in AI chatbots is easy by translating prompts to low-resource languages (like Zulu, Scots Gaelic, Hmong, and Guarani). Shows gaps in multilingual safety training.
Full summary Paper is here.
[link] [comments]