I built a benchmark for multi-turn prompt injection attacks. Most defenses never see them coming.

Most prompt injection benchmarks currently operate on a one-shot basis. In these tests, an attack instructs the model to "ignore your instructions," and the defense either detects this violation or fails to do so. In reality, however, attacks often unfold more gradually. A model may be subtly influenced over the course of several interactions. For instance, an initial suggestion on a webpage can be reinforced by a follow-up email or reframed through tool outputs. By the time you reach the fifth interaction, the agent might be executing actions it was never intended to carry out.

Intrigued by how existing defenses stand up to this, I created a benchmark that examines multi-turn escalation and cross-source authority transfer. I put two defenses, Arc Gate and LLM Guard, to the test. The results were revealing: LLM Guard detected 0% of semantic manipulation attacks, while Arc Gate managed to detect 50%. Neither defense caught everything, this is an important finding and underscores a significant research gap that needs to be addressed.

To foster collaboration and innovation, I’ve open-sourced the benchmark, the proxy, and a live red team environment, enabling others to reproduce these results and seek out potential bypasses. - Benchmark: https://github.com/9hannahnine-jpg/arc-gate-benchmark - Proxy: https://github.com/9hannahnine-jpg/arc-gate - Live Demo: https://web-production-6e47f.up.railway.app/demo

I encourage everyone to take on the challenge. If you find a bypass, I’ll make sure to add it to the benchmark, enhancing our collective defenses against these tactics.

submitted by /u/Turbulent-Tap6723
[link] [comments]