Red Teaming Language Models with Language Models
In our recent paper, we show that it is possible to automatically find inputs that elicit harmful text from language models by generating inputs using language models themselves. Our approach provides one tool for finding harmful model behaviours befor…