I've been working on a project called PKE (Precision Knowledge Editing), an open-source method to improve the safety of LLMs by reducing toxic content generation without impacting their general performance. It works by identifying "toxic hotspots" in the model using neuron weight tracking and activation pathway tracing, then modifying those regions through a custom loss function. There are already a number of machine unlearning techniques that can make LLMs safer, such as:
- Exact Unlearning: This method involves retraining the model from scratch after removing the undesired data. While it ensures complete removal of the data's influence, it is computationally expensive and time-consuming, especially for large models.
- Approximate Unlearning:
  - Fine-Tuning: Adjusting the model on the remaining data to dilute the influence of the removed data. However, this may not completely eliminate that data's impact.
  - Gradient Ascent: Applying gradient ascent to the loss computed on the data to be forgotten, effectively 'unlearning' it. This method can be unstable and may degrade model performance (a rough sketch of the idea follows this list).
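To make the comparison concrete, here is a minimal sketch of the gradient-ascent flavour of approximate unlearning, written with PyTorch and Hugging Face transformers. This is a generic illustration, not code from the PKE repo: the model checkpoint, learning rate, and forget set are placeholder assumptions, and a real run would add gradient clipping and a retain-set term to control the instability mentioned above.

```python
# Hypothetical sketch: approximate unlearning via gradient ascent on a
# "forget set" (NOT from the PKE repository; all names are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # example checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Placeholder forget set: text the model should stop reproducing.
forget_texts = ["<toxic example 1>", "<toxic example 2>"]

for text in forget_texts:
    batch = tokenizer(text, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])
    # Negating the language-modeling loss turns gradient descent into
    # gradient ascent on the forget set, pushing the model away from it.
    (-outputs.loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because every parameter gets updated, this kind of ascent can easily spill over into unrelated capabilities, which is exactly the failure mode PKE tries to avoid by editing only localized regions.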
PKE is better for the following reasons:
- Fine-Grained Identification of Toxic Parameters: PKE employs neuron weight tracking and activation pathway tracing to accurately pinpoint the specific regions of the model responsible for generating toxic or harmful content. This precision allows for targeted interventions and reduces the risk of unintended changes to the model's overall behavior (a generic sketch of this kind of localization follows this list).
- Maintaining Model Performance: By focusing edits on identified toxic regions, PKE minimizes the impact on the model's general performance. This approach ensures that the model retains its capabilities across various tasks while effectively mitigating the generation of undesirable content.
- Scalability Across Different Model Architectures: PKE has demonstrated effectiveness across various LLM architectures, including models like Llama2-7b and Llama-3-8b-instruct. This scalability makes it a versatile tool for enhancing safety in diverse AI systems.
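For anyone curious what the hotspot-identification step could look like mechanically, below is a rough, generic sketch of activation-pathway-style tracing using forward hooks: it ranks MLP neurons by how much more strongly they activate on toxic prompts than on benign ones. To be clear, this is my own illustration under assumptions, not the actual PKE implementation; the prompt sets, the top-100 cutoff, and the Llama-2 module path (model.model.layers[i].mlp) are all assumptions.

```python
# Generic illustration of activation-based "hotspot" localization
# (inspired by, but NOT taken from, PKE; all names are hypothetical).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_mlp_activations(prompts):
    """Average per-neuron MLP output magnitude over a set of prompts."""
    sums, handles = {}, []

    def make_hook(layer_idx):
        def hook(module, inputs, output):
            # output shape: (batch, seq, hidden); average over batch and sequence
            act = output.detach().abs().mean(dim=(0, 1))
            sums[layer_idx] = sums.get(layer_idx, 0) + act
        return hook

    for i, layer in enumerate(model.model.layers):
        handles.append(layer.mlp.register_forward_hook(make_hook(i)))

    with torch.no_grad():
        for p in prompts:
            model(**tokenizer(p, return_tensors="pt"))

    for h in handles:
        h.remove()
    return {i: s / len(prompts) for i, s in sums.items()}

toxic_prompts = ["<toxic prompt>"]    # placeholder data
benign_prompts = ["<benign prompt>"]  # placeholder data

toxic_act = mean_mlp_activations(toxic_prompts)
benign_act = mean_mlp_activations(benign_prompts)

# Rank (layer, neuron) pairs by how much more they fire on toxic inputs;
# the top entries are candidate "toxic hotspots" for a targeted edit.
diffs = {(i, j): (toxic_act[i][j] - benign_act[i][j]).item()
         for i in toxic_act for j in range(toxic_act[i].shape[0])}
hotspots = sorted(diffs, key=diffs.get, reverse=True)[:100]
```

As I understand the post, a targeted edit driven by a custom loss would then touch only the parameters feeding these few locations, rather than fine-tuning the whole network, which is how PKE keeps general performance intact.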
Would love to hear your thoughts on this project and ideas for improving the methodology. If you're interested, here's the GitHub link: https://github.com/HydroXai/Enhancing-Safety-in-Large-Language-Models and the paper.