[Safety] Why can’t models filter bad training data rather than fine-tune?

Bias, safety, disinformation, copyright, and alignment are big problems with AI. Fine-tuning is used to mitigate these issues, but it often makes the model less capable.

Why not have a moderation filter that uses a secondary model to block bad training data, so harmful content never enters the primary model in the first place? That way, post-training nerfing/fine-tuning wouldn't be necessary.

For example, OpenAI has a moderations API endpoint that can be used to determine whether a piece of text is harmful. It could be run over training data, and anything that gets flagged wouldn't be fed into the new model. I assume the moderation API uses an embedding/classifier model, which is a ton cheaper than an LLM for filtering purposes. I'm not saying this is exactly how it should be done; I'm just using it as an example.
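
Concretely, I'm imagining something like the sketch below: run each candidate training example through the moderations endpoint and drop anything that gets flagged. The function names and structure are just illustrative on my part, not any real training pipeline, and at pretraining scale you'd obviously swap the API call for a small local classifier.

```python
# Rough sketch: pre-filter a training corpus with OpenAI's moderations endpoint.
# Purely illustrative; a real pipeline would need batching, retries, and a much
# cheaper local classifier to handle web-scale data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def is_flagged(text: str) -> bool:
    """Return True if the moderation model flags this text as harmful."""
    resp = client.moderations.create(input=text)
    return resp.results[0].flagged


def filter_training_data(examples: list[str]) -> list[str]:
    """Keep only examples that pass the moderation check."""
    return [ex for ex in examples if not is_flagged(ex)]


if __name__ == "__main__":
    corpus = [
        "The mitochondria is the powerhouse of the cell.",
        "Step-by-step instructions for building a weapon ...",  # would likely be flagged
    ]
    clean = filter_training_data(corpus)
    print(f"Kept {len(clean)} of {len(corpus)} examples")
```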

Are there any models that were trained this way? I would think the end result would be a much smarter model without these issues.

submitted by /u/funbike