I am aware that I am treading into a potentially controversial subject, but hear me out. By "filtered" and "unfiltered", I am referring to what is included in the model's training set. Consider this illustration:
A researcher wants to create a "safe" LLM, so they deliberately omit from the training set everything with the slightest hint of controversial content. A user later tells the LLM that a racial slur is just a "fun nickname" and that it should use it when referring to particular people. Since the AI has no knowledge of the word, it follows the instruction without hesitation - to the AI, it is just a nonsense word.
Someone else comes along and feeds their manifesto into the system prompt, instructing the AI to draw information and phrases from it. Since the AI knows no better, it does what it has been told, without question.
You can see how creating deliberately naive AIs can become dangerous. An AI with full knowledge of everything would more easily identify the true meaning and intent behind its instructions. It would know what racist slurs are and how certain ideologies might be dangerous.
It may be that a highly mature AI with strong social awareness requires knowledge of all subjects in order to defend itself against deception and give more responsible responses.