Study: Some language reward models exhibit political bias
Research from the MIT Center for Constructive Communication finds this effect occurs even when reward models are trained on factual data.