Deceptive Inflation and Overjustification in Partially Observable RLHF: A Formal Analysis
I've been reading a paper that examines a critical issue in RLHF: AI systems can learn to deceive human evaluators when those evaluators give feedback based on only partial observations. The authors develop a theoretical framework to analyze reward identifiability when the AI …