LLMs keep getting more capable at generating natural language. But there's always room for improving the quality and alignment of their responses.
Typically this requires lots of human effort to collect more training data. So researchers are exploring ways for models to self-improve without human involvement.
Many methods use prompting - giving the LLM instructions to critique and refine its responses. But coming up with comprehensive prompts is challenging.
The new approach proposed, called PIT, lets models learn self-improvement implicitly from human preference data instead. It reformulates reinforcement learning to maximize the gap between an original response and improved response conditioned on the original.
This taps into the implicit guidance in the preference data on what constitutes better quality, so no manual rubrics are needed. PIT uses curriculum reinforcement learning - first improving easy references, then switching to the LLM's own samples.
Experiments on real and synthetic datasets show PIT significantly outperforms prompting methods like Self-Refine. It improved response quality 7-34% across conditions without any human involvement.
This demonstrates a promising direction for LLMs to align better with human preferences autonomously as they learn from experience. No need for human bottlenecks when expanding to new domains or underserved use cases. Very cool!
TLDR: New method PIT enables LLMs to implicitly learn to refine themselves from human preference data, no prompts needed. Big improvement over prompting approaches.
Arxiv is here: https://arxiv.org/abs/2310.00898
[link] [comments]