Alibaba Researchers Propose Reward Learning on Policy (RLP): An Unsupervised AI Framework that Refines a Reward Model Using Policy Samples to Keep it on-Distribution – MarkTechPost
Alibaba Researchers Propose Reward Learning on Policy (RLP): An Unsupervised AI Framework that Refines a Reward Model Using Policy Samples to Keep it on-Distribution – MarkTechPost