Self-consistency Sampling Enhances Outcome-reward-based Reinforcement Learning of Multimodal LLMs, Correcting Unfaithful Trajectories – Quantum Zeitgeist