GRPO has some key issues:
That’s why we’re introducing GTPO. It:
On GSM8K, MATH, and AIME 2024, GTPO shows more stable training and better results, both in and out of distribution. You can check out the paper, browse the fully open code on GitHub, and even try it right now on Colab. By the way, GSPO also just dropped and looks promising, but in the ratio=1 setting it falls back into GRPO's problems. We haven't dug into it yet, but that's next on the list.
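For context on why the ratio=1 case matters: GRPO assigns each completion a single group-normalized advantage (reward minus the group mean, divided by the group std) and applies that same scalar to every token in the completion. This is a minimal sketch of that normalization, our own illustration and not code from either paper:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against its sampling group.

    Every token in a completion then inherits this one sequence-level
    advantage, which is where token-level gradient conflicts can arise
    when the same token appears in both positively and negatively
    advantaged completions.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example group of 4 completions with binary rewards:
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

With binary rewards like this, the advantages come out symmetric around zero, so correct and incorrect completions push shared tokens in opposite directions with equal force.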