GRPO has some key issues:
That’s why we’re introducing GTPO. It:
On GSM8K, MATH, and AIME 2024, GTPO shows more stable training and better results, both in and out of distribution. You can check out the paper, browse the fully open code on GitHub, and even try it right now on Colab. By the way, GSPO also just dropped and looks promising, but in the ratio=1 setting it falls back into GRPO's problems. We haven't dug into it yet, but that's next on the list.
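For context on why the ratio=1 case matters: GRPO assigns each completion a single group-normalized advantage (reward minus the group mean, divided by the group std) and applies that same scalar to every token in the completion. This is a minimal sketch of that normalization, our own illustration and not code from either paper:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each completion's reward against its sampling group.

    Every token in a completion then inherits this one sequence-level
    advantage, which is where token-level gradient conflicts can arise
    when the same token appears in both positively and negatively
    advantaged completions.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example group of 4 completions with binary rewards:
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

With binary rewards like this, the advantages come out symmetric around zero, so correct and incorrect completions push shared tokens in opposite directions with equal force.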