GLM-5 has 744B parameters and scores worse on MMLU-Pro than a 9B model

Tier lists make S-tier and D-tier feel like entirely different categories of thing: red box at the top, blue box at the bottom. I actually plotted named models by parameter count against MMLU-Pro score instead of trusting the tier labels, and the picture is a lot messier than "bigger tier = bigger gap."

Qwen3.5-9B scores 82.5% on MMLU-Pro. GLM-5, at 744B parameters (about 83x the size), scores 70.4%. That's not a diminishing-returns curve, that's negative returns; the 9B model beats the 744B model outright on this specific benchmark. Gemma 3 12B sits at 60.0%, while Qwen3.5-4B, a third of its size, scores 79.1%, almost 20 points higher.

Where the "you're paying a parameter tax" pattern does hold cleanly: GPT-oss 120B (117B params) hits 90.0%, the single highest score in the whole table, beating Kimi K2.5's 1000B parameters (87.1%) and DeepSeek R1's 671B (84.0%) while running at roughly 12% and 17% of their respective sizes. GLM-4.7 at 355B scores 84.3%, effectively tied with DeepSeek R1's 671B despite being about half the size.
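If you want to check the size ratios and score gaps yourself, here's a quick sketch. It only uses the parameter counts and MMLU-Pro scores quoted in this post; the `MODELS` dict and `compare` helper are names I made up for illustration, not anything from a benchmark library.

```python
# (params in billions, MMLU-Pro %) exactly as quoted in the post.
MODELS = {
    "Qwen3.5-4B": (4, 79.1),
    "Qwen3.5-9B": (9, 82.5),
    "Gemma 3 12B": (12, 60.0),
    "GPT-oss 120B": (117, 90.0),
    "GLM-4.7": (355, 84.3),
    "DeepSeek R1": (671, 84.0),
    "GLM-5": (744, 70.4),
    "Kimi K2.5": (1000, 87.1),
}


def compare(small, big):
    """Return (size of `small` as a fraction of `big`, score gap in points)."""
    small_params, small_score = MODELS[small]
    big_params, big_score = MODELS[big]
    return small_params / big_params, small_score - big_score


# The pairs discussed above: smaller model first, larger model second.
PAIRS = [
    ("Qwen3.5-9B", "GLM-5"),
    ("Qwen3.5-4B", "Gemma 3 12B"),
    ("GPT-oss 120B", "Kimi K2.5"),
    ("GPT-oss 120B", "DeepSeek R1"),
    ("GLM-4.7", "DeepSeek R1"),
]

for small, big in PAIRS:
    frac, gap = compare(small, big)
    print(f"{small} vs {big}: {frac:.1%} of the size, {gap:+.1f} points")
```

Every pair prints a positive gap, meaning the smaller model wins each of these comparisons.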

So the actual claim isn't "bigger always plateaus," it's that above roughly 100-150B, parameter count stops predicting score at all.
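One rough way to sanity-check that is a rank correlation between parameter count and score for the big models only. This is a sketch on just the five 100B+ models named in this post, which is far too few points to prove anything; the `spearman` and `rho_above` helpers are hand-rolled for illustration and assume no tied values.

```python
# (name, params in billions, MMLU-Pro %) for the 100B+ models quoted in the post.
BIG_MODELS = [
    ("GPT-oss 120B", 117, 90.0),
    ("GLM-4.7", 355, 84.3),
    ("DeepSeek R1", 671, 84.0),
    ("GLM-5", 744, 70.4),
    ("Kimi K2.5", 1000, 87.1),
]


def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def rho_above(cutoff_b):
    """Correlation of params vs score for models at or above `cutoff_b` billion."""
    rows = [(p, s) for _, p, s in BIG_MODELS if p >= cutoff_b]
    return spearman([p for p, _ in rows], [s for _, s in rows])


print(f"cutoff 100B: rho = {rho_above(100):+.2f}")
print(f"cutoff 150B: rho = {rho_above(150):+.2f}")
```

With a 100B cutoff the correlation comes out around -0.4, and with a 150B cutoff (which drops GPT-oss 120B) it flips to about +0.2. A sign that changes when you remove one model is about what you'd expect if size isn't telling you much in this range.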

But I guess you win some, you lose some.

Can't have it all.

submitted by /u/Bruno_Bot1707