This new paper poses a real threat to scaling RL

https://www.arxiv.org/abs/2504.13837
One finding of this paper is that as we scale RL, there are problems the model gets progressively worse at solving. GRPO and other exact-reward RL methods get stuck in local optima because they explore far less than search-based methods like MCTS. This means that simply scaling RL with methods like GRPO won't solve all problems.
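To make the exploration point concrete, here is a minimal sketch of GRPO-style group-relative advantage normalization (the function name and structure are my own illustration, not the paper's code). With a sparse 0/1 exact-match reward, a group whose samples all fail (or all succeed) produces zero advantage for every sample, so the policy gets no gradient signal — one mechanism by which it can stall near a local optimum:

```python
def grpo_advantages(rewards, eps=1e-8):
    """Normalize a group of rewards to zero mean, unit std (GRPO-style).

    If every sample in the group gets the same reward (e.g. all 0 under
    an exact-match reward), every advantage is ~0 and no learning occurs.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, a mixed group `[1, 0, 0, 1]` yields advantages near `[1, -1, -1, 1]`, while an all-failure group `[0, 0, 0, 0]` yields all zeros — the model cannot learn from problems it never solves, no matter how many groups you sample.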

The premise of solving all problems with RL remains theoretically feasible, provided exploration is high enough that methods don't get stuck in local optima. The crux is that the current paradigm doesn't use such methods yet (at least none that I or the paper's authors are aware of).

I highlighted these results from the paper, although its main focus was on the model's reasoning ability being constrained by the base model's capacity. I don't believe this is much of a problem: base models are stochastic and could, in theory, solve almost any problem given enough passes k (think of the Library of Babel). RL, then, is just about reducing the number of passes k needed to solve a problem reliably. So, say we needed k = 100,000,000 passes to figure out relativity theory given Einstein's priors before he worked it out; RL could, in theory, reduce this to k = 1. The problem is that current methods won't get you from k = 100,000,000 to k = 1, because they get stuck in local optima, and k can increase instead of decrease.
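The pass@k metric the paper relies on can be computed with the standard unbiased estimator (this is the widely used formula from the Codex-style evaluation literature, not something specific to this paper): given n sampled solutions of which c are correct, it estimates the probability that at least one of k draws is correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples (drawn without replacement from n total, c of them correct)
    is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k draws: a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

On this view, RL "working" means pushing pass@1 toward pass@k of the base model; the paper's worry is that current RL narrows the distribution instead, so pass@k at large k can end up below the base model's.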

submitted by /u/PianistWinter8293