My Take on Ilya’s Interview: A path forward for RL

A while back I posted about some fundamental problems facing the current paradigm, and the post got some backlash. In light of Ilya's latest interview, I think things have become clearer.

The way RL is done currently is not enough to reach AGI. Researchers have to set up specific RL environments, which costs a lot of time and effort, just so models get good at these few specified axes. These axes happen to be aligned with eval performance, which gives a brittle feel to a model's capabilities.

This is something that cannot be fixed with scale, since the bottleneck is how many of these RL environments can be created, which is a product of human labor, not of scale. Remember, though, that before self-supervised learning we had the exact same scenario with supervised learning, where researchers had to manually set up learning tasks. However, once we figured out how to utilize scale, it opened up all the developments we have now.

We are thus waiting for the self-supervised moment for RL. Ilya already hinted at this with evaluation functions, and drawing inspiration from biology we can find some plausible solutions. For example, when a dog gets a treat for doing a trick, it is more likely to perform that trick again. This is similar to the RL we have now, where actions that lead to reward are reinforced. The difference becomes clear when we add a clicker sound to the treat: at some point, the dog will feel rewarded by the sound of the clicker alone, and you don't need the treats anymore. This mechanism is what is currently missing from our models.
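The clicker effect above can be sketched as simple temporal-difference value learning: a cue that reliably precedes a real reward acquires value of its own. This is a minimal illustration, not any lab's actual method; the three-state chain and constants are made up.

```python
# TD(0) sketch of conditioned reinforcement: the "clicker" state, which
# always precedes the treat, ends up carrying reward value by itself.

values = {"trick": 0.0, "clicker": 0.0, "treat": 1.0}  # only the treat is innately rewarding
episode = ["trick", "clicker", "treat"]                # trick -> click -> treat
alpha, gamma = 0.5, 0.9                                # learning rate, discount

for _ in range(50):
    # Each state's value moves toward the discounted value of its successor,
    # so reward value propagates backward along the chain.
    for s, s_next in zip(episode, episode[1:]):
        values[s] += alpha * (gamma * values[s_next] - values[s])

# The clicker state now predicts reward on its own:
# values["clicker"] ~ gamma * values["treat"] = 0.9, so the click alone reinforces.
```

The click becomes a secondary reinforcer exactly because its value converges to (a discounted version of) the treat's value.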

Thus, the idea is to not only reinforce pathways that led to the reward, but also add a small reward signal to the path itself. If many paths happen to cross the same node, that node will accumulate so much reward that it becomes similar to the original reward: it becomes a proxy for the original reward, just like the clicker became a proxy for food.
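A toy version of that accumulation: credit every node on a rewarded trajectory with a small bonus, and nodes shared by many successful paths stand out. The graph, node names, and bonus size here are all hypothetical.

```python
# Proxy-reward sketch: each node on a rewarded path earns a small bonus,
# so a node crossed by many rewarded paths accumulates proxy reward.
from collections import defaultdict

proxy = defaultdict(float)
BONUS = 0.1  # small reward credited to every node on a rewarded path

rewarded_paths = [          # three different successful trajectories
    ["a", "hub", "goal"],
    ["b", "hub", "goal"],
    ["c", "hub", "goal"],
]

for path in rewarded_paths:
    for node in path:
        proxy[node] += BONUS

# "hub" lies on all three rewarded paths, so its proxy reward (~0.3) is
# triple that of any single entry node (0.1): it now acts as a stand-in
# for reaching the goal itself.
```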

The problem now is that the model can start reward hacking, just like the dog optimizes for the clicker even though it doesn't earn it any more food. To counteract this, we can use the same mechanism that forces dog trainers to give a treat once in a while after using the clicker a lot: we decay reward signals from paths that don't lead to real rewards.
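One way to sketch that decay rule, under the assumption that proxy rewards shrink multiplicatively whenever they aren't backed by a real reward and reset when they are; the constants and function are illustrative, not a known algorithm.

```python
# Anti-reward-hacking sketch: a proxy reward fades each step it isn't
# grounded in a real reward, and is restored when a real reward follows,
# mirroring how a clicker loses its power if it's never paired with treats.

DECAY = 0.9                   # fraction of proxy reward kept per ungrounded step
proxy = {"clicker": 1.0}      # proxy starts at full strength

def step(node, led_to_real_reward):
    if led_to_real_reward:
        proxy[node] = 1.0     # a real reward re-grounds the proxy
    else:
        proxy[node] *= DECAY  # otherwise it fades

# Ten clicks with no treat: the proxy fades toward zero...
for _ in range(10):
    step("clicker", led_to_real_reward=False)
faded = proxy["clicker"]      # 0.9 ** 10, roughly a third of full strength

# ...and a single treat restores it to full strength.
step("clicker", led_to_real_reward=True)
```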

If done right, models could start with a few innate rewards, just like humans have innate needs like warmth, food, and sex. The model then learns proxies for these rewards, and proxies for proxies, until it learns very abstract rewards. It will start taking an interest in things that seem completely unrelated to its innate needs at first glance, but that in the end benefit it through the complex network of proxies and relationships learned through this form of RL.
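The "proxies for proxies" chain can be sketched the same way: only the final innate reward is hand-set, and each earlier cue inherits discounted value from the one it predicts. The chain and state names are invented for illustration.

```python
# Chain of proxies: the only hand-set reward sits at the end, and value
# propagates back through successive proxies, so even an abstract interest
# far from the innate need carries (discounted) reward value.

gamma = 0.9
chain = ["abstract_interest", "habit", "cue", "innate_reward"]

value = {s: 0.0 for s in chain}
value["innate_reward"] = 1.0  # the only reward humans specify

# Sweep until values settle: each proxy takes on the discounted value
# of the proxy (or innate reward) it predicts.
for _ in range(20):
    for s, s_next in zip(chain, chain[1:]):
        value[s] = gamma * value[s_next]

# Value decays geometrically with distance from the innate reward:
# cue ~ 0.9, habit ~ 0.81, abstract_interest ~ 0.729 - all learned,
# none hand-set.
```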

The best part of all of this is that we only need humans to set the first few innate signals; the rest grows with scale, making this a true breakthrough for the current brittleness of these models' capabilities.

submitted by /u/PianistWinter8293