DeepMind: Increasing learning rate in small models lets you reproduce errors in large ones

Training giant AI models like GPT-3 requires enormous resources: thousands of GPUs running for months. As a solo researcher without access to that kind of scale, I can't easily reproduce experiments and findings from papers on huge models.

But a new paper from DeepMind shows you can recreate and study training instabilities seen in massive models by using small ones.

The key is increasing the learning rate (a toy sweep is sketched after this list):

  • A high learning rate reproduces "attention collapse," where the attention logits grow until the model focuses on just a few tokens.
  • It can also reproduce "logit divergence," where the output logit values grow and drift unstably.
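
To make the methodology concrete, here is a minimal PyTorch sketch of such a learning-rate sweep. The tiny MLP and random data are my own stand-ins, not the paper's setup; the point is only the recipe: same small model, rising learning rate, watch for divergence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch (not the paper's code): train the same small model at
# increasing learning rates and check whether the run blows up.
torch.manual_seed(0)
X = torch.randn(256, 32)          # toy inputs, stand-in for real data
y = torch.randint(0, 10, (256,))  # toy class labels

for lr in [1e-3, 1e-2, 1e-1, 1.0]:
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(200):
        opt.zero_grad()
        loss = F.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
    diverged = not torch.isfinite(loss).item()
    print(f"lr={lr:g}  final_loss={loss.item():.3f}  diverged={diverged}")
```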

These issues have been reported when scaling up to billions of parameters. The cool part is that the techniques that fix them in giant models also work for small ones:

  • qk-layernorm (normalizing the queries and keys before the attention dot product) prevents attention collapse.
  • Adding a "z-loss" regularization term stops logit divergence (both are sketched after this list).
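
Here is a minimal sketch of both mitigations in PyTorch. This is my own single-head illustration, not the paper's implementation, and the class and function names are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKLayerNormAttention(nn.Module):
    """Single-head self-attention with LayerNorm applied to the queries
    and keys before the dot product (qk-layernorm). Keeping q and k at
    unit scale bounds the attention logits, which is what prevents the
    collapse described above."""
    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.q_norm = nn.LayerNorm(d_model)
        self.k_norm = nn.LayerNorm(d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = self.q_norm(q), self.k_norm(k)  # the qk-layernorm step
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        return self.proj(torch.softmax(scores, dim=-1) @ v)

def loss_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                     z_coef: float = 1e-4) -> torch.Tensor:
    """Cross-entropy plus a z-loss penalty on log(Z)^2, where Z is the
    softmax normalizer; this keeps the output logits from drifting to
    large values. The 1e-4 coefficient follows PaLM's choice."""
    log_z = torch.logsumexp(logits, dim=-1)  # log of the softmax normalizer
    return F.cross_entropy(logits, targets) + z_coef * (log_z ** 2).mean()
```

In a multi-head model the same normalization is applied to each head's queries and keys; the single-head version above just keeps the sketch short.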

Some other highlights from the paper include:

  • Longer learning-rate warmup helps stability, especially for bigger models.
  • Decoupling the learning rate from the weight decay improves stability.
  • Increasing depth raises learning-rate sensitivity much faster than increasing width.
  • Upcoming instabilities can be predicted from scaling trends before they appear.
  • The default AdamW epsilon (1e-8) hurts at large scale (a sketch of these knobs follows the list).
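
A hedged sketch of what some of these knobs could look like in PyTorch; the function names and specific numbers here are my own illustrations, not the paper's code:

```python
import torch

model = torch.nn.Linear(32, 10)  # hypothetical stand-in model

# Smaller Adam epsilon than the 1e-8 default, which the paper reports can
# hurt at scale. Built-in weight decay is disabled so it can be decoupled.
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, eps=1e-15, weight_decay=0.0)

def lr_at_step(step: int, peak_lr: float = 3e-4, warmup_steps: int = 5_000) -> float:
    """Linear warmup to the peak learning rate; a longer warmup is more stable."""
    return peak_lr * min(1.0, step / warmup_steps)

def apply_decoupled_weight_decay(model: torch.nn.Module, decay: float = 1e-4) -> None:
    """Weight decay applied independently of the learning rate, so tuning one
    knob does not silently change the other (PyTorch's AdamW scales its
    built-in decay by lr, which couples them)."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(1.0 - decay)
```

In a training loop, you would call apply_decoupled_weight_decay right after opt.step().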

If the authors are right, this is one more tool that lets researchers study, and even help train, giant models without Google-size resources. Small models can guide large-model development, sort of like building a scale train set to study and improve how a railroad system works, for a lot less money than starting your own railroad company, buying land, and building real tracks.

Full summary. Original paper is here.

submitted by /u/Successful-Western27