Recent Paper shows Scaling won’t work for generalizing outside of Training Data

I recently came across an intriguing paper (https://arxiv.org/html/2406.06489v1) that tested various machine learning models, including a transformer-based language model, on out-of-distribution (OOD) prediction tasks. The authors discovered that simply making neural networks larger doesn't improve their performance on these OOD tasks—and might even make it worse. They argue that scaling up models isn't the solution for achieving genuine understanding beyond their training data.

This finding contrasts with many studies on "grokking," where neural networks suddenly start to generalize well after extended training. According to the new paper, the generalization seen in grokking is too simplistic and doesn't represent true OOD generalization.

However, I have a different perspective on why this new paper's results differ from grokking studies. Grokking often involves very simple tasks—like basic logical operations—where there's a high diversity of input data, but the underlying rule the model needs to learn is straightforward. With enough training and proper regularization, the model finds it more efficient to learn the simple rule rather than memorize all the training examples.
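
To make that concrete, here is a minimal sketch of a grokking-style setup (my own illustration, not code from any particular grokking paper; the architecture and hyperparameters are arbitrary choices): modular addition over a prime has a large space of input pairs but a very simple underlying rule, and weight decay is what eventually makes learning the rule cheaper than memorizing the training pairs.

```python
# Grokking-style toy setup: learn (a + b) mod p from a subset of all pairs.
# Illustrative sketch only; architecture and hyperparameters are arbitrary.
import torch
import torch.nn as nn

p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))   # all (a, b) pairs
labels = (pairs[:, 0] + pairs[:, 1]) % p

perm = torch.randperm(len(pairs))
split = int(0.4 * len(pairs))                  # train on 40% of pairs, hold out the rest
train_idx, test_idx = perm[:split], perm[split:]

def one_hot(batch):
    # Concatenate one-hot codes of a and b -> input dimension 2p
    return torch.cat([nn.functional.one_hot(batch[:, 0], p),
                      nn.functional.one_hot(batch[:, 1], p)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)  # strong regularization
loss_fn = nn.CrossEntropyLoss()

for step in range(20001):
    opt.zero_grad()
    loss = loss_fn(model(one_hot(pairs[train_idx])), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 2000 == 0:
        with torch.no_grad():
            preds = model(one_hot(pairs[test_idx])).argmax(dim=1)
            acc = (preds == labels[test_idx]).float().mean().item()
        print(f"step {step:6d}  train loss {loss.item():.3f}  held-out accuracy {acc:.3f}")
```

Note that the held-out pairs here are unseen combinations from the same task, which is exactly the kind of generalization the new paper argues is too simple to count as true OOD.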

In contrast, the new paper deals with material science, a field with highly complex underlying rules but limited data diversity (since there are only 118 chemical elements). In this scenario, the model tends to memorize the data because it's computationally cheaper than trying to learn the complex underlying relationships.

Think about it this way: to memorize information about 118 elements, a model might need only around 118 parameters. But to learn and store the complex rules governing material properties, it would need far more parameters (see the rough parameter-count sketch after the formula below). This leads us to propose:

Tendency to Generalize ∝ Input Diversity ÷ Complexity of Underlying Rule
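
To put rough numbers on the memorization argument (the feature count and hidden width below are assumptions I'm making purely for illustration, not figures from the paper):

```python
# Back-of-the-envelope parameter counts: memorizing one value per element
# versus modeling interactions among the element's attributes with a small MLP.
n_elements = 118      # distinct chemical elements to memorize
d_features = 100      # assumed number of relevant attributes per element
hidden = 256          # assumed hidden width for a rule-learning network

memorize_params = n_elements                                   # one stored target per element
rule_params = d_features * hidden + hidden + hidden * 1 + 1    # weights + biases of a one-hidden-layer MLP

print(memorize_params)   # 118
print(rule_params)       # 26113 -> storing the rule costs far more parameters
```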

The paper supports this idea by showing that models generalize better OOD when they focus on predicting structures using only the most relevant features—a subset of the total features and target variables. Here, the underlying rule is simpler because there are fewer inputs and outputs involved. We can further refine our equation by considering that the complexity of the underlying rule increases with the number of relevant input dimensions and output dimensions:

Complexity of Underlying Rule ∝ Relevant Input Dimensions × Output Dimensions

Therefore:

Tendency to Generalize ∝ Input Diversity ÷ (Relevant Input Dimensions × Output Dimensions)

In simpler terms, a model's ability to generalize depends on how diverse the input data is and is inversely related to the complexity of what it's trying to learn. The more diverse your data, the better the model can handle complex problems.
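
As a toy illustration of the heuristic (the scoring function and both sets of numbers are my own assumptions, meant only to show the direction of the effect):

```python
# Toy version of: Tendency to Generalize ∝ Input Diversity / (Relevant Input Dims × Output Dims)
def tendency_to_generalize(input_diversity, relevant_input_dims, output_dims):
    return input_diversity / (relevant_input_dims * output_dims)

# Grokking-style task: huge input diversity, tiny rule (two inputs, one output).
print(tendency_to_generalize(input_diversity=97 * 97, relevant_input_dims=2, output_dims=1))   # ~4704.5

# Materials-style task: only 118 elements, many relevant features and target properties.
print(tendency_to_generalize(input_diversity=118, relevant_input_dims=100, output_dims=20))    # ~0.06
```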

I believe that current scaling laws for neural networks show improvements not just on in-distribution (ID) data, as the new paper suggests, but also on OOD tasks where the underlying rule is simple or where there's high data diversity—similar to the tasks explored in grokking studies.

This implies that for certain tasks—like those in material science—where data diversity is low and the underlying rules are complex, large language models (LLMs) won't naturally generalize; they'll resort to memorization. This isn't too surprising. Imagine you're given a list of 10 elements, each with 100 attributes, and you're asked to predict their ionization energies. Would you try to decipher the intricate interactions among all those attributes, or would you just memorize the ionization energies? In such cases, memorization seems more practical.

Humans, however, might attempt to uncover the underlying rule for ionization energy, even with limited data and complex relationships. We might hypothesize based on 9 of the 10 elements and test our predictions on the 10th, refining our understanding iteratively. This approach is akin to leave-one-out cross-validation in machine learning. While I'm not suggesting we adopt this exact method to improve model generalization, validating models with an OOD subset seems crucial for measuring and enhancing their ability to generalize beyond their training data.
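
For completeness, here's what that leave-one-out analogy looks like in scikit-learn (a minimal sketch; the data is random placeholder values and Ridge regression is just an arbitrary model choice):

```python
# Leave-one-out cross-validation over 10 "elements" with 100 attributes each.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 100))    # 10 elements x 100 attributes (placeholder values)
y = rng.normal(size=10)           # placeholder "ionization energies"

errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])            # fit on 9 elements
    errors.append(abs(model.predict(X[test_idx])[0] - y[test_idx][0]))  # test on the held-out one

print(f"mean leave-one-out absolute error: {np.mean(errors):.3f}")
```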

In conclusion, this paper suggests that unless we develop new training methodologies, current models will continue to struggle on OOD tasks where data diversity is low and the underlying rules are complex.

submitted by /u/PianistWinter8293