TL;DR: Better data will lead to better models, even if nothing else changes.
Suppose that starting now:
- Compute scaling stops improving models
- Better architectures stop improving models
- Training and inference algorithms stop improving models
- RL (outside of human feedback) stops improving models
Even if all of that happens, the best models in July 2026 will be better than the best models now. The reason is that AI companies are collecting an unprecedented quantity and quality of data.
While compute scaling is in the headlines, data scaling is just as ridiculous. Companies like Scale AI are making billions of dollars a year just to create data for training models. People with expert-level skills are spending all day churning out examples of prompt-response pairs, ranking responses, and creating examples of how to do their jobs. Tutorials and textbooks were already around, but this kind of AI-tailored data just did not exist 10 years ago, and the amount we have today is nothing compared to what we will have in a few years.
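To make the kinds of data mentioned above concrete, here is a minimal sketch of what a supervised prompt-response pair and a human preference ranking might look like. The field names and schema are hypothetical, invented for illustration; real labeling pipelines vary by company.

```python
# Hypothetical schemas for the two data formats described above.
# Field names are illustrative, not any company's actual format.

# 1. A supervised fine-tuning (SFT) example: an expert writes the response.
sft_example = {
    "prompt": "Explain why binary search runs in O(log n) time.",
    "response": "Each comparison halves the remaining search range, "
                "so at most log2(n) comparisons are needed.",
}

# 2. A preference example: a rater ranks candidate responses, which can
#    then be used for reward modeling in RLHF.
preference_example = {
    "prompt": "Summarize the attention mechanism in one paragraph.",
    "responses": [
        "Attention computes a weighted average of value vectors, with "
        "weights derived from query-key similarity.",
        "It's a thing models do to look at words.",
    ],
    "ranking": [0, 1],  # index 0 was preferred by the human rater
}

def to_sft_text(pair):
    """Flatten a prompt-response pair into one training string."""
    return f"### Prompt:\n{pair['prompt']}\n### Response:\n{pair['response']}"
```

Expert labelers producing millions of records in formats like these is the data scaling the post is describing.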
Data might already be the biggest driver of LLM improvement. If you took GPT-3 from five years ago and retrained it, at its original compute budget, on modern data, the result would be far closer to today's models than most people realize (context length aside, since that has mostly been driven by compute and engineering optimizations).
Furthermore, the biggest thing holding back computer-use agents is the lack of internet-browsing training data. Even if the codebase stayed exactly the same, OpenAI's Operator would be much more useful with 10x, 100x, or 1000x more specialized data.
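For a sense of what "browsing training data" could mean, here is a hypothetical shape for a single trajectory record: a goal plus an alternating observation/action log. This schema is an assumption for illustration, not Operator's actual format.

```python
# Hypothetical schema for one browsing trajectory used to train a
# computer-use agent. All field names are illustrative assumptions.
trajectory = {
    "goal": "Find the cheapest direct flight from SFO to JFK next Tuesday",
    "steps": [
        {"observation": "search engine homepage",
         "action": "type 'SFO to JFK direct flights'"},
        {"observation": "results page with flight listings",
         "action": "click first aggregator link"},
        {"observation": "flight list sorted by price",
         "action": "read price of top result"},
    ],
    "outcome": "success",  # label used to filter or weight trajectories
}

# "10x more data" here simply means 10x more logs like this one.
num_actions = len(trajectory["steps"])
```

Collecting and labeling trajectories like this at scale is expensive human work, which is why this data barely exists yet relative to text.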