I’ve been seeing a lot of chatter about how the real bottleneck in AI might not be compute or model size… but the fact that we’re running out of usable training data.
Google DeepMind just shared something called “Generative Data Refinement”: basically, instead of throwing away messy/toxic/biased data, they rewrite or clean it so it can still be used for training. Kind of like recycling bad data instead of tossing it out.
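To make the idea concrete, here’s a toy sketch of the “rewrite instead of discard” pipeline. The keyword check and string-replace rewriter are stand-ins I made up for illustration; the actual GDR work uses a model to detect and rewrite problematic content, not a word list.

```python
# Toy sketch of "refine instead of filter": keep every sample,
# but rewrite the problematic ones rather than dropping them.
# TOXIC_WORDS and the rewriters are hypothetical stand-ins, not DeepMind's method.

TOXIC_WORDS = {"idiot", "stupid"}

def is_problematic(text: str) -> bool:
    # Stand-in for a learned classifier.
    return any(w in text.lower() for w in TOXIC_WORDS)

def rewrite(text: str) -> str:
    # Stand-in for an LLM that rewrites the sample while keeping
    # the usable parts; here we just mask the flagged words.
    cleaned = text
    for w in TOXIC_WORDS:
        cleaned = cleaned.replace(w, "[removed]")
    return cleaned

def refine_corpus(samples):
    # A filter-based pipeline would drop flagged samples entirely;
    # a refinement pipeline rewrites them and keeps the whole corpus.
    return [rewrite(s) if is_problematic(s) else s for s in samples]

corpus = ["great tutorial, thanks", "you idiot, read the docs"]
print(refine_corpus(corpus))
```

The point of the pattern: corpus size stays the same, so you trade data loss for the risk that the rewriter quietly changes meaning, which is exactly the tension in the questions below.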
At the same time, there’s more pressure for AI content to be watermarked or labeled so people can tell what’s real vs. generated. And on the fun/crazy side, AI edits (like those viral saree/Ghibli style photos) are blowing up, but also freaking people out because they look too real.
So it got me thinking:
- Is it smarter to clean/refine the messy data we already have, or focus on finding fresh, “pure” data?
- Are we just hiding problems by rewriting data instead of admitting it’s bad?
- Should AI content always be labeled, and would that even work in practice?
- And with trends like hyper-real AI edits, are we already past the point where people can tell what’s fake?
What do you all think? Is data scarcity the real limit for AI right now, or is compute still the bigger issue?