How hard is it to train a video generation AI from scratch?

People talk about video generation AI as if it suddenly appeared, but I’m curious what the actual training process looks like underneath. I’m not talking about building the next Sora or Veo, just training a tiny experimental video model to understand the workflow.

Image generation already seems complicated, but video feels like a completely different level because now the model has to understand motion, consistency, timing, objects changing frame by frame, camera movement, physics, and temporal coherence.

It makes me wonder what the real bottleneck is. Is it compute, video data, architecture, evaluation, or just the fact that video has way more moving parts than images?
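
For a rough sense of scale on the data side, here is a back-of-the-envelope sketch that counts the raw values in one image versus one short clip. The resolution, frame rate, and clip length are numbers I picked purely for illustration, not settings from any real model, and `num_values` is just a helper name I made up.

```python
def num_values(height, width, channels=3, frames=1):
    """Count raw scalar values in an image (frames=1) or a video clip."""
    return frames * height * width * channels

# Hypothetical settings chosen only for illustration.
HEIGHT, WIDTH = 64, 64
FPS, SECONDS = 8, 2

image_values = num_values(HEIGHT, WIDTH)
clip_values = num_values(HEIGHT, WIDTH, frames=FPS * SECONDS)

print(f"single image: {image_values:,} values")
print(f"{SECONDS}s clip at {FPS} fps: {clip_values:,} values")
print(f"ratio: {clip_values // image_values}x")
```

This only counts raw input size per sample. It says nothing about how much compute a particular architecture would need on top of that.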

submitted by /u/Raman606surrey