Generating coherent videos that span multiple scenes from a text description poses unique challenges for AI. Recent progress makes short clips possible, but transitioning smoothly across diverse events and maintaining continuity remain difficult.
A new paper from UNC Chapel Hill proposes VIDEODIRECTORGPT, a two-stage framework aimed at multi-scene video generation.
Here are my highlights from the paper:
- Two-stage approach: first a language model generates a detailed "video plan", then a video generation module renders the scenes based on that plan
- The video plan contains multi-scene descriptions, entities with spatial layouts, backgrounds, and consistency groupings, which together guide the downstream video generation (see the sketch after this list)
- The video generation module, called Layout2Vid, is trained only on images and adds spatial layout control and cross-scene consistency to an existing text-to-video model
- Experiments show improved control over object layouts in single-scene videos vs. baselines
- Multi-scene videos display higher object consistency across scenes compared to baselines
- Maintains competitive performance on open-domain video generation
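To make the "video plan" idea concrete, here's a rough sketch of what such a plan could look like as a data structure. The field names and the bounding-box format are my own guesses for illustration; the paper defines its own schema.

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str                    # e.g. "golden retriever"
    description: str             # appearance details filled in by the planner
    consistency_group: str       # entities in the same group should look identical across scenes
    # one normalized bounding box (x0, y0, x1, y1) per planned frame
    layouts: list[tuple[float, float, float, float]] = field(default_factory=list)

@dataclass
class Scene:
    text: str                    # scene-level text description for the video generator
    background: str              # shared background label, e.g. "a sunny kitchen"
    entities: list[Entity] = field(default_factory=list)

@dataclass
class VideoPlan:
    scenes: list[Scene] = field(default_factory=list)
```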
The key innovation seems to be using a large language model to draft a detailed video plan that guides the overall generation, while the Layout2Vid renderer adds spatial layout control and cross-scene consistency on top of an existing text-to-video model. Separating the planning from the rendering seems to be what matters.
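And a minimal sketch of how the two stages could fit together, reusing the VideoPlan classes above. `plan_with_llm` and `render_scene` are stand-ins for the LLM planner and Layout2Vid, not real APIs; the stubs only show the control flow.

```python
def plan_with_llm(prompt: str) -> VideoPlan:
    # Stage 1 stand-in: the LLM expands the prompt into scene descriptions,
    # entities, layouts, backgrounds, and consistency groupings.
    return VideoPlan(scenes=[])

def render_scene(scene: Scene, shared_entities: dict[str, str]) -> str:
    # Stage 2 stand-in for the layout-guided video module; returns a fake clip path.
    return f"clip_{scene.background.replace(' ', '_')}.mp4"

def generate_video(prompt: str) -> list[str]:
    plan = plan_with_llm(prompt)
    # Reuse one representation per consistency group so an entity that shows up
    # in several scenes is conditioned on the same identity each time.
    shared: dict[str, str] = {}
    for scene in plan.scenes:
        for ent in scene.entities:
            shared.setdefault(ent.consistency_group, ent.description)
    return [render_scene(scene, shared) for scene in plan.scenes]
```

The shared dictionary is only there to illustrate why the consistency groupings in the plan matter: they tell the renderer which entities should share an identity across scenes.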
You can read my full summary here. There's a link to the repo there too. Paper link is here.