A new paper proposes Show-1, a hybrid model that combines pixel and latent diffusion for efficient high-quality text-to-video generation.
Each approach has tradeoffs on its own, so researchers at the National University of Singapore tried chaining them together and shared the results in a paper published yesterday.
My highlights from the paper:
- Pixel diffusion excels at low-res video generation precisely aligned with text
- Latent diffusion acts as an efficient upsampling expert from low to high resolution
- Chaining the two techniques inherits the benefits of both: Show-1 achieves strong text alignment and visual quality with 15x less inference memory
- The key is using pixel diffusion for the initial low-resolution stage. This retains alignment with text descriptions.
- Latent diffusion then serves as a super-resolution expert, upsampling efficiently while preserving fidelity.
By blending complementary techniques, Show-1 sidesteps the tradeoffs that limit each model on its own.
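To make the chaining concrete, here is a minimal sketch of how the two stages could fit together. The class and function names (PixelDiffusionLowRes, LatentDiffusionUpsampler, show1_style_pipeline) are placeholders for illustration, not the paper's actual code, and the diffusion internals are stubbed out.

```python
# Hypothetical sketch of the Show-1-style two-stage pipeline described above.
# Stage 1 runs diffusion directly in pixel space at low resolution for tight
# text alignment; stage 2 uses latent diffusion as a super-resolution "expert"
# to reach the final resolution efficiently. Names and shapes are illustrative.
import torch

class PixelDiffusionLowRes:
    """Stand-in for a pixel-space text-to-video diffusion model (low resolution)."""
    def generate(self, text_embedding, num_frames=16, size=64):
        # A real model would iteratively denoise in pixel space, conditioned on
        # the text embedding. Here we just return random low-res frames.
        return torch.rand(num_frames, 3, size, size)

class LatentDiffusionUpsampler:
    """Stand-in for a latent-diffusion super-resolution model."""
    def upsample(self, low_res_video, text_embedding, scale=4):
        # A real model would encode frames into a compact latent space, denoise
        # there (cheap, since latents are small), then decode to high-res pixels.
        # This stub only interpolates to show where that step sits in the chain.
        return torch.nn.functional.interpolate(
            low_res_video, scale_factor=scale, mode="bilinear", align_corners=False
        )

def show1_style_pipeline(prompt: str):
    text_embedding = torch.rand(1, 768)  # placeholder for a text encoder output
    keyframes = PixelDiffusionLowRes().generate(text_embedding)            # stage 1: pixel space, low res
    video = LatentDiffusionUpsampler().upsample(keyframes, text_embedding)  # stage 2: latent space, high res
    return video  # shape: (num_frames, 3, 256, 256)

print(show1_style_pipeline("a corgi surfing a wave at sunset").shape)
```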
More details here. Paper is here (includes links to example generations).