Show-1: Marrying Pixel and Latent Diffusion Models for Efficient and High-Quality Text-to-Video Generation

A new paper proposes Show-1, a hybrid model that combines pixel and latent diffusion for efficient high-quality text-to-video generation.

Pixel-space and latent-space diffusion each have tradeoffs, so researchers at the National University of Singapore tried a hybrid approach that combines the two, and shared the results in a paper published yesterday.

My highlights from the paper:

  • Pixel diffusion excels at low-res video generation precisely aligned with text
  • Latent diffusion acts as an efficient upsampling expert from low to high resolution
  • Chaining the two techniques inherits the benefits of both: Show-1 achieves strong alignment and quality with 15x less inference memory (a minimal sketch of the chaining follows this list)
  • The key is using pixel diffusion for the initial low-resolution stage. This retains alignment with text descriptions.
  • Latent diffusion then serves as a super-resolution expert, upsampling efficiently while preserving fidelity.
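To make the chaining concrete, here is a minimal structural sketch of the two-stage pipeline described above. The function names, shapes, and numpy stand-ins are illustrative assumptions, not the paper's actual models or API; the point is only the order of the stages.

```python
import numpy as np

def pixel_diffusion_low_res(prompt: str, frames: int = 8, res: int = 64) -> np.ndarray:
    """Stand-in for the pixel-space diffusion stage: generate a
    low-resolution video tightly aligned with the text prompt."""
    # A real model would run iterative denoising conditioned on the prompt;
    # here we just return a random clip with the expected shape.
    return np.random.rand(frames, res, res, 3).astype(np.float32)

def latent_diffusion_upsample(video: np.ndarray, prompt: str, scale: int = 4) -> np.ndarray:
    """Stand-in for the latent-diffusion super-resolution expert:
    upsample the low-res video to the target resolution efficiently."""
    # A real model would encode to a latent space, denoise, and decode;
    # here we use nearest-neighbor upsampling as a placeholder.
    return np.repeat(np.repeat(video, scale, axis=1), scale, axis=2)

def hybrid_text_to_video(prompt: str) -> np.ndarray:
    # Stage 1 (pixel space): cheap low-res generation with strong text alignment.
    low_res = pixel_diffusion_low_res(prompt)
    # Stage 2 (latent space): memory-efficient upsampling to high resolution.
    return latent_diffusion_upsample(low_res, prompt)

video = hybrid_text_to_video("a panda playing guitar on the beach")
print(video.shape)  # (8, 256, 256, 3)
```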

By blending complementary techniques, Show-1 moves past the tradeoffs that limit either model on its own.

More details here. Paper is here (includes links to example generations).

submitted by /u/Successful-Western27