Snapchat used AI agents to build a sound-aware video captioning system

Training AI to understand and describe video content requires large captioned datasets that are expensive for humans to annotate manually. Now researchers from Snap, UC Merced, and the University of Trento have put together a new dataset called Panda-70M that aims to help.

The new dataset pairs 70 million high-resolution YouTube clips with descriptive captions. The key is an automated pipeline that uses multiple cross-modal "teacher" AI models to generate captions from different inputs such as the video itself, subtitles, and individual frames.
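To make the idea concrete, here's a minimal Python sketch of what a multi-teacher captioning pipeline might look like. Everything here is a hypothetical placeholder: the teacher functions, the toy word-overlap scorer, and all names are stand-ins for the cross-modal models and the learned caption-selection step described in the paper, not Snap's actual code.

```python
# Illustrative sketch of a multi-teacher captioning pipeline (not the authors' implementation).
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Clip:
    """One short video clip plus the modalities a teacher might consume."""
    video_frames: List[str]   # e.g. paths to sampled keyframes (placeholder)
    subtitles: str            # auto-generated subtitle text, if any
    title: str                # metadata from the source video


def video_qa_teacher(clip: Clip) -> str:
    # Hypothetical stand-in for a video question-answering model.
    return f"A clip showing {clip.title.lower()}"


def image_caption_teacher(clip: Clip) -> str:
    # Hypothetical stand-in for an image captioning model run on a keyframe.
    frame = clip.video_frames[0] if clip.video_frames else "a frame"
    return f"An image-level description derived from {frame}"


def subtitle_summary_teacher(clip: Clip) -> str:
    # Hypothetical stand-in for a text summarization model over subtitles.
    return clip.subtitles[:60] or "No speech detected in this clip"


def score_caption(clip: Clip, caption: str) -> float:
    # Toy scorer: word overlap with the clip's metadata. The real pipeline
    # uses a learned model to pick the best candidate; this only mimics it.
    reference = set(f"{clip.title} {clip.subtitles}".lower().split())
    return len(set(caption.lower().split()) & reference)


def caption_clip(clip: Clip, teachers: Dict[str, Callable[[Clip], str]]) -> str:
    """Generate one candidate caption per teacher and keep the highest-scoring one."""
    candidates = [teacher(clip) for teacher in teachers.values()]
    return max(candidates, key=lambda c: score_caption(clip, c))


if __name__ == "__main__":
    clip = Clip(
        video_frames=["frame_000.jpg"],
        subtitles="and here you can see the dog catching the frisbee",
        title="Dog catches frisbee at the park",
    )
    teachers = {
        "video_qa": video_qa_teacher,
        "image_caption": image_caption_teacher,
        "subtitle_summary": subtitle_summary_teacher,
    }
    print(caption_clip(clip, teachers))
```

The point of the sketch is the structure, not the models: several specialized captioners each propose a caption from the modality they understand best, and a separate selection step picks one caption per clip, which is how an ensemble can cover far more clips than any single teacher.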

Some highlights:

  • 70M 720p YouTube clips, each about 8 seconds long, with captions averaging ~13 words
  • Teacher models include video QA, image captioning, and text summarization
  • The ensemble of teachers accurately describes 84% of clips, vs. 31% for any single model
  • Pretraining on this dataset improved video AI models' performance substantially:
    • 18% boost in captioning accuracy after fine-tuning on a small 2.5M-clip subset
    • 7% better at text-video retrieval
    • 77% reduction in video generation errors

Limitations remain around content diversity, caption density, and the quality of fully automated captions. But I think this is a big step forward for assembling large-scale video-text training data to advance multimodal AI.

Efficient pipelines like this could unlock video understanding capabilities approaching human-level comprehension. It'll be exciting to see models trained on Panda-70M as they become available.

Paper here. Summary here.

submitted by /u/Successful-Western27