Multimodal sentiment analysis combines text, audio, and video to understand human emotions. But the extra inputs can also add irrelevant or conflicting signals, so filtering them matters.
Researchers built an "Adaptive Language-guided Multimodal Transformer" (ALMT) that uses the text to guide filtering of the visual and audio data. This produces a lower-noise "hyper-modality" representation that complements the text.
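For intuition, here's a minimal PyTorch sketch of what "text-guided filtering" can look like: cross-attention where text features act as the queries and audio/visual features are the keys and values. The class and layer names are my own, and this is a rough illustration of the idea rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn

class LanguageGuidedFusion(nn.Module):
    """Sketch only (not the authors' code): text features query the audio/visual
    features, so content unrelated to the language signal gets down-weighted
    before it enters the fused "hyper-modality" representation."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        # Hypothetical layer names, chosen for readability.
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text, audio, visual):
        # Queries come from text; keys/values from the noisier modalities.
        a, _ = self.text_to_audio(query=text, key=audio, value=audio)
        v, _ = self.text_to_visual(query=text, key=visual, value=visual)
        # Fuse the filtered audio/visual information with the text features.
        return self.norm(text + a + v)

# Toy usage with already-projected features of shape (batch, seq_len, dim)
text = torch.randn(2, 50, 128)
audio = torch.randn(2, 50, 128)
visual = torch.randn(2, 50, 128)
fused = LanguageGuidedFusion()(text, audio, visual)
print(fused.shape)  # torch.Size([2, 50, 128])
```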
They tested it on MOSI (YouTube reviews), MOSEI (YouTube clips), and CH-SIMS (Chinese videos), and ALMT improved accuracy on all three:
- MOSI: YouTube movie reviews with 2,199 samples. ALMT achieves state-of-the-art performance on various metrics, including a 6% gain in 7-class accuracy.
- MOSEI: 22,856 YouTube clips covering sentiment-rich scenarios. ALMT improves multi-class accuracy by 3-5% over previous methods.
- CH-SIMS: Chinese dataset with over 2,000 video samples. ALMT surpasses prior work by 1.4% in binary accuracy.
Ablation analyses showed big performance drops without the language-guided filtering, which validates it as the main contribution.
Downsides: it needs a lot of training data, and its gains on regression metrics are minor. But overall, filtering multimodal data under text guidance delivers consistent improvements.
The concept feels intuitive: use the dominant signal to filter the others and keep the useful complements. My guess is it would transfer well to other multimodal tasks.
TLDR: New way to filter multimodal data for sentiment analysis using text guidance improves performance. Shows the value in removing distracting signals. Sometimes less is more.
Full summary here. Paper is here.