Data, Architecture, or Losses: What Contributes Most to Multimodal Transformer Success?

February 2, 2021

DeepMind Blog

In this work, we examine which aspects of multimodal transformers (attention, losses, and pretraining data) are important to their success at multimodal pretraining. We find that multimodal attention, where both language and image transformers attend to each other, is crucial for these models’ success. Models with other types of attention, even with more depth or parameters, fail to achieve results comparable to those of shallower and smaller models with multimodal attention.
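To make "attend to each other" concrete, here is a minimal sketch of the idea in plain Python. It is not the implementation used in this work: it assumes a single attention head, omits the learned query/key/value projections, and uses hypothetical names (`attend`, `multimodal_attention`) and toy two-dimensional features chosen purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention.

    Each query produces a weighted average of `values`, with weights
    given by the softmax of its scaled dot products with `keys`.
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


def multimodal_attention(language_tokens, image_regions):
    """Hypothetical helper: each modality queries the other.

    Language tokens attend over image regions, and image regions
    attend over language tokens. Learned projections are omitted
    (an assumption made to keep the sketch short).
    """
    new_language = attend(language_tokens, image_regions, image_regions)
    new_image = attend(image_regions, language_tokens, language_tokens)
    return new_language, new_image


# Toy features: three language tokens and two image regions.
language_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_regions = [[2.0, 0.0], [0.0, 2.0]]

new_language, new_image = multimodal_attention(language_tokens, image_regions)
```

After this step, every language token's representation is a mixture of image-region features, and every image region's representation is a mixture of language-token features. A single-modality alternative would instead pass each modality only its own tokens as keys and values, so no information would cross between language and image at that layer.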