Test-Time Routing Optimization for Multimodal Mixture-of-Experts Models
This paper introduces a test-time optimization method called R2-T2 that improves routing in mixture-of-experts (MoE) models without requiring retraining. The core idea is using gradient descent during inference to optimize how inputs get routed to diff…
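As a rough illustration of optimizing routing by gradient descent at inference time, the toy sketch below refines the routing logits of a single input while the experts themselves stay frozen. It is not the paper's implementation: the squared-error objective, the `target` vector standing in for some test-time guidance signal, and the names `refine_routing` and `surrogate_loss` are all assumptions made for this example.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def surrogate_loss(logits, expert_outputs, target):
    """Hypothetical objective: squared error between the routed mixture and a target."""
    weights = softmax(logits)
    mixed = weights @ expert_outputs  # shape (D,)
    return 0.5 * float(np.sum((mixed - target) ** 2))


def refine_routing(logits, expert_outputs, target, lr=0.5, steps=50):
    """Run plain gradient descent on the routing logits only.

    expert_outputs has shape (E, D): one frozen output vector per expert.
    Expert parameters are never touched, so nothing is retrained.
    """
    logits = np.asarray(logits, dtype=float).copy()
    for _ in range(steps):
        weights = softmax(logits)
        mixed = weights @ expert_outputs
        # Gradient of the loss with respect to the routing weights.
        grad_w = expert_outputs @ (mixed - target)
        # Chain rule through the softmax to get the gradient on the logits.
        grad_logits = weights * (grad_w - weights @ grad_w)
        logits -= lr * grad_logits
    return logits


# Toy setup: three experts with 2-D outputs; the initial router favors expert 0,
# while the assumed target coincides with expert 1's output.
expert_outputs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target = np.array([0.0, 1.0])
initial_logits = np.array([2.0, 0.0, 0.0])

refined_logits = refine_routing(initial_logits, expert_outputs, target)
print("initial weights:", softmax(initial_logits))
print("refined weights:", softmax(refined_logits))
```

In this toy setup the refined weights shift toward the expert whose output matches the assumed target. How the guidance signal is actually obtained at test time is outside what this sketch covers.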