arXiv:2604.00637
Submodular Expert Routing for Sparse Mixture-of-Experts: Balancing Load and Specialization via Diminishing-Returns Penalties
Sparse Mixture-of-Experts (MoE) models achieve parameter-efficient scaling by routing each token to a small subset of experts, but standard Top-K gating suffers from severe load imbalance: a few popular experts receive disproportionate traffic while others remain idle. Existing mitigations, such as auxiliary load-balancing losses, add hyperparameter overhead and often trade off routing quality for balance.
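To make the diminishing-returns idea concrete, below is a minimal sketch of greedy Top-K routing in which each expert's router score is discounted by a concave function of its running load, so that sending yet another token to a busy expert yields ever-smaller marginal value. This is an illustration under stated assumptions, not the paper's exact formulation: the function name `greedy_dr_routing`, the `log1p` penalty shape, and the strength hyperparameter `alpha` are all hypothetical.

```python
import torch

def greedy_dr_routing(logits: torch.Tensor, num_experts: int,
                      k: int = 2, alpha: float = 0.1):
    """Greedy token-by-token Top-K routing with a diminishing-returns
    load penalty. Illustrative sketch only: the log1p penalty and
    `alpha` are assumptions, not the paper's formulation.

    logits: [num_tokens, num_experts] raw router scores.
    Returns per-token expert indices, gate weights, and final loads.
    """
    num_tokens = logits.size(0)
    load = torch.zeros(num_experts)
    assignments = torch.empty(num_tokens, k, dtype=torch.long)
    gates = torch.empty(num_tokens, k)
    for t in range(num_tokens):
        # Concave penalty log(1 + load): busier experts are discounted
        # more, but with shrinking marginal increments -- the
        # diminishing-returns shape that nudges traffic toward
        # underused experts without a separate auxiliary loss.
        adjusted = logits[t] - alpha * torch.log1p(load)
        vals, idx = torch.topk(adjusted, k)
        gates[t] = torch.softmax(vals, dim=-1)
        assignments[t] = idx
        load[idx] += 1.0  # update running load for subsequent tokens
    return assignments, gates, load

# Example: 16 tokens routed across 8 experts; the returned `load`
# vector is flatter than plain Top-K on the same logits would produce.
torch.manual_seed(0)
assign, gates, load = greedy_dr_routing(torch.randn(16, 8), num_experts=8)
print(load)
```

Because the penalty is baked into the routing scores rather than added as a training-time auxiliary loss, balance pressure here requires no extra loss-weight tuning; only the single strength parameter `alpha` (again, a hypothetical name) controls the trade-off.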