arXiv:2604.00723
Learning Rate Warmup Is Architecture-Dependent: Optimal Schedules Diverge for Transformers and State-Space Models
Learning rate warmup is near-universal in deep learning, yet the optimal warmup duration is typically found through expensive grid search. We conduct a controlled comparison across Transformers and State-Space Models (Mamba) on language modeling, image classification, and time-series forecasting, training 840 models with warmup durations ranging from 0% to 20% of total training steps.
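For concreteness, a warmup duration expressed as a fraction of training can be realized as a step-indexed learning rate schedule. The sketch below is a minimal illustration, not the paper's exact setup: the linear ramp into cosine decay, the function name, and the `warmup_frac` parameter are all assumptions for exposition.

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_frac: float, peak_lr: float) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay.

    warmup_frac = 0.0 means no warmup; warmup_frac = 0.2 means the
    warmup occupies the first 20% of training steps. The decay shape
    (cosine to zero) is an assumption, not taken from the paper.
    """
    warmup_steps = int(warmup_frac * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from ~0 up to peak_lr over the warmup window.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Example: compare a 2% warmup against a 20% warmup at step 1000 of 100k.
print(lr_at_step(1000, 100_000, warmup_frac=0.02, peak_lr=3e-4))
print(lr_at_step(1000, 100_000, warmup_frac=0.20, peak_lr=3e-4))
```

Sweeping `warmup_frac` over a grid of values for a fixed `total_steps` reproduces the kind of warmup-duration search the abstract describes.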