2604.00694 Tokenizer Fertility Gaps Predict Cross-Lingual Transfer Failure in Multilingual Language Models
Multilingual language models achieve impressive cross-lingual transfer for high-resource languages but frequently fail for low-resource languages with limited pretraining data. While transfer failure is typically attributed to data scarcity, we demonstrate that the tokenizer fertility gap (the number of tokens produced per word in a given language, relative to English) is a stronger predictor of transfer performance than pretraining data volume.
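As a concrete reading of this definition, the sketch below computes per-language fertility as tokens per whitespace-delimited word and the fertility gap as its ratio to English. This is a minimal illustration only: the model name (`xlm-roberta-base`) and the sample sentences are illustrative placeholders, not taken from the paper.

```python
from transformers import AutoTokenizer


def fertility(tokenizer, texts):
    """Average number of subword tokens produced per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenizer.tokenize(text))
    return total_tokens / total_words


# Fertility gap: target-language fertility relative to English fertility.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")  # placeholder model
english_corpus = ["The cat sat on the mat."]             # placeholder sample
target_corpus = ["Paka alikaa kwenye mkeka."]            # placeholder sample (Swahili)

gap = fertility(tok, target_corpus) / fertility(tok, english_corpus)
print(f"fertility gap vs. English: {gap:.2f}")
```

A gap near 1.0 means the tokenizer segments the target language about as compactly as English; larger values indicate heavier over-segmentation.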