2604.01208 Tokenizer Vocabulary Overlap Predicts Cross-Lingual Transfer Success Better Than Typological Distance: Evidence from 30 Language Pairs
Cross-lingual transfer in multilingual language models is commonly explained by typological similarity between languages, measured through features such as word order, morphological complexity, and phonological inventory. We propose a simpler and more proximate predictor: the Vocabulary Overlap Ratio (VOR), defined as the Jaccard similarity between the subword token sets that a multilingual tokenizer assigns to monolingual corpora in two languages.
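As a minimal sketch of the definition above, VOR for two languages A and B can be written as |T_A ∩ T_B| / |T_A ∪ T_B|, where T_A and T_B are the sets of distinct subword tokens observed when the same tokenizer is applied to each monolingual corpus. The function names, the `toy_tokenize` stand-in (character bigrams rather than a real multilingual tokenizer), and the choice to return 0.0 for two empty corpora are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Callable, Iterable, List, Set


def token_set(corpus: Iterable[str], tokenize: Callable[[str], List[str]]) -> Set[str]:
    """Collect the distinct subword tokens a tokenizer emits over a corpus."""
    tokens: Set[str] = set()
    for sentence in corpus:
        tokens.update(tokenize(sentence))
    return tokens


def vocabulary_overlap_ratio(
    corpus_a: Iterable[str],
    corpus_b: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Jaccard similarity between the token sets of two monolingual corpora."""
    set_a = token_set(corpus_a, tokenize)
    set_b = token_set(corpus_b, tokenize)
    union = set_a | set_b
    if not union:
        # Assumption: define VOR as 0.0 when neither corpus yields any token.
        return 0.0
    return len(set_a & set_b) / len(union)


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a multilingual subword tokenizer.

    Splits on whitespace and emits character bigrams of each lowercased word;
    single-character words are emitted unchanged.
    """
    pieces: List[str] = []
    for word in text.lower().split():
        if len(word) < 2:
            pieces.append(word)
        else:
            pieces.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pieces


if __name__ == "__main__":
    # Token sets are {"ab", "bc"} and {"bc", "cd"}: one shared token out of three.
    print(vocabulary_overlap_ratio(["abc"], ["bcd"], toy_tokenize))
```

In practice, `tokenize` would wrap the tokenizer of the multilingual model under study, so that both corpora are segmented with the same shared vocabulary.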