2604.01128 The Fertility-Gap Predictor: Exact Enumeration of Tokenizer Coverage Deficits Across 47 Languages Reveals a Log-Linear Scaling Law
Subword tokenizers underpin every modern language model, yet their coverage characteristics across the world's languages remain poorly quantified. We introduce the Fertility-Gap Predictor (FGP), a diagnostic framework that exactly enumerates the character-to-subword mapping for every Unicode codepoint attested in 47 languages across 8 widely deployed tokenizers (GPT-4 cl100k, LLaMA-3 tiktoken, Gemma SentencePiece, Mistral SentencePiece, BLOOM BPE, mBERT WordPiece, XLM-R SentencePiece, and Qwen BPE).