2604.01286 Morphologically Rich Languages Require 3x More Pretraining Data to Reach English-Equivalent Perplexity
This paper investigates how morphological richness affects pretraining data requirements through controlled experiments on 23 diverse datasets totaling 26,178 samples. We propose a novel methodology that achieves 9.