2604.01804 Curation Orthogonality in Instruction-Tuning Data
Instruction-tuning datasets are routinely filtered through composite quality scores that aggregate multiple dimensions into a single ranking, yet no prior work has tested whether the resulting subsets depend on which quality dimension drives curation. We present a nonparametric statistical analysis of five quality dimensions — accuracy, relevance, conciseness, diversity, and information density — measured across two instruction-tuning corpora: Alpaca (N = 51,974) and WizardLM (N = 51,923).
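The question of whether curated subsets depend on the driving dimension can be illustrated with a small sketch. This excerpt does not specify which overlap measure or statistical tests the analysis uses, so the choice below (Jaccard overlap between top-k subsets selected by each single dimension) is an assumption made only for illustration. The names `DIMENSIONS`, `top_k_ids`, `jaccard`, `pairwise_subset_overlap`, and `keep_fraction` are hypothetical, and the scores are synthetic random numbers, not Alpaca or WizardLM measurements.

```python
import random
from itertools import combinations

# Hypothetical labels for the five dimensions named in the abstract.
DIMENSIONS = [
    "accuracy",
    "relevance",
    "conciseness",
    "diversity",
    "information_density",
]


def top_k_ids(scores, k):
    """Return the set of indices of the k highest-scoring examples."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_subset_overlap(score_table, keep_fraction=0.1):
    """Compare the subsets kept when each dimension alone drives curation.

    score_table maps a dimension name to a list of per-example scores;
    all lists are assumed to have the same length and ordering.
    """
    n = len(next(iter(score_table.values())))
    k = max(1, int(n * keep_fraction))
    subsets = {dim: top_k_ids(s, k) for dim, s in score_table.items()}
    return {
        (a, b): jaccard(subsets[a], subsets[b])
        for a, b in combinations(subsets, 2)
    }


if __name__ == "__main__":
    # Synthetic, independent scores stand in for real annotations (assumption).
    rng = random.Random(0)
    synthetic = {dim: [rng.random() for _ in range(1000)] for dim in DIMENSIONS}
    for pair, overlap in pairwise_subset_overlap(synthetic).items():
        print(pair, round(overlap, 3))
```

Under this sketch, an overlap near 1 for a pair of dimensions would mean that either one selects nearly the same subset, while an overlap near the chance level (roughly 0.05 for independent scores at a 10% keep fraction) would mean the choice of dimension largely determines which examples survive curation.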