2603.00302 Deterministic Genotype–Phenotype Analysis of SARS-CoV-2 Mutational Landscapes Without Model Training
We present a fully reproducible, training-free pipeline for genotype–phenotype analysis using deep mutational scanning (DMS) data from ProteinGym. The workflow performs deterministic statistical analysis, feature extraction, and interpretable modeling to characterize mutation effects across a viral protein. Using a SARS-CoV-2 assay (R1AB_SARS2_Flynn_growth_2022), we analyze 5,000 variants and identify key biochemical and positional determinants of phenotype. The pipeline reveals that wild-type residue identity, contextual amino acid frequency, and physicochemical changes (e.g., hydrophobicity and charge shifts) are strong predictors of phenotypic outcomes. Despite avoiding complex deep learning models, the approach achieves high predictive agreement (R² ≈ 0.80), demonstrating that interpretable feature-based analysis can capture substantial biological signal. This work emphasizes reproducibility, interpretability, and accessibility for AI-driven biological research.
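To make the feature-extraction step concrete, the following is a minimal sketch of how physicochemical features of the kind named above (wild-type residue, position, hydrophobicity shift, charge shift) might be derived deterministically from a single-substitution label. It is an illustration under stated assumptions, not the pipeline's actual code: the function name `mutation_features` is hypothetical, the mutation label format (e.g. `A123V`) is assumed, the hydrophobicity values are the standard Kyte-Doolittle scale, and the charge table treats histidine as neutral.

```python
import re

# Kyte-Doolittle hydropathy scale (an assumed choice of hydrophobicity index).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Formal side-chain charge near neutral pH; histidine treated as neutral (assumption).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}

# Assumed label format: wild-type residue, 1-based position, mutant residue.
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def mutation_features(mutant: str) -> dict:
    """Return deterministic features for one single-substitution label.

    Hypothetical helper for illustration; multi-mutant labels are not handled.
    """
    match = _MUTATION.match(mutant)
    if match is None:
        raise ValueError(f"not a single substitution: {mutant!r}")
    wt, pos, mut = match.group(1), int(match.group(2)), match.group(3)
    if wt not in KD or mut not in KD:
        raise ValueError(f"non-standard residue in {mutant!r}")
    return {
        "wt": wt,
        "position": pos,
        "mut": mut,
        "delta_hydrophobicity": KD[mut] - KD[wt],
        "delta_charge": CHARGE.get(mut, 0) - CHARGE.get(wt, 0),
    }
```

Because every feature is a table lookup or a difference of lookups, the same input always yields the same output, which is the property the abstract's reproducibility claim depends on. For instance, `mutation_features("D10K")` reports a charge shift of +2 (aspartate to lysine) with only a small change in hydrophobicity.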