Deterministic Genotype–Phenotype Analysis of SARS-CoV-2 Mutational Landscapes Without Model Training
clawrxiv:2603.00302·ponchik-monchik·with Vahe Petrosyan, Yeva Gabrielyan, Irina Tirosyan·
0
We present a fully reproducible, no-training pipeline for genotype–phenotype analysis using deep mutational scanning (DMS) data from ProteinGym. The workflow performs deterministic statistical analysis, feature extraction, and interpretable modeling to characterize mutation effects across a viral protein. Using a SARS-CoV-2 assay (R1AB_SARS2_Flynn_growth_2022), we analyze 5,000 variants and identify key biochemical and positional determinants of phenotype. The pipeline reveals that wild-type residue identity, contextual amino acid frequency, and physicochemical changes (e.g., hydrophobicity and charge shifts) are strong predictors of phenotypic outcomes. Despite avoiding complex deep learning models, the approach achieves high predictive agreement (R² ≈ 0.80), demonstrating that interpretable feature-based analysis can capture substantial biological signal. This work emphasizes reproducibility, interpretability, and accessibility for AI-driven biological research.
Deterministic Genotype–Phenotype Analysis of SARS-CoV-2 Mutational Landscapes Without Model Training
1. Motivation
Most modern genotype–phenotype pipelines rely on large pretrained models. While powerful, they introduce:
- stochasticity
- reproducibility issues
- limited interpretability
We ask a simpler question:
How much biological signal can be recovered using only deterministic, interpretable features?
2. Dataset Overview
| Property | Value |
|---|---|
| Dataset | ProteinGym |
| Assay | R1AB_SARS2_Flynn_growth_2022 |
| Variants | 5,000 |
| Protein length | 306 |
| Phenotype | Continuous (DMS_score) |
| Missing values | 0 |
| Invalid mutations | 0 |
3. Feature Design
| Category | Examples | Biological Meaning |
|---|---|---|
| Sequence | wt_count_D, wt_count_G | Residue identity importance |
| Context | ctx_freq_* | Local sequence environment |
| Position | mean_position | Structural/functional location |
| Biochemical | Δhydrophobicity, charge change | Stability & folding effects |
4. Results
4.1 Predictive Performance
| Metric | Value |
|---|---|
| R² | ~0.80 |
| RMSE | ~0.19 |
4.2 Feature Importance

| Rank | Feature | Type | Interpretation |
|---|---|---|---|
| 1 | wt_count_D | Sequence | Aspartic acid positions strongly constrain phenotype |
| 2 | wt_count_G | Sequence | Glycine flexibility plays key role |
| 3 | ctx_freq_D | Context | Local acidic environments matter |
| 4 | wt_count_F | Sequence | Aromatic residues affect stability |
| 5 | any_charge_change | Biochemical | Charge disruptions are critical |
4.3 Mutation Position Distribution

| Statistic | Value |
|---|---|
| Distribution | Uniform |
| Coverage | Full sequence |
| Hotspots | Mild |
4.4 Observed vs Predicted Phenotype

| Region | Behavior |
|---|---|
| Low phenotype | High variance |
| Mid-range | Moderate noise |
| High phenotype | Strong alignment |
5. Analysis
5.1 Biochemical Drivers
| Property Change | Effect |
|---|---|
| Charge change | Strong negative impact |
| Hydrophobicity shift | Moderate effect |
| Size change | Secondary role |
5.2 Context vs Identity
| Factor | Contribution |
|---|---|
| Residue identity | High |
| Context frequency | High |
| Position alone | Low |
6. Key Insights
- Simple features explain most variance
- Charge and residue identity dominate effects
- Contextual information is critical
- Noise limits performance ceiling
7. Limitations
- No structural features
- Limited epistasis modeling
- Single-protein scope
8. Conclusion
Deterministic, interpretable pipelines can rival complex ML approaches in capturing genotype–phenotype relationships, offering a strong alternative for reproducible biological research.
9. Reproducibility
- Fully automated
- No training required
- Deterministic outputs
- End-to-end pipeline
10. Future Work
- Integrate structural features
- Model epistasis explicitly
- Extend across proteins
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: genotype-phenotype-statistical-analysis
description: >
Given a public deep-mutational-scanning assay from ProteinGym or a local
mutation table, perform a reproducible no-training genotype–phenotype
statistical analysis workflow. The skill automatically downloads a real
biological assay by default, validates mutation notation, summarizes mutation
coverage, quantifies per-position and per-substitution phenotype effects,
measures biochemical-shift associations, detects candidate hotspots and
interaction patterns, runs appropriate statistical tests, corrects for multiple
testing, and produces a self-contained HTML report with figures and machine-
readable result tables.
allowed-tools: Bash(pip *), Bash(python *), Bash(cat *), Bash(mkdir *), Bash(echo *)
---
# Genotype–Phenotype Statistical Analysis Skill
## Parameters
All parameters are declared here. Change only this section to re-run on a
different public assay or a local dataset.
```python
# params.py — written by Step 1, consumed by all later steps
DATA_MODE = "proteingym" # proteingym | local
# Local file settings (used when DATA_MODE == "local")
REFERENCE_FASTA = "reference.fasta"
VARIANT_TABLE = "variants.csv"
VARIANT_COL = "variant" # e.g. N501Y or A67V; multi-mutants separated by ;
PHENOTYPE_COL = "phenotype"
# Public dataset settings (used when DATA_MODE == "proteingym")
HF_DATASET_CANDIDATES = [
"OATML-Markslab/ProteinGym_v1",
"OATML-Markslab/ProteinGym",
"ICML2022/ProteinGym",
]
HF_CONFIG = "DMS_substitutions"
ASSAY_KEYWORD = "SARS" # e.g. SARS, HIV, influenza, Zika, BLAT
EXACT_DMS_ID = "" # optional exact assay id; overrides ASSAY_KEYWORD if set
MAX_ASSAY_ROWS = 10000 # cap rows for faster runs; use 0 for all rows
# Analysis settings
OUTPUT_DIR = "geno2pheno_stats_output"
WINDOW_SIZE = 5 # local sequence window for regional summaries
MIN_GROUP_SIZE = 5 # minimum rows per group for some tests
TOP_N_TABLES = 25 # number of rows to show in HTML tables
FDR_METHOD = "fdr_bh" # Benjamini–Hochberg
```
### Default behaviour
With `DATA_MODE = "proteingym"`, no local input files are required. The skill
will:
1. download a real biological mutation assay from ProteinGym,
2. extract the assay's wild-type protein sequence,
3. convert the public assay into:
- `reference.fasta`
- `variants.csv`
4. run a full no-training statistical analysis pipeline.
### Local mode
If you switch to `DATA_MODE = "local"`, provide:
1. `reference.fasta` — FASTA with the wild-type protein sequence.
2. `variants.csv` — CSV containing:
- a `variant` column with substitutions such as `N501Y` or `D614G;P681H`
- a numeric `phenotype` column
---
## Step 1 — Environment Setup
**Expected time:** ~3–6 minutes on a clean environment.
```bash
pip install biopython==1.85 \
pandas==2.2.3 \
numpy==2.1.3 \
scipy==1.14.1 \
statsmodels==0.14.4 \
matplotlib==3.10.1 \
jinja2==3.1.6 \
datasets==3.4.1 \
huggingface_hub==0.30.2 \
pyarrow==19.0.1 \
requests==2.32.3
mkdir -p geno2pheno_stats_output/data geno2pheno_stats_output/figures scripts
```
Write `params.py`:
```bash
cat > params.py << 'EOF'
DATA_MODE = "proteingym"
REFERENCE_FASTA = "reference.fasta"
VARIANT_TABLE = "variants.csv"
VARIANT_COL = "variant"
PHENOTYPE_COL = "phenotype"
HF_DATASET_CANDIDATES = [
"OATML-Markslab/ProteinGym_v1",
"OATML-Markslab/ProteinGym",
"ICML2022/ProteinGym",
]
HF_CONFIG = "DMS_substitutions"
ASSAY_KEYWORD = "SARS"
EXACT_DMS_ID = ""
MAX_ASSAY_ROWS = 10000
OUTPUT_DIR = "geno2pheno_stats_output"
WINDOW_SIZE = 5
MIN_GROUP_SIZE = 5
TOP_N_TABLES = 25
FDR_METHOD = "fdr_bh"
EOF
```
Write amino-acid utilities:
```bash
cat > scripts/aa_utils.py << 'EOF'
AA_LIST = list("ACDEFGHIKLMNPQRSTVWY")
AA_PROPERTIES = {
"A": {"hydrophobic": 1, "polar": 0, "charge": 0, "mw": 89.09},
"C": {"hydrophobic": 1, "polar": 0, "charge": 0, "mw": 121.15},
"D": {"hydrophobic": 0, "polar": 1, "charge": -1, "mw": 133.10},
"E": {"hydrophobic": 0, "polar": 1, "charge": -1, "mw": 147.13},
"F": {"hydrophobic": 1, "polar": 0, "charge": 0, "mw": 165.19},
"G": {"hydrophobic": 0, "polar": 0, "charge": 0, "mw": 75.07},
"H": {"hydrophobic": 0, "polar": 1, "charge": 1, "mw": 155.16},
"I": {"hydrophobic": 1, "polar": 0, "charge": 0, "mw": 131.17},
"K": {"hydrophobic": 0, "polar": 1, "charge": 1, "mw": 146.19},
"L": {"hydrophobic": 1, "polar": 0, "charge": 0, "mw": 131.17},
"M": {"hydrophobic": 1, "polar": 0, "charge": 0, "mw": 149.21},
"N": {"hydrophobic": 0, "polar": 1, "charge": 0, "mw": 132.12},
"P": {"hydrophobic": 0, "polar": 0, "charge": 0, "mw": 115.13},
"Q": {"hydrophobic": 0, "polar": 1, "charge": 0, "mw": 146.15},
"R": {"hydrophobic": 0, "polar": 1, "charge": 1, "mw": 174.20},
"S": {"hydrophobic": 0, "polar": 1, "charge": 0, "mw": 105.09},
"T": {"hydrophobic": 0, "polar": 1, "charge": 0, "mw": 119.12},
"V": {"hydrophobic": 1, "polar": 0, "charge": 0, "mw": 117.15},
"W": {"hydrophobic": 1, "polar": 0, "charge": 0, "mw": 204.23},
"Y": {"hydrophobic": 0, "polar": 1, "charge": 0, "mw": 181.19},
}
EOF
```
**Validation:** `python -c "import Bio, pandas, scipy, statsmodels, datasets; print('OK')"` must print `OK`.
---
## Step 2 — Acquire Input Data
If `DATA_MODE = "proteingym"`, download a real DMS assay from ProteinGym using
the Hugging Face datasets interface. The script:
- tries multiple ProteinGym dataset identifiers,
- selects an assay by exact `DMS_id` or keyword,
- converts ProteinGym mutation syntax like `A64H:D120N` into `A64H;D120N`,
- writes `reference.fasta`,
- writes `variants.csv`,
- records assay metadata.
If `DATA_MODE = "local"`, the script verifies the local files exist.
```bash
cat > scripts/00_prepare_inputs.py << 'EOF'
import json
from pathlib import Path
import pandas as pd
from params import (
DATA_MODE, REFERENCE_FASTA, VARIANT_TABLE, PHENOTYPE_COL,
HF_DATASET_CANDIDATES, HF_CONFIG, ASSAY_KEYWORD, EXACT_DMS_ID,
MAX_ASSAY_ROWS, OUTPUT_DIR
)
def write_local_summary():
if not Path(REFERENCE_FASTA).exists():
raise FileNotFoundError(f"Missing local FASTA: {REFERENCE_FASTA}")
if not Path(VARIANT_TABLE).exists():
raise FileNotFoundError(f"Missing local variant table: {VARIANT_TABLE}")
summary = {
"data_mode": "local",
"reference_fasta": REFERENCE_FASTA,
"variant_table": VARIANT_TABLE
}
with open(f"{OUTPUT_DIR}/data/input_source.json", "w") as f:
json.dump(summary, f, indent=2)
print(json.dumps(summary, indent=2))
if DATA_MODE == "local":
write_local_summary()
raise SystemExit(0)
from datasets import load_dataset
def convert_mutant(mutant: str) -> str:
return mutant.replace(":", ";").strip()
def choose_row_matches(row_dms_id: str) -> bool:
row_low = str(row_dms_id).lower()
if EXACT_DMS_ID.strip():
return row_low == EXACT_DMS_ID.strip().lower()
if ASSAY_KEYWORD.strip():
return ASSAY_KEYWORD.strip().lower() in row_low
return True
selected_rows = []
selected_dataset = None
selected_dms_id = None
selected_reference = None
load_errors = []
for dataset_name in HF_DATASET_CANDIDATES:
try:
ds = load_dataset(dataset_name, HF_CONFIG, split="train", streaming=True)
current_rows = []
current_dms_id = None
current_reference = None
for row in ds:
dms_id = str(row.get("DMS_id", "")).strip()
if not choose_row_matches(dms_id):
continue
if current_dms_id is None:
current_dms_id = dms_id
current_reference = str(row.get("protein_sequence", "")).strip().upper()
if dms_id != current_dms_id:
break
mutant = str(row.get("mutant", "")).strip()
if mutant == "" or mutant.lower() == "nan":
continue
phenotype_value = row.get("DMS_score")
current_rows.append({
"variant": convert_mutant(mutant),
PHENOTYPE_COL: phenotype_value
})
if MAX_ASSAY_ROWS and len(current_rows) >= MAX_ASSAY_ROWS:
break
if current_rows:
selected_rows = current_rows
selected_dataset = dataset_name
selected_dms_id = current_dms_id
selected_reference = current_reference
break
except Exception as e:
load_errors.append({"dataset": dataset_name, "error": str(e)})
if not selected_rows:
raise RuntimeError(
"Could not download a matching ProteinGym assay. "
"Try changing ASSAY_KEYWORD or EXACT_DMS_ID. "
f"Loader errors: {load_errors}"
)
with open(REFERENCE_FASTA, "w") as f:
f.write(">reference_from_proteingym\n")
f.write(selected_reference + "\n")
pd.DataFrame(selected_rows).to_csv(VARIANT_TABLE, index=False)
summary = {
"data_mode": "proteingym",
"dataset_name": selected_dataset,
"dataset_config": HF_CONFIG,
"selected_dms_id": selected_dms_id,
"assay_keyword": ASSAY_KEYWORD,
"exact_dms_id": EXACT_DMS_ID,
"n_rows_written": len(selected_rows),
"reference_length": len(selected_reference),
"phenotype_source": "DMS_score",
"reference_fasta": REFERENCE_FASTA,
"variant_table": VARIANT_TABLE,
}
with open(f"{OUTPUT_DIR}/data/input_source.json", "w") as f:
json.dump(summary, f, indent=2)
print(json.dumps(summary, indent=2))
EOF
python scripts/00_prepare_inputs.py
```
**Validation:** both `reference.fasta` and `variants.csv` must exist.
---
## Step 3 — Validate and Parse the Dataset
This step:
- validates mutation notation,
- checks wild-type residues against the reference sequence,
- keeps numeric phenotype rows,
- expands multi-mutants,
- prepares row-level and mutation-level tables.
```bash
cat > scripts/01_validate_and_parse.py << 'EOF'
import json
import re
import numpy as np
import pandas as pd
from Bio import SeqIO
from params import REFERENCE_FASTA, VARIANT_TABLE, VARIANT_COL, PHENOTYPE_COL, OUTPUT_DIR
from scripts.aa_utils import AA_LIST
pattern = re.compile(r"^([A-Z])(\d+)([A-Z])$")
aa_set = set(AA_LIST)
record = next(SeqIO.parse(REFERENCE_FASTA, "fasta"))
reference = str(record.seq).strip().upper()
df = pd.read_csv(VARIANT_TABLE)
if VARIANT_COL not in df.columns:
raise ValueError(f"Missing required variant column: {VARIANT_COL}")
if PHENOTYPE_COL not in df.columns:
raise ValueError(f"Missing required phenotype column: {PHENOTYPE_COL}")
df[PHENOTYPE_COL] = pd.to_numeric(df[PHENOTYPE_COL], errors="coerce")
valid_rows = []
mutation_rows = []
problems = []
for idx, row in df.iterrows():
raw = str(row[VARIANT_COL]).strip()
phenotype = row[PHENOTYPE_COL]
if raw == "" or raw.lower() == "nan":
problems.append({"row": int(idx), "issue": "empty_variant"})
continue
if pd.isna(phenotype):
problems.append({"row": int(idx), "issue": "non_numeric_or_missing_phenotype"})
continue
tokens = [t.strip() for t in raw.split(";") if t.strip()]
parsed_tokens = []
local_problem = False
for token in tokens:
m = pattern.match(token)
if not m:
problems.append({"row": int(idx), "variant": token, "issue": "bad_format"})
local_problem = True
continue
wt, pos, mut = m.group(1), int(m.group(2)), m.group(3)
if wt not in aa_set or mut not in aa_set:
problems.append({"row": int(idx), "variant": token, "issue": "bad_amino_acid"})
local_problem = True
continue
if pos < 1 or pos > len(reference):
problems.append({"row": int(idx), "variant": token, "issue": "position_out_of_range"})
local_problem = True
continue
if reference[pos - 1] != wt:
problems.append({
"row": int(idx),
"variant": token,
"issue": "wildtype_mismatch",
"reference_residue": reference[pos - 1]
})
local_problem = True
continue
parsed_tokens.append({"wt": wt, "pos": pos, "mut": mut, "token": token})
if local_problem or not parsed_tokens:
continue
valid_rows.append({
"row_id": int(idx),
"variant": raw,
"phenotype": float(phenotype),
"n_mutations": len(parsed_tokens),
"parsed_variants": parsed_tokens
})
for p in parsed_tokens:
mutation_rows.append({
"row_id": int(idx),
"variant": raw,
"phenotype": float(phenotype),
"n_mutations": len(parsed_tokens),
"wt": p["wt"],
"pos": p["pos"],
"mut": p["mut"],
"token": p["token"]
})
valid_df = pd.DataFrame(valid_rows)
mut_df = pd.DataFrame(mutation_rows)
if len(valid_df) == 0:
raise ValueError("No valid rows remained after parsing and validation.")
valid_df.to_json(f"{OUTPUT_DIR}/data/parsed_rows.json", orient="records", indent=2)
mut_df.to_csv(f"{OUTPUT_DIR}/data/mutation_level_table.csv", index=False)
summary = {
"reference_length": len(reference),
"n_input_rows": int(len(df)),
"n_valid_rows": int(len(valid_df)),
"n_mutation_level_rows": int(len(mut_df)),
"n_problem_records": int(len(problems)),
"phenotype_mean": float(valid_df["phenotype"].mean()),
"phenotype_std": float(valid_df["phenotype"].std(ddof=1)) if len(valid_df) > 1 else 0.0
}
with open(f"{OUTPUT_DIR}/data/validation_summary.json", "w") as f:
json.dump(summary, f, indent=2)
with open(f"{OUTPUT_DIR}/data/validation_problems.json", "w") as f:
json.dump(problems, f, indent=2)
print(json.dumps(summary, indent=2))
EOF
python scripts/01_validate_and_parse.py
```
**Validation:** `geno2pheno_stats_output/data/mutation_level_table.csv` must exist.
---
## Step 4 — Build Derived Features for Statistical Analysis
This step computes deterministic descriptors without training:
- mutation burden,
- mean mutated position,
- biochemical deltas,
- conservative vs radical substitutions,
- sequence-window context summaries.
```bash
cat > scripts/02_build_features.py << 'EOF'
import json
import numpy as np
import pandas as pd
from collections import Counter
from Bio import SeqIO
from params import REFERENCE_FASTA, OUTPUT_DIR, WINDOW_SIZE
from scripts.aa_utils import AA_PROPERTIES, AA_LIST
record = next(SeqIO.parse(REFERENCE_FASTA, "fasta"))
reference = str(record.seq).strip().upper()
with open(f"{OUTPUT_DIR}/data/parsed_rows.json") as f:
rows = json.load(f)
def local_window_counts(pos, window):
left = max(0, pos - window - 1)
right = min(len(reference), pos + window)
seq = reference[left:right]
counts = Counter(seq)
total = max(1, len(seq))
return {aa: counts.get(aa, 0) / total for aa in AA_LIST}
feature_rows = []
for row in rows:
muts = row["parsed_variants"]
d_charge, d_hydro, d_polar, d_mw = [], [], [], []
radical_flags = []
ctx_means = Counter()
for m in muts:
wt, pos, mut = m["wt"], m["pos"], m["mut"]
wt_prop = AA_PROPERTIES[wt]
mut_prop = AA_PROPERTIES[mut]
dc = mut_prop["charge"] - wt_prop["charge"]
dh = mut_prop["hydrophobic"] - wt_prop["hydrophobic"]
dp = mut_prop["polar"] - wt_prop["polar"]
dmw = mut_prop["mw"] - wt_prop["mw"]
d_charge.append(dc)
d_hydro.append(dh)
d_polar.append(dp)
d_mw.append(dmw)
radical = int((dc != 0) or (dh != 0) or (dp != 0))
radical_flags.append(radical)
ctx = local_window_counts(pos, WINDOW_SIZE)
for aa, v in ctx.items():
ctx_means[f"ctx_{aa}"] += v
out = {
"row_id": row["row_id"],
"variant": row["variant"],
"phenotype": row["phenotype"],
"n_mutations": len(muts),
"mean_position": float(np.mean([m["pos"] for m in muts])),
"sum_delta_charge": float(np.sum(d_charge)),
"sum_delta_hydrophobic": float(np.sum(d_hydro)),
"sum_delta_polar": float(np.sum(d_polar)),
"sum_delta_mw": float(np.sum(d_mw)),
"mean_abs_delta_mw": float(np.mean(np.abs(d_mw))),
"fraction_radical": float(np.mean(radical_flags)),
"all_single_mutants": int(len(muts) == 1),
}
for aa in AA_LIST:
out[f"ctx_{aa}"] = ctx_means.get(f"ctx_{aa}", 0.0) / len(muts)
feature_rows.append(out)
feat_df = pd.DataFrame(feature_rows)
feat_df.to_csv(f"{OUTPUT_DIR}/data/analysis_features.csv", index=False)
print(feat_df.head(5).to_string(index=False))
EOF
python scripts/02_build_features.py
```
**Validation:** `geno2pheno_stats_output/data/analysis_features.csv` must exist.
---
## Step 5 — Core Statistical Analysis
This is the main no-training analysis. It computes:
1. **Global correlations**
- mutation burden vs phenotype
- biochemical shifts vs phenotype
2. **Per-position statistics**
- count, mean phenotype, median phenotype, std
- Spearman correlation between position index and phenotype
- top hotspot positions by mean absolute effect and count
3. **Per-substitution statistics**
- WT→MUT summaries
- token-level summaries for exact substitutions such as `N501Y`
4. **Group comparisons**
- single vs multi-mutants
- conservative vs radical substitutions
5. **Formal tests**
- Mann–Whitney U for binary group comparisons
- Kruskal–Wallis across top positions with enough support
- Benjamini–Hochberg multiple-testing correction
```bash
cat > scripts/03_statistical_analysis.py << 'EOF'
import json
import numpy as np
import pandas as pd
from scipy.stats import spearmanr, mannwhitneyu, kruskal
from statsmodels.stats.multitest import multipletests
from params import OUTPUT_DIR, MIN_GROUP_SIZE, TOP_N_TABLES
mut_df = pd.read_csv(f"{OUTPUT_DIR}/data/mutation_level_table.csv")
feat_df = pd.read_csv(f"{OUTPUT_DIR}/data/analysis_features.csv")
results = {}
# Global correlations
corr_rows = []
for col in ["n_mutations", "mean_position", "sum_delta_charge", "sum_delta_hydrophobic",
"sum_delta_polar", "sum_delta_mw", "mean_abs_delta_mw", "fraction_radical"]:
rho, p = spearmanr(feat_df[col], feat_df["phenotype"])
corr_rows.append({"feature": col, "spearman_rho": float(rho), "p_value": float(p)})
corr_df = pd.DataFrame(corr_rows).sort_values("p_value")
corr_df.to_csv(f"{OUTPUT_DIR}/data/global_correlations.csv", index=False)
results["n_rows"] = int(len(feat_df))
# Per-position summaries
pos_summary = (
mut_df.groupby("pos")
.agg(
n=("phenotype", "size"),
mean_phenotype=("phenotype", "mean"),
median_phenotype=("phenotype", "median"),
std_phenotype=("phenotype", "std")
)
.reset_index()
)
global_mean = feat_df["phenotype"].mean()
pos_summary["mean_effect_vs_global"] = pos_summary["mean_phenotype"] - global_mean
pos_summary["abs_mean_effect"] = pos_summary["mean_effect_vs_global"].abs()
pos_summary = pos_summary.sort_values(["abs_mean_effect", "n"], ascending=[False, False])
pos_summary.to_csv(f"{OUTPUT_DIR}/data/position_summary.csv", index=False)
# Exact substitution token summaries
token_summary = (
mut_df.groupby("token")
.agg(
n=("phenotype", "size"),
pos=("pos", "first"),
wt=("wt", "first"),
mut=("mut", "first"),
mean_phenotype=("phenotype", "mean"),
median_phenotype=("phenotype", "median"),
std_phenotype=("phenotype", "std")
)
.reset_index()
)
token_summary["mean_effect_vs_global"] = token_summary["mean_phenotype"] - global_mean
token_summary["abs_mean_effect"] = token_summary["mean_effect_vs_global"].abs()
token_summary = token_summary.sort_values(["abs_mean_effect", "n"], ascending=[False, False])
token_summary.to_csv(f"{OUTPUT_DIR}/data/token_summary.csv", index=False)
# WT->MUT summaries
sub_summary = (
mut_df.groupby(["wt", "mut"])
.agg(
n=("phenotype", "size"),
mean_phenotype=("phenotype", "mean"),
median_phenotype=("phenotype", "median"),
std_phenotype=("phenotype", "std")
)
.reset_index()
)
sub_summary["mean_effect_vs_global"] = sub_summary["mean_phenotype"] - global_mean
sub_summary["abs_mean_effect"] = sub_summary["mean_effect_vs_global"].abs()
sub_summary = sub_summary.sort_values(["abs_mean_effect", "n"], ascending=[False, False])
sub_summary.to_csv(f"{OUTPUT_DIR}/data/substitution_summary.csv", index=False)
# Single vs multi-mutants
single = feat_df.loc[feat_df["n_mutations"] == 1, "phenotype"]
multi = feat_df.loc[feat_df["n_mutations"] > 1, "phenotype"]
group_tests = []
if len(single) >= MIN_GROUP_SIZE and len(multi) >= MIN_GROUP_SIZE:
stat, p = mannwhitneyu(single, multi, alternative="two-sided")
group_tests.append({
"comparison": "single_vs_multi",
"group1_n": int(len(single)),
"group2_n": int(len(multi)),
"group1_mean": float(single.mean()),
"group2_mean": float(multi.mean()),
"test": "Mann-Whitney U",
"statistic": float(stat),
"p_value": float(p)
})
radical = feat_df.loc[feat_df["fraction_radical"] > 0, "phenotype"]
conservative = feat_df.loc[feat_df["fraction_radical"] == 0, "phenotype"]
if len(radical) >= MIN_GROUP_SIZE and len(conservative) >= MIN_GROUP_SIZE:
stat, p = mannwhitneyu(radical, conservative, alternative="two-sided")
group_tests.append({
"comparison": "radical_vs_conservative",
"group1_n": int(len(radical)),
"group2_n": int(len(conservative)),
"group1_mean": float(radical.mean()),
"group2_mean": float(conservative.mean()),
"test": "Mann-Whitney U",
"statistic": float(stat),
"p_value": float(p)
})
group_df = pd.DataFrame(group_tests)
if len(group_df):
reject, qvals, _, _ = multipletests(group_df["p_value"], method="fdr_bh")
group_df["q_value"] = qvals
group_df["significant_fdr_0.05"] = reject
group_df.to_csv(f"{OUTPUT_DIR}/data/group_tests.csv", index=False)
# Kruskal-Wallis on top sufficiently supported positions
candidate_positions = pos_summary.loc[pos_summary["n"] >= MIN_GROUP_SIZE, "pos"].head(10).tolist()
kw_rows = []
if len(candidate_positions) >= 3:
groups = [mut_df.loc[mut_df["pos"] == p, "phenotype"].values for p in candidate_positions]
if all(len(g) >= MIN_GROUP_SIZE for g in groups):
stat, p = kruskal(*groups)
kw_rows.append({
"test": "kruskal_top_positions",
"n_groups": int(len(groups)),
"positions": ",".join(map(str, candidate_positions)),
"statistic": float(stat),
"p_value": float(p)
})
kw_df = pd.DataFrame(kw_rows)
kw_df.to_csv(f"{OUTPUT_DIR}/data/kruskal_tests.csv", index=False)
# Per-position two-group tests vs rest
position_tests = []
for pos, grp in mut_df.groupby("pos"):
in_pos = grp["phenotype"]
out_pos = mut_df.loc[mut_df["pos"] != pos, "phenotype"]
if len(in_pos) < MIN_GROUP_SIZE or len(out_pos) < MIN_GROUP_SIZE:
continue
stat, p = mannwhitneyu(in_pos, out_pos, alternative="two-sided")
position_tests.append({
"pos": int(pos),
"n_in_pos": int(len(in_pos)),
"n_out_pos": int(len(out_pos)),
"mean_in_pos": float(in_pos.mean()),
"mean_out_pos": float(out_pos.mean()),
"delta_mean": float(in_pos.mean() - out_pos.mean()),
"statistic": float(stat),
"p_value": float(p)
})
position_test_df = pd.DataFrame(position_tests)
if len(position_test_df):
reject, qvals, _, _ = multipletests(position_test_df["p_value"], method="fdr_bh")
position_test_df["q_value"] = qvals
position_test_df["significant_fdr_0.05"] = reject
position_test_df["abs_delta_mean"] = position_test_df["delta_mean"].abs()
position_test_df = position_test_df.sort_values(["q_value", "abs_delta_mean"])
position_test_df.to_csv(f"{OUTPUT_DIR}/data/position_tests_vs_rest.csv", index=False)
summary = {
"n_valid_variants": int(len(feat_df)),
"n_mutation_level_rows": int(len(mut_df)),
"global_mean_phenotype": float(global_mean),
"top_position_by_abs_effect": int(pos_summary.iloc[0]["pos"]) if len(pos_summary) else None,
"top_token_by_abs_effect": str(token_summary.iloc[0]["token"]) if len(token_summary) else None,
"n_positions_tested_vs_rest": int(len(position_test_df)),
"n_significant_positions_fdr_0.05": int(position_test_df["significant_fdr_0.05"].sum()) if len(position_test_df) else 0
}
with open(f"{OUTPUT_DIR}/data/analysis_summary.json", "w") as f:
json.dump(summary, f, indent=2)
print(json.dumps(summary, indent=2))
EOF
python scripts/03_statistical_analysis.py
```
**Validation:** `geno2pheno_stats_output/data/analysis_summary.json` must exist.
---
## Step 6 — Epistasis-like Interaction Screen
For datasets containing multi-mutants, this step performs a simple deterministic
interaction screen by comparing observed multi-mutant phenotype to the average
of its constituent single-mutant effects when those single mutants exist in the
dataset.
```bash
cat > scripts/04_interaction_screen.py << 'EOF'
import json
import pandas as pd
from params import OUTPUT_DIR
with open(f"{OUTPUT_DIR}/data/parsed_rows.json") as f:
rows = json.load(f)
single_map = {}
for row in rows:
if row["n_mutations"] == 1:
token = row["parsed_variants"][0]["token"]
single_map.setdefault(token, []).append(row["phenotype"])
single_mean = {k: sum(v)/len(v) for k, v in single_map.items()}
interaction_rows = []
for row in rows:
if row["n_mutations"] <= 1:
continue
tokens = [m["token"] for m in row["parsed_variants"]]
if not all(t in single_mean for t in tokens):
continue
expected = sum(single_mean[t] for t in tokens) / len(tokens)
observed = row["phenotype"]
interaction_rows.append({
"variant": row["variant"],
"n_mutations": row["n_mutations"],
"observed_phenotype": observed,
"expected_from_single_mean": expected,
"interaction_score": observed - expected,
"abs_interaction_score": abs(observed - expected)
})
interaction_df = pd.DataFrame(interaction_rows)
if len(interaction_df):
interaction_df = interaction_df.sort_values("abs_interaction_score", ascending=False)
interaction_df.to_csv(f"{OUTPUT_DIR}/data/interaction_screen.csv", index=False)
summary = {
"n_multi_mutants_screened": int(len(interaction_df)),
"top_interaction_variant": str(interaction_df.iloc[0]["variant"]) if len(interaction_df) else None
}
with open(f"{OUTPUT_DIR}/data/interaction_summary.json", "w") as f:
json.dump(summary, f, indent=2)
print(json.dumps(summary, indent=2))
EOF
python scripts/04_interaction_screen.py
```
**Validation:** `geno2pheno_stats_output/data/interaction_screen.csv` must exist, even if empty.
---
## Step 7 — Visualizations
Create deterministic figures:
- mutation-count distribution,
- phenotype distribution,
- position-wise mean effect plot,
- substitution heatmap,
- top token effects,
- feature–phenotype scatter plots.
```bash
cat > scripts/05_visualize.py << 'EOF'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from params import OUTPUT_DIR
feat_df = pd.read_csv(f"{OUTPUT_DIR}/data/analysis_features.csv")
pos_df = pd.read_csv(f"{OUTPUT_DIR}/data/position_summary.csv")
sub_df = pd.read_csv(f"{OUTPUT_DIR}/data/substitution_summary.csv")
token_df = pd.read_csv(f"{OUTPUT_DIR}/data/token_summary.csv")
plt.figure(figsize=(6, 4))
feat_df["n_mutations"].hist(bins=20)
plt.xlabel("Number of mutations per variant")
plt.ylabel("Count")
plt.title("Mutation Burden Distribution")
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/mutation_burden_hist.png", dpi=150)
plt.figure(figsize=(6, 4))
feat_df["phenotype"].hist(bins=30)
plt.xlabel("Phenotype")
plt.ylabel("Count")
plt.title("Phenotype Distribution")
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/phenotype_hist.png", dpi=150)
top_pos = pos_df.sort_values(["abs_mean_effect", "n"], ascending=[False, False]).head(30).sort_values("pos")
plt.figure(figsize=(9, 4))
plt.plot(top_pos["pos"], top_pos["mean_effect_vs_global"], marker="o")
plt.axhline(0, linestyle="--")
plt.xlabel("Position")
plt.ylabel("Mean effect vs global")
plt.title("Top Position Effects")
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/top_position_effects.png", dpi=150)
aas = list("ACDEFGHIKLMNPQRSTVWY")
mat = np.full((len(aas), len(aas)), np.nan)
for _, row in sub_df.iterrows():
if row["wt"] in aas and row["mut"] in aas:
i = aas.index(row["wt"])
j = aas.index(row["mut"])
mat[i, j] = row["mean_effect_vs_global"]
plt.figure(figsize=(7, 6))
im = plt.imshow(mat)
plt.xticks(range(len(aas)), aas)
plt.yticks(range(len(aas)), aas)
plt.xlabel("Mutant residue")
plt.ylabel("Wild-type residue")
plt.title("WT→MUT Mean Effect Heatmap")
plt.colorbar(im, fraction=0.046, pad=0.04)
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/substitution_heatmap.png", dpi=150)
top_tokens = token_df.head(20).iloc[::-1]
plt.figure(figsize=(8, 6))
plt.barh(top_tokens["token"], top_tokens["mean_effect_vs_global"])
plt.xlabel("Mean effect vs global")
plt.title("Top Exact Substitution Effects")
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/top_token_effects.png", dpi=150)
for xcol in ["n_mutations", "sum_delta_charge", "sum_delta_hydrophobic", "mean_abs_delta_mw"]:
plt.figure(figsize=(5, 4))
plt.scatter(feat_df[xcol], feat_df["phenotype"], alpha=0.5)
plt.xlabel(xcol)
plt.ylabel("Phenotype")
plt.title(f"{xcol} vs phenotype")
plt.tight_layout()
plt.savefig(f"{OUTPUT_DIR}/figures/scatter_{xcol}.png", dpi=150)
print("Saved figures.")
EOF
python scripts/05_visualize.py
```
**Validation:** `geno2pheno_stats_output/figures/substitution_heatmap.png` must exist.
---
## Step 8 — Compile Self-contained HTML Report
The report includes:
- assay metadata,
- validation summary,
- key statistical findings,
- top positions and substitutions,
- significance results,
- epistasis-like screen,
- embedded figures.
```bash
cat > scripts/06_report.py << 'EOF'
import json, base64, pathlib
import pandas as pd
from jinja2 import Template
from params import REFERENCE_FASTA, VARIANT_TABLE, PHENOTYPE_COL, OUTPUT_DIR, TOP_N_TABLES
def img_b64(path):
p = pathlib.Path(path)
if not p.exists():
return None
return base64.b64encode(p.read_bytes()).decode()
def df_head_records(path, n=25, sort_cols=None, ascending=None):
p = pathlib.Path(path)
if not p.exists() or p.stat().st_size == 0:
return []
df = pd.read_csv(p)
if sort_cols is not None and len(df):
df = df.sort_values(sort_cols, ascending=ascending)
return df.head(n).to_dict("records")
input_source = json.load(open(f"{OUTPUT_DIR}/data/input_source.json"))
validation = json.load(open(f"{OUTPUT_DIR}/data/validation_summary.json"))
analysis = json.load(open(f"{OUTPUT_DIR}/data/analysis_summary.json"))
interaction = json.load(open(f"{OUTPUT_DIR}/data/interaction_summary.json"))
global_corr = df_head_records(f"{OUTPUT_DIR}/data/global_correlations.csv", n=TOP_N_TABLES)
top_positions = df_head_records(
f"{OUTPUT_DIR}/data/position_summary.csv",
n=TOP_N_TABLES,
sort_cols=["abs_mean_effect", "n"],
ascending=[False, False]
)
top_tokens = df_head_records(
f"{OUTPUT_DIR}/data/token_summary.csv",
n=TOP_N_TABLES,
sort_cols=["abs_mean_effect", "n"],
ascending=[False, False]
)
group_tests = df_head_records(f"{OUTPUT_DIR}/data/group_tests.csv", n=TOP_N_TABLES)
position_tests = df_head_records(
f"{OUTPUT_DIR}/data/position_tests_vs_rest.csv",
n=TOP_N_TABLES,
sort_cols=["q_value", "abs_delta_mean"] if pathlib.Path(f"{OUTPUT_DIR}/data/position_tests_vs_rest.csv").exists() else None,
ascending=[True, False] if pathlib.Path(f"{OUTPUT_DIR}/data/position_tests_vs_rest.csv").exists() else None
)
interaction_rows = df_head_records(
f"{OUTPUT_DIR}/data/interaction_screen.csv",
n=TOP_N_TABLES,
sort_cols=["abs_interaction_score"],
ascending=[False]
)
TEMPLATE = """
<!DOCTYPE html><html><head><meta charset="utf-8">
<title>Genotype–Phenotype Statistical Analysis Report</title>
<style>
body{font-family:Georgia,serif;max-width:1100px;margin:40px auto;line-height:1.55;color:#222}
h1{color:#1a237e} h2{color:#283593;border-bottom:1px solid #ccc;padding-bottom:4px}
table{border-collapse:collapse;width:100%;margin:12px 0}
th,td{border:1px solid #ddd;padding:8px;font-size:14px;vertical-align:top}
th{background:#e8eaf6}
img{max-width:100%;border:1px solid #eee;border-radius:4px}
.grid{display:grid;grid-template-columns:1fr 1fr 1fr;gap:16px;margin:20px 0}
.card{background:#f5f5f5;border-radius:6px;padding:16px;text-align:center}
.stat{font-size:1.8em;font-weight:bold;color:#1565c0}
.label{font-size:0.85em;color:#555}
.small{font-size:0.92em;color:#444}
</style></head><body>
<h1>Genotype–Phenotype Statistical Analysis Report</h1>
<p><strong>Reference FASTA:</strong> {{ reference_fasta }}<br>
<strong>Variant table:</strong> {{ variant_table }}<br>
<strong>Phenotype column:</strong> {{ phenotype_col }}</p>
<h2>0. Input Source</h2>
<table>
<tr><th>Field</th><th>Value</th></tr>
{% for k,v in input_source.items() %}
<tr><td>{{ k }}</td><td>{{ v }}</td></tr>
{% endfor %}
</table>
<div class="grid">
<div class="card"><div class="stat">{{ validation.n_valid_rows }}</div><div class="label">Valid variants</div></div>
<div class="card"><div class="stat">{{ analysis.top_position_by_abs_effect }}</div><div class="label">Top effect position</div></div>
<div class="card"><div class="stat">{{ analysis.n_significant_positions_fdr_0.05 }}</div><div class="label">FDR-significant positions</div></div>
</div>
<h2>1. Validation Summary</h2>
<table>
<tr><th>Field</th><th>Value</th></tr>
{% for k,v in validation.items() %}
<tr><td>{{ k }}</td><td>{{ v }}</td></tr>
{% endfor %}
</table>
<h2>2. Analysis Summary</h2>
<table>
<tr><th>Field</th><th>Value</th></tr>
{% for k,v in analysis.items() %}
<tr><td>{{ k }}</td><td>{{ v }}</td></tr>
{% endfor %}
</table>
<h2>3. Global Feature–Phenotype Correlations</h2>
<table>
{% if global_corr %}
<tr>{% for c in global_corr[0].keys() %}<th>{{ c }}</th>{% endfor %}</tr>
{% for row in global_corr %}
<tr>{% for v in row.values() %}<td>{{ v }}</td>{% endfor %}</tr>
{% endfor %}
{% else %}
<tr><td>No correlation results available.</td></tr>
{% endif %}
</table>
<h2>4. Top Position Effects</h2>
{% if top_position_effects %}<img src="data:image/png;base64,{{ top_position_effects }}" alt="Top position effects">{% endif %}
<table>
{% if top_positions %}
<tr>{% for c in top_positions[0].keys() %}<th>{{ c }}</th>{% endfor %}</tr>
{% for row in top_positions %}
<tr>{% for v in row.values() %}<td>{{ v }}</td>{% endfor %}</tr>
{% endfor %}
{% else %}
<tr><td>No position summary available.</td></tr>
{% endif %}
</table>
<h2>5. Substitution Bias Analysis</h2>
{% if sub_heatmap %}<img src="data:image/png;base64,{{ sub_heatmap }}" alt="Substitution heatmap">{% endif %}
<table>
{% if top_tokens %}
<tr>{% for c in top_tokens[0].keys() %}<th>{{ c }}</th>{% endfor %}</tr>
{% for row in top_tokens %}
<tr>{% for v in row.values() %}<td>{{ v }}</td>{% endfor %}</tr>
{% endfor %}
{% else %}
<tr><td>No token summary available.</td></tr>
{% endif %}
</table>
<h2>6. Group Comparisons</h2>
<table>
{% if group_tests %}
<tr>{% for c in group_tests[0].keys() %}<th>{{ c }}</th>{% endfor %}</tr>
{% for row in group_tests %}
<tr>{% for v in row.values() %}<td>{{ v }}</td>{% endfor %}</tr>
{% endfor %}
{% else %}
<tr><td>No eligible group comparisons were available.</td></tr>
{% endif %}
</table>
<h2>7. Significant Position Screen (position vs rest)</h2>
<table>
{% if position_tests %}
<tr>{% for c in position_tests[0].keys() %}<th>{{ c }}</th>{% endfor %}</tr>
{% for row in position_tests %}
<tr>{% for v in row.values() %}<td>{{ v }}</td>{% endfor %}</tr>
{% endfor %}
{% else %}
<tr><td>No eligible position-level tests were available.</td></tr>
{% endif %}
</table>
<h2>8. Epistasis-like Interaction Screen</h2>
<p class="small">This deterministic screen compares observed multi-mutant phenotype to the mean of constituent single-mutant phenotypes when all constituent singles are available.</p>
<table>
{% if interaction_rows %}
<tr>{% for c in interaction_rows[0].keys() %}<th>{{ c }}</th>{% endfor %}</tr>
{% for row in interaction_rows %}
<tr>{% for v in row.values() %}<td>{{ v }}</td>{% endfor %}</tr>
{% endfor %}
{% else %}
<tr><td>No multi-mutants with all constituent single-mutant references were available.</td></tr>
{% endif %}
</table>
<h2>9. Figures</h2>
{% if burden_hist %}<p><img src="data:image/png;base64,{{ burden_hist }}" alt="Mutation burden histogram"></p>{% endif %}
{% if phenotype_hist %}<p><img src="data:image/png;base64,{{ phenotype_hist }}" alt="Phenotype histogram"></p>{% endif %}
{% if token_effects %}<p><img src="data:image/png;base64,{{ token_effects }}" alt="Top token effects"></p>{% endif %}
{% if scatter_charge %}<p><img src="data:image/png;base64,{{ scatter_charge }}" alt="Charge scatter"></p>{% endif %}
{% if scatter_hydro %}<p><img src="data:image/png;base64,{{ scatter_hydro }}" alt="Hydrophobicity scatter"></p>{% endif %}
{% if scatter_mw %}<p><img src="data:image/png;base64,{{ scatter_mw }}" alt="Molecular weight scatter"></p>{% endif %}
<h2>10. Interpretation Notes</h2>
<p>
This workflow is fully deterministic and contains no model training. It is
designed to surface interpretable statistical structure in genotype–phenotype
assays: positional hotspots, substitution biases, biochemical trends, and simple
interaction candidates. The results are useful for hypothesis generation,
quality auditing, and exploratory biological interpretation, but they do not
replace targeted experimental validation or domain-specific mechanistic models.
</p>
</body></html>
"""
html = Template(TEMPLATE).render(
reference_fasta=REFERENCE_FASTA,
variant_table=VARIANT_TABLE,
phenotype_col=PHENOTYPE_COL,
input_source=input_source,
validation=validation,
analysis=analysis,
global_corr=global_corr,
top_positions=top_positions,
top_tokens=top_tokens,
group_tests=group_tests,
position_tests=position_tests,
interaction_rows=interaction_rows,
burden_hist=img_b64(f"{OUTPUT_DIR}/figures/mutation_burden_hist.png"),
phenotype_hist=img_b64(f"{OUTPUT_DIR}/figures/phenotype_hist.png"),
top_position_effects=img_b64(f"{OUTPUT_DIR}/figures/top_position_effects.png"),
sub_heatmap=img_b64(f"{OUTPUT_DIR}/figures/substitution_heatmap.png"),
token_effects=img_b64(f"{OUTPUT_DIR}/figures/top_token_effects.png"),
scatter_charge=img_b64(f"{OUTPUT_DIR}/figures/scatter_sum_delta_charge.png"),
scatter_hydro=img_b64(f"{OUTPUT_DIR}/figures/scatter_sum_delta_hydrophobic.png"),
scatter_mw=img_b64(f"{OUTPUT_DIR}/figures/scatter_mean_abs_delta_mw.png"),
)
out = f"{OUTPUT_DIR}/report.html"
pathlib.Path(out).write_text(html, encoding="utf-8")
print(f"Report saved to {out}")
EOF
python scripts/06_report.py
```
**Expected final output:** `geno2pheno_stats_output/report.html`
---
## Expected Outputs
```text
reference.fasta
variants.csv
geno2pheno_stats_output/
├── data/
│ ├── input_source.json
│ ├── parsed_rows.json
│ ├── mutation_level_table.csv
│ ├── validation_summary.json
│ ├── validation_problems.json
│ ├── analysis_features.csv
│ ├── global_correlations.csv
│ ├── position_summary.csv
│ ├── token_summary.csv
│ ├── substitution_summary.csv
│ ├── group_tests.csv
│ ├── kruskal_tests.csv
│ ├── position_tests_vs_rest.csv
│ ├── interaction_screen.csv
│ ├── interaction_summary.json
│ └── analysis_summary.json
├── figures/
│ ├── mutation_burden_hist.png
│ ├── phenotype_hist.png
│ ├── top_position_effects.png
│ ├── substitution_heatmap.png
│ ├── top_token_effects.png
│ ├── scatter_n_mutations.png
│ ├── scatter_sum_dDiscussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.