Valine→Aspartate Is the Most Pathogenic-Enriched Valine-Reference Substitution Pair in ClinVar Missense Variants: 68.5% Pathogenic Fraction (Wilson 95% CI [63.6, 73.1]) Across 362 Records — Plus Per-Target-AA Distribution Across the 8 Valine-Reference Substitution Pairs
Abstract
We analyze the Pathogenic fraction per substitution-target amino acid for the 8 Valine-reference (Val, V) substitution pairs with ≥100 ClinVar missense single-nucleotide variants in the dbNSFP v4 (Liu et al. 2020) annotation of 372,927 ClinVar Pathogenic+Benign records (Landrum et al. 2018) returned by MyVariant.info (Wu et al. 2021), with Wilson 95% confidence intervals (Wilson 1927). Stop-gain variants (aa.alt = X) are explicitly excluded. Result: per-target-AA Pathogenic fractions span a roughly 17.6× range, from 3.9% (V → I) to 68.5% (V → D), within Valine-reference substitutions: V→D 68.5% Wilson CI [63.6, 73.1]; V→E 65.4% [60.3, 70.1]; V→G 54.5% [51.0, 58.0]; V→F 42.8% [39.1, 46.5]; V→L 20.1% [18.5, 21.8]; V→A 18.2% [16.8, 19.8]; V→M 16.4% [15.4, 17.5]; V→I 3.9% [3.5, 4.4]. Chemistry interpretation: the most Pathogenic-enriched alt AAs are aspartate and glutamate, both of which introduce a -1 charge into the typically buried hydrophobic Val position. Glycine and phenylalanine fall in the mid-range. The least Pathogenic-enriched alt AAs are isoleucine, methionine, alanine, and leucine, all hydrophobic substitutions that preserve the side-chain character. V → I at 3.9% is the lowest among V-derived pairs, consistent with a chemistry-conservative branched-chain hydrophobic-to-hydrophobic substitution (the reverse direction of the previously published I → V analysis at 4.8%). Across 7,253 V → I records (282 Pathogenic + 6,971 Benign), the substitution is Benign in ~96% of observed cases. For variant-prioritization pipelines, per-target-AA priors within Valine range from ~3.9% (V → I) to ~68.5% (V → D). Valine is a hydrophobic-core branched-chain residue; substitutions that introduce charge at typically buried positions are Pathogenic-enriched, whereas substitutions that preserve hydrophobic character are Benign-enriched.
1. Background
Valine (Val, V) is a branched-chain hydrophobic amino acid with an isopropyl side chain, -CH(CH₃)₂, one CH₂ smaller than that of Ile. Val is one of the three branched-chain amino acids (with Ile and Leu), which are often interchangeable at hydrophobic positions. As a β-branched residue, Val occurs frequently in β-strands and is also found in α-helices. Functional roles:
- Hydrophobic core packing in folded proteins; Val typically buried.
- Membrane-anchoring residues in transmembrane helices.
- β-strand-forming preference in β-sheet structures.
This paper measures the per-target-AA Pathogenic-fraction distribution within the Val-reference subset.
2. Method
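We take ClinVar missense variants (alt ≠ X) from the MyVariant.info / dbNSFP v4 annotation of 372,927 ClinVar Pathogenic+Benign records. We restrict to ref = V, group by alt AA, and require ≥100 total records (Pathogenic + Benign) per pair. For each pair we report the Pathogenic fraction n_P / (n_P + n_B) with a Wilson 95% CI (Wilson 1927; Brown et al. 2001).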
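The Wilson score interval used for each pair can be sketched as follows. This is a minimal illustration of the standard formula, not the authors' analyze.js; the function name `wilsonInterval` and its return shape are assumptions made for the example.

```javascript
// Wilson score interval for a binomial proportion (Wilson 1927).
// k = number of Pathogenic records, n = total records, z = normal quantile
// (1.959964 corresponds to a two-sided 95% interval).
function wilsonInterval(k, n, z = 1.959964) {
  if (n <= 0) throw new RangeError("n must be positive");
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { p, lower: center - half, upper: center + half };
}

// Example: the V → D counts reported in this paper (248 Pathogenic of 362).
const vd = wilsonInterval(248, 362);
console.log(
  `V→D: ${(100 * vd.p).toFixed(1)}% ` +
    `[${(100 * vd.lower).toFixed(1)}, ${(100 * vd.upper).toFixed(1)}]`
);
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves well for fractions near 0, which matters for low-fraction pairs such as V → I.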
3. Results
3.1 Per-target-AA Pathogenic fraction (sorted descending)
| V → alt | n_P | n_B | total | Pathogenic fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| V → D | 248 | 114 | 362 | 68.5% | [63.6, 73.1] |
| V → E | 234 | 124 | 358 | 65.4% | [60.3, 70.1] |
| V → G | 420 | 350 | 770 | 54.5% | [51.0, 58.0] |
| V → F | 289 | 387 | 676 | 42.8% | [39.1, 46.5] |
| V → L | 472 | 1,875 | 2,347 | 20.1% | [18.5, 21.8] |
| V → A | 446 | 1,998 | 2,444 | 18.2% | [16.8, 19.8] |
| V → M | 773 | 3,940 | 4,713 | 16.4% | [15.4, 17.5] |
| V → I | 282 | 6,971 | 7,253 | 3.9% | [3.5, 4.4] |
The 8 Val-derived pairs span a roughly 17.6× range (68.5 / 3.9), the broadest single-reference-AA range among the analyses we have published so far.
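As a consistency check, the fractions and the fold range can be recomputed directly from the n_P and n_B counts in the table. The sketch below hard-codes those counts; the variable names are illustrative and not taken from analyze.js.

```javascript
// Counts copied from the table in §3.1: [alt AA, n_P, n_B, reported % Pathogenic].
const rows = [
  ["D", 248, 114, 68.5], ["E", 234, 124, 65.4], ["G", 420, 350, 54.5],
  ["F", 289, 387, 42.8], ["L", 472, 1875, 20.1], ["A", 446, 1998, 18.2],
  ["M", 773, 3940, 16.4], ["I", 282, 6971, 3.9],
];

// Recompute each fraction and check it rounds to the reported percentage.
const fractions = rows.map(([alt, nP, nB, reported]) => {
  const fraction = nP / (nP + nB);
  const matches = (100 * fraction).toFixed(1) === reported.toFixed(1);
  return { alt, total: nP + nB, fraction, matches };
});

// Fold range between the highest and lowest Pathogenic fraction.
const values = fractions.map((r) => r.fraction);
const foldRange = Math.max(...values) / Math.min(...values);

console.log("all rows consistent:", fractions.every((r) => r.matches));
console.log("fold range:", foldRange.toFixed(1));
```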
3.2 The chemistry-class ranking
Tier 1 — Most Pathogenic Val substitutions (P-fraction > 50%):
- V → D (68.5%): Hydrophobic-to-acidic. Maximum electrostatic disruption at typically-buried hydrophobic position.
- V → E (65.4%): Hydrophobic-to-acidic (with one extra CH₂). Same mechanism as V → D.
- V → G (54.5%): Hydrophobic-to-glycine. Removes the side chain and adds backbone flexibility, disrupting hydrophobic packing.
Tier 2 — Mid-range Val substitution (P-fraction 40–45%):
- V → F (42.8%): Hydrophobic-to-aromatic; preserves hydrophobicity but changes geometry to bulky aromatic ring.
Tier 3 — Less Pathogenic Val substitutions (P-fraction 16–21%):
- V → L (20.1%): Branched-chain homolog (the Leu side chain is the Val side chain plus one CH₂; Leu and Val are not isomers).
- V → A (18.2%): Hydrophobic-to-smaller-hydrophobic (the Ala side chain is a single methyl group, two carbons smaller than Val's isopropyl group).
- V → M (16.4%): Hydrophobic-to-sulfur-containing-hydrophobic. Preserves hydrophobicity.
Tier 4 — Most Benign Val substitution (P-fraction < 5%):
- V → I (3.9%): Branched-chain homolog (the Ile side chain is the Val side chain plus one CH₂, with β-branching retained). The most chemistry-conservative V-derived substitution.
3.3 The V → D / V → E charge-introduction extremes
V → D at 68.5% Pathogenic and V → E at 65.4% are the most Pathogenic Val substitutions. Proposed mechanism: Val is typically buried in hydrophobic protein cores. Placing a charged side chain (Asp or Glu, each -1) at a buried position requires desolvating that charge in a hydrophobic environment, which is energetically unfavorable by ~5–10 kcal/mol (Honig & Yang 1995). The protein is destabilized or misfolds, which is consistent with the high Pathogenic fraction.
This is consistent with the well-known "buried charge" rule in protein biophysics: charged residues at buried positions are rare in stably folded proteins.
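To give a sense of scale for the ~5–10 kcal/mol penalty, the sketch below converts a free-energy change into the factor by which a folding equilibrium shifts, using exp(ΔΔG / RT). It assumes a simple two-state folding model at 25 °C, which is an illustrative simplification and not part of the analysis in this paper; the function name is hypothetical.

```javascript
// Gas constant in kcal/(mol·K).
const R_KCAL = 1.987204e-3;

// Factor by which an unfavorable free-energy change ddG (kcal/mol) shifts a
// two-state folding equilibrium away from the folded state: exp(ddG / RT).
// The two-state model is an assumption made for this illustration.
function destabilizationFactor(ddGKcalPerMol, temperatureK = 298.15) {
  return Math.exp(ddGKcalPerMol / (R_KCAL * temperatureK));
}

for (const ddG of [5, 10]) {
  const factor = destabilizationFactor(ddG);
  console.log(`${ddG} kcal/mol -> equilibrium shifted ~${factor.toExponential(1)}-fold`);
}
```

Under this model a 5 kcal/mol penalty shifts the equilibrium by a factor in the thousands, and 10 kcal/mol by a factor in the tens of millions, which is why a single buried charge can be enough to unfold a marginally stable domain.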
3.4 The V → I conservative-class minimum
V → I at 3.9% Pathogenic is the most Benign-skewed Valine-reference substitution. Mechanism:
- Val (-CH(CH₃)-CH₃) and Ile (-CH(CH₃)-CH₂-CH₃) are branched-chain hydrophobic amino acids.
- The chemistry change is the addition of one CH₂ group (Ile is larger).
- For most hydrophobic-core-packing positions, V and I are functionally interchangeable.
The high Benign count (6,971 vs only 282 Pathogenic) likely reflects population variation: V → I appears to be a common, well-tolerated change in many genes, although population allele frequencies were not examined here.
3.5 The V → A / V → M / V → L cluster (hydrophobic-to-hydrophobic)
V → A (18.2%), V → M (16.4%), and V → L (20.1%) all preserve hydrophobic character. Their Pathogenic fractions cluster at 16–20%, suggesting that hydrophobic substitutions for Val are usually tolerated, with a minority of disruptive cases, plausibly at functionally constrained positions.
3.6 Mean relative position is similar across pairs
All 8 V-derived pairs have a mean relative position of 0.44–0.52, close to the 0.50 expected if variants were spread evenly along the protein. We see no evidence of a per-pair positional bias for Val-reference Pathogenic variants, although a mean near 0.50 does not by itself establish a uniform distribution.
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 ClinVar curatorial bias
Val Pathogenic variants are likely over-represented in disease genes with critical hydrophobic-core Val residues (membrane channels, structural proteins, enzymes with hydrophobic substrate-binding pockets). The per-pair Pathogenic fractions therefore partly reflect curation focus on these gene families.
4.3 Codon-mutability not normalized
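Val has 4 codons (GTT, GTC, GTA, GTG). All 8 reported alt AAs are reachable by a single nucleotide change, but through different numbers and types of changes, so per-target mutational rates differ and are not normalized here. First-position changes: V → I (GTH → ATH, transition), V → M (GTG → ATG, transition), V → L (GTN → CTN and GTR → TTR, transversions), V → F (GTY → TTY, transversion). Second-position changes: V → A (GTN → GCN, transition), V → G (GTN → GGN, transversion), V → D (GTY → GAY, transversion), V → E (GTR → GAR, transversion). (H = A, C, or T; Y = C or T; R = A or G.)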
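The codon accessibility described above can be enumerated mechanically. The following sketch walks every single-nucleotide neighbor of the four Val codons under the standard genetic code and records which amino acid each one reaches; all function and variable names are illustrative.

```javascript
// Standard genetic code, indexed with base order T, C, A, G at each position.
const BASES = "TCAG";
const AMINO =
  "FFLLSSSSYY**CC*W" + // T..
  "LLLLPPPPHHQQRRRR" + // C..
  "IIIMTTTTNNKKSSRR" + // A..
  "VVVVAAAADDEEGGGG";  // G..

function translate(codon) {
  const [i, j, k] = [...codon].map((b) => BASES.indexOf(b));
  return AMINO[16 * i + 4 * j + k];
}

const isTransition = (a, b) => ["AG", "GA", "CT", "TC"].includes(a + b);

// For every Val codon, enumerate all 9 single-nucleotide neighbors and record
// which amino acid each reaches and whether the change is a transition.
function valNeighbors() {
  const reached = {}; // alt AA -> list of { from, to, type }
  for (const third of BASES) {
    const from = "GT" + third;
    for (let pos = 0; pos < 3; pos++) {
      for (const b of BASES) {
        if (b === from[pos]) continue;
        const to = from.slice(0, pos) + b + from.slice(pos + 1);
        const aa = translate(to);
        if (aa === "V" || aa === "*") continue; // skip synonymous and stop-gain
        const type = isTransition(from[pos], b) ? "transition" : "transversion";
        if (!reached[aa]) reached[aa] = [];
        reached[aa].push({ from, to, type });
      }
    }
  }
  return reached;
}

const neighbors = valNeighbors();
for (const aa of Object.keys(neighbors).sort()) {
  const paths = neighbors[aa];
  console.log(`V→${aa}: ${paths.length} path(s), ${paths[0].type}`);
}
```

The enumeration yields exactly eight missense targets (A, D, E, F, G, I, L, M), matching the eight pairs analyzed. The number of paths differs per target, from one for V → M to six for V → L, which is one reason raw counts per pair are not directly comparable.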
4.4 Per-isoform first-element AA
dbnsfp.aa.ref and dbnsfp.aa.alt can be per-isoform arrays; we use the first finite (non-missing) element of each. In ~5% of records the per-isoform values do not all match this first element.
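A minimal sketch of the first-element rule, assuming the field is either a single string or an array of per-isoform strings. Treating anything other than a single capital letter (for example "." or an empty string) as missing is an assumption of this example; the helper names are hypothetical and not taken from analyze.js.

```javascript
// Return the first usable single-letter amino-acid code from a field that may
// be a scalar or a per-isoform array. Values that are not a single capital
// letter are treated as missing (an assumption of this sketch).
function firstAA(field) {
  const items = Array.isArray(field) ? field : [field];
  for (const item of items) {
    if (typeof item === "string" && /^[A-Z]$/.test(item)) return item;
  }
  return null;
}

// True when every usable per-isoform entry equals the first one, i.e. the
// first-element rule loses no information for this record.
function isoformsAgree(field) {
  const first = firstAA(field);
  const items = Array.isArray(field) ? field : [field];
  return items
    .filter((item) => typeof item === "string" && /^[A-Z]$/.test(item))
    .every((item) => item === first);
}

console.log(firstAA(["V", "V", "A"]), isoformsAgree(["V", "V", "A"]));
```

Counting the records for which `isoformsAgree` is false on either field would be one way to quantify the per-isoform mismatch rate.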
4.5 N-threshold sensitivity
We use ≥100 total per pair. Val-derived substitutions with < 100 records (V → S, V → T, V → N, V → Q, V → K, V → R, V → H, V → W, V → Y, V → C, V → P) are not analyzed.
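The filtering and thresholding steps can be sketched as a single grouping pass. The record shape `{ ref, alt, label }` with label "P" or "B" is a hypothetical simplification of the MyVariant.info cache, and `countValPairs` is an illustrative name rather than a function from analyze.js.

```javascript
// Group Val-reference missense records by alt AA and keep pairs with enough data.
// `records` is a hypothetical array of { ref, alt, label } objects, where
// label is "P" (Pathogenic) or "B" (Benign).
function countValPairs(records, minTotal = 100) {
  const counts = new Map(); // alt AA -> { nP, nB }
  for (const { ref, alt, label } of records) {
    if (ref !== "V" || alt === "X" || alt === ref) continue; // Val missense only
    if (label !== "P" && label !== "B") continue; // ignore other labels
    const c = counts.get(alt) || { nP: 0, nB: 0 };
    if (label === "P") c.nP += 1;
    else c.nB += 1;
    counts.set(alt, c);
  }
  return [...counts]
    .map(([alt, { nP, nB }]) => ({ alt, nP, nB, total: nP + nB, fraction: nP / (nP + nB) }))
    .filter((row) => row.total >= minTotal)
    .sort((a, b) => b.fraction - a.fraction);
}
```

Lowering `minTotal` is a simple way to probe threshold sensitivity, at the cost of much wider confidence intervals for the sparsely populated pairs.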
4.6 Wilson CI assumes binomial sampling
We treat per-pair counts as binomial draws, which assumes records are independent. Under that assumption the Wilson 95% CI is appropriate (Brown et al. 2001).
4.7 ACMG-PP3/BP4 partial circularity
ClinVar Pathogenic / Benign labels are partly predictor-derived (PolyPhen / SIFT scores used as PP3 evidence). Some per-pair fractions reflect predictor-curator co-variance.
5. Implications
- Among 8 Val-derived substitution pairs, V → D is the most Pathogenic-enriched at 68.5% (Wilson CI [63.6, 73.1]) — driven by charge introduction at typically-buried hydrophobic positions.
- V → I is the least Pathogenic-enriched at 3.9% [3.5, 4.4], a conservative branched-chain substitution that adds one CH₂.
- The roughly 17.6× per-target-AA range within Valine is the broadest single-reference-AA range we have reported.
- The 4 hydrophobic-preserving V substitutions (I, M, A, L) cluster at 4–20% Pathogenic; the 2 charged substitutions (D, E) cluster at 65–69% Pathogenic.
- For variant-prioritization pipelines: per-target-AA priors within Val should be applied; V → D/E ~65–69%, V → I ~4%.
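One way a pipeline might expose these per-target-AA priors is a small lookup table. The sketch below copies the point estimates and Wilson bounds from the results table; the names `VAL_PRIORS` and `valPrior` are hypothetical. Because the values reflect the ClinVar Pathogenic/Benign class balance and the confounds discussed in §4, they are best treated as rough priors, not calibrated probabilities.

```javascript
// Point estimates and Wilson 95% bounds (in %) copied from the results table.
const VAL_PRIORS = new Map([
  ["D", [68.5, 63.6, 73.1]], ["E", [65.4, 60.3, 70.1]],
  ["G", [54.5, 51.0, 58.0]], ["F", [42.8, 39.1, 46.5]],
  ["L", [20.1, 18.5, 21.8]], ["A", [18.2, 16.8, 19.8]],
  ["M", [16.4, 15.4, 17.5]], ["I", [3.9, 3.5, 4.4]],
]);

// Look up a prior for a Val-reference missense change. Returns null when the
// pair was not analyzed (fewer than 100 records) rather than guessing a value.
function valPrior(alt) {
  const entry = VAL_PRIORS.get(alt);
  if (!entry) return null;
  const [percent, lower, upper] = entry;
  return { prior: percent / 100, lower: lower / 100, upper: upper / 100 };
}

console.log(valPrior("D"), valPrior("S"));
```

Returning null for unanalyzed targets forces the caller to choose an explicit fallback instead of silently inheriting a prior from a different substitution.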
6. Limitations
- Stop-gain excluded (§4.1).
- ClinVar curatorial bias (§4.2) toward hydrophobic-core gene families.
- No codon-mutability normalization (§4.3).
- Per-isoform first-element AA (§4.4).
- N-threshold ≥ 100 (§4.5) excludes 2-step-codon-distance pairs.
- ACMG-PP3 partial circularity (§4.7).
7. Reproducibility
- Script: analyze.js (Node.js, ~60 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with per-target-AA counts, P-fractions, Wilson 95% CIs, mean relative positions.
- Verification mode: 6 machine-checkable assertions: (a) all P-fractions in [0, 1]; (b) Wilson CIs contain the point estimate; (c) all 8 reported pairs have N ≥ 100; (d) V→D P-fraction > 0.6; (e) V→I P-fraction < 0.05; (f) sample sizes match input file contents.
node analyze.js
node analyze.js --verify
8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Honig, B., & Yang, A.-S. (1995). Free energy balance in protein folding. Adv. Protein Chem. 46, 27–58. (Buried-charge energetic-cost reference.)
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.
- Cooper, D. N., & Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease. Hum. Genet. 85, 55–74.
- Henikoff, S., & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. PNAS 89, 10915–10919.
- Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science 185, 862–864.