Leucine→Proline Is a Particularly Pathogenic-Enriched Substitution Pair in ClinVar Missense Variants: 66.2% Pathogenic Fraction (Wilson 95% CI [64.7, 67.7]) Across 3,909 Records — A Hydrophobic-Helix-to-Proline-Helix-Disruptor Pair Affecting α-Helical Geometry in Folded Domains
Abstract
We analyze the Leucine → Proline (L → P) substitution pair in ClinVar missense single-nucleotide variants. It shows one of the largest single-pair Pathogenic-fraction effects we observe in the dbNSFP v4 (Liu et al. 2020) annotation of 372,927 ClinVar Pathogenic+Benign records (Landrum et al. 2018) returned by MyVariant.info (Wu et al. 2021). Result: across 3,909 L → P missense records (2,589 Pathogenic + 1,320 Benign), the per-pair Pathogenic fraction is 66.2% (Wilson 95% CI [64.7, 67.7]), substantially above the corpus-baseline Pathogenic fraction of ~28%. Mechanism: Leucine is the most frequent amino acid in α-helical regions of human proteins (~14% of helix residues; Pace & Scholtz 1998), while Proline is a known α-helix breaker (MacArthur & Thornton 1991) because its cyclic side chain constrains the backbone φ angle and removes the amide hydrogen needed for helical hydrogen bonding. The L → P substitution therefore disrupts α-helix geometry at typically helix-forming Leu positions, with high pathogenic consequence. For context, we provide the full per-target-AA distribution for Leucine-reference substitutions: L→P 66.2% [64.7, 67.7]; L→R 65.8% [63.1, 68.4]; L→Q 56.6% [51.8, 61.3]; L→H 53.7% [47.8, 59.6]; L→W 52.5% [45.2, 59.6]; L→S 36.9% [33.5, 40.5]; L→F 24.4% [22.8, 26.1]; L→V 20.1% [18.5, 21.8]; L→M 15.6% [12.6, 19.2]; L→I 12.1% [9.6, 15.0]. The 5.5× per-target-AA range (66.2 / 12.1) within Leucine-reference substitutions reflects the broad chemistry-class spread among the amino acids reachable from Leu by a single nucleotide change. L → P and L → R (65.8%; charge introduction at hydrophobic-core positions) define the high-Pathogenic regime. L → I (12.1%) and L → V (20.1%), both conservative substitutions to branched-chain hydrophobic residues, define the low-Pathogenic regime. For variant-prioritization pipelines: an observed L → P substitution carries a 66% Pathogenic prior, versus only 12% for L → I, a 5.5× difference in prior within the same reference AA.
1. Background
Leucine (Leu, L) is a hydrophobic branched-chain amino acid with side chain (-CH₂-CH(CH₃)-CH₃). Leu has 6 codons (TTA, TTG, CTT, CTC, CTA, CTG), the maximum for any amino acid and shared only with Arg and Ser, consistent with its high abundance (~10% of human proteome residues). Functional roles:
- α-helix-forming preference: Leu is the most-frequent residue in α-helices (~14% of helix residues; Pace & Scholtz 1998).
- Hydrophobic core packing: Leu is buried in protein cores at high frequency.
- Membrane-helix anchoring: Leu is enriched in transmembrane α-helices.
- Leucine-zipper coiled-coil motif: heptad-repeat Leu residues at "d" positions in coiled coils.
Proline (Pro, P) is the only proteinogenic amino acid with a cyclic side chain: the ring closes back onto the backbone amide nitrogen. This restricts the φ angle and leaves no amide hydrogen for backbone hydrogen bonding, which makes Pro an α-helix breaker (MacArthur & Thornton 1991): Pro at internal helix positions destabilizes the helix.
The L → P substitution at typically helix-forming Leu positions therefore introduces a maximally-disruptive residue. This paper measures the per-pair Pathogenic fraction of L → P across ClinVar, with Wilson 95% confidence intervals (Wilson 1927).
2. Method
We take ClinVar missense variants (alt ≠ X) from MyVariant.info with dbNSFP v4 annotation. We restrict to records with ref = L, group them by alt AA, and require ≥100 total records per pair so that the per-pair Pathogenic fraction and its Wilson 95% CI (Wilson 1927; Brown et al. 2001) are stable.
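The grouping step can be sketched as follows. This is a minimal illustration, not the code in analyze.js: the record shape `{ ref, alt, label }` and the function name `tallyByAlt` are assumptions made for the example.

```javascript
// Hypothetical record shape: { ref: "L", alt: "P", label: "P" | "B" }.
// These field names are illustrative, not the schema used by analyze.js.
function tallyByAlt(records, refAA = "L", minTotal = 100) {
  const counts = new Map();
  for (const { ref, alt, label } of records) {
    // Keep missense only: matching reference, no stop-gain (X), no synonymous.
    if (ref !== refAA || alt === "X" || alt === ref) continue;
    if (label !== "P" && label !== "B") continue;
    const c = counts.get(alt) || { nP: 0, nB: 0 };
    if (label === "P") c.nP += 1;
    else c.nB += 1;
    counts.set(alt, c);
  }
  return [...counts]
    .map(([alt, { nP, nB }]) => ({
      alt,
      nP,
      nB,
      total: nP + nB,
      fraction: nP / (nP + nB),
    }))
    .filter((row) => row.total >= minTotal) // per-pair N threshold
    .sort((a, b) => b.fraction - a.fraction); // descending Pathogenic fraction
}
```

Applying the threshold after counting, rather than before, keeps the excluded pairs visible in the intermediate map if one wants to report which pairs were dropped.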
3. Results
3.1 The L → P headline finding
| Pair | n_P | n_B | total | Pathogenic fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| L → P | 2,589 | 1,320 | 3,909 | 66.2% | [64.7, 67.7] |
The L → P pair has 3,909 total records — among the largest single-pair samples in our cache. The Pathogenic-fraction Wilson 95% CI is tight ([64.7, 67.7]) and substantially above the corpus-baseline ~28% Pathogenic.
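The interval above can be reproduced from the two counts alone. The sketch below implements the standard Wilson score interval; the function name `wilsonInterval` is chosen for this example and is not taken from analyze.js.

```javascript
// Wilson score interval for a binomial proportion (Wilson 1927).
// z = 1.959964 is the two-sided 95% normal quantile.
function wilsonInterval(successes, total, z = 1.959964) {
  if (total <= 0) throw new RangeError("total must be positive");
  const p = successes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { fraction: p, lower: center - half, upper: center + half };
}

// L → P: 2,589 Pathogenic out of 3,909 records.
const lp = wilsonInterval(2589, 3909);
console.log(
  (100 * lp.fraction).toFixed(1),
  (100 * lp.lower).toFixed(1),
  (100 * lp.upper).toFixed(1)
); // prints: 66.2 64.7 67.7
```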
3.2 Full per-target-AA Pathogenic fraction (sorted descending)
| L → alt | n_P | n_B | total | Pathogenic fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| L → P | 2,589 | 1,320 | 3,909 | 66.2% | [64.7, 67.7] |
| L → R | 797 | 414 | 1,211 | 65.8% | [63.1, 68.4] |
| L → Q | 231 | 177 | 408 | 56.6% | [51.8, 61.3] |
| L → H | 144 | 124 | 268 | 53.7% | [47.8, 59.6] |
| L → W | 96 | 87 | 183 | 52.5% | [45.2, 59.6] |
| L → S | 271 | 463 | 734 | 36.9% | [33.5, 40.5] |
| L → F | 662 | 2,051 | 2,713 | 24.4% | [22.8, 26.1] |
| L → V | 442 | 1,756 | 2,198 | 20.1% | [18.5, 21.8] |
| L → M | 73 | 395 | 468 | 15.6% | [12.6, 19.2] |
| L → I | 69 | 503 | 572 | 12.1% | [9.6, 15.0] |
The 10 Leu-derived pairs span a 5.5× range (66.2 / 12.1) in Pathogenic fraction.
3.3 The L → P proline-helix-breaker mechanism
L → P at 66.2% Pathogenic is the most Pathogenic Leucine-reference substitution. Mechanism:
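- Leucine favors α-helices: Leu is among the strongest helix formers on the Pace & Scholtz (1998) propensity scale, where Ala ranks highest. A large share of Leu residues in folded human proteins sit in α-helices.
- Proline is an α-helix breaker: Pro's pyrrolidine ring fixes the φ angle near −65°, which is itself close to the canonical α-helix value (φ ≈ −57°). The disruption comes from the ring's other consequences: the backbone nitrogen has no amide hydrogen to donate to the i → i−4 helical hydrogen bond, and the ring clashes sterically with the backbone of the preceding residue (MacArthur & Thornton 1991). Pro at internal helix positions therefore destabilizes the helix.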
- L → P substitutions therefore disrupt α-helix geometry at typically helix-forming positions, with high pathogenic consequence.
The 3,909-record sample is among the largest single-pair samples, so the 66.2% Pathogenic fraction is estimated precisely (CI [64.7, 67.7]).
3.4 The L → R Pathogenic-enrichment (charge in hydrophobic core)
L → R at 65.8% Pathogenic is nearly identical to L → P. Mechanism: Arg introduces a positive charge at typically buried hydrophobic Leu positions, which requires desolvating the charged side chain in a hydrophobic environment and is energetically unfavorable. The ~66% Pathogenic fraction is consistent with this electrostatic disruption of the core.
3.5 The L → I conservative-class minimum (12.1%)
L → I at 12.1% Pathogenic is the most Benign-skewed Leucine-reference substitution. Mechanism:
- Both Leu (-CH₂-CH(CH₃)-CH₃) and Ile (-CH(CH₃)-CH₂-CH₃) are branched-chain hydrophobic amino acids.
- Both share the same chemical formula (C₆H₁₃NO₂); they are structural isomers differing only in side-chain branching geometry.
- Both prefer α-helical or β-strand secondary structure.
- For most hydrophobic-core-packing positions, L and I are functionally interchangeable.
The high Benign count (503 vs 69 Pathogenic) is consistent with L → I being well tolerated: it is frequently observed as benign population variation.
3.6 The L → V near-conservative substitution (20.1%)
L → V at 20.1% Pathogenic is the second-least-Pathogenic Leu substitution. Val is a smaller branched-chain hydrophobic residue. The 20.1% Pathogenic fraction reflects the subset of Leu positions where the precise side-chain volume matters.
3.7 The chemistry-class continuum
The Leu-derived Pathogenic fractions cluster into 3 tiers:
- Tier 1 — Severely Pathogenic (P-fraction > 50%): L → P (helix-breaker), L → R/Q/H (charge or polar in core), L → W (large aromatic).
- Tier 2 — Mid-range (P-fraction 20–37%): L → S (hydroxyl), L → F (aromatic), L → V (smaller branched).
- Tier 3 — Conservative (P-fraction 12–16%): L → M (sulfur-containing hydrophobic), L → I (isomer).
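As a compact restatement of the tiers, the sketch below assigns each of the ten reported pairs to a tier from its Pathogenic fraction. The cut points 50 and 20 are one way to encode the quoted ranges; the names `LEU_FRACTIONS` and `tierOf` are illustrative.

```javascript
// Pathogenic fractions (percent) copied from the table in section 3.2.
const LEU_FRACTIONS = {
  P: 66.2, R: 65.8, Q: 56.6, H: 53.7, W: 52.5,
  S: 36.9, F: 24.4, V: 20.1, M: 15.6, I: 12.1,
};

// Tier boundaries follow the ranges quoted in the text: >50%, 20-37%, 12-16%.
// No reported pair falls in the gaps (37-50% and 16-20%), so the exact
// placement of the cut points inside those gaps is an assumption.
function tierOf(percent) {
  if (percent > 50) return 1;
  if (percent >= 20) return 2;
  return 3;
}

const tiers = { 1: [], 2: [], 3: [] };
for (const [alt, pct] of Object.entries(LEU_FRACTIONS)) {
  tiers[tierOf(pct)].push(alt);
}
console.log(tiers);
```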
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 ClinVar curatorial bias
Leu Pathogenic variants are over-reported in disease genes with critical α-helical Leu residues (membrane channels, transcription factors with leucine-zipper domains, structural-protein helical bundles). The L → P 66.2% Pathogenic fraction partly reflects curation focus on these gene families.
4.3 Codon-mutability not normalized
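Leu has 6 codons, and the number and type of single-nucleotide routes differ across alt AAs, so per-target-AA mutation rates differ too. The single-step routes are (R = A/G, Y = C/T, H = A/C/T, N = any base): L → P (CTN → CCN), L → R (CTN → CGN), L → I (CTH → ATH, plus TTA → ATA), L → V (CTN → GTN, plus TTR → GTR), L → M (CTG → ATG, plus TTG → ATG), L → S (TTR → TCR), L → F (TTR → TTY, plus CTY → TTY), L → Q (CTR → CAR), L → H (CTY → CAY), L → W (TTG → TGG). The Arg codons AGR and the Ser codons AGY are not reachable from any Leu codon in one step. We do not normalize the Pathogenic fractions for these differences.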
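Which amino acids are one nucleotide away from Leu, and by how many routes, can be checked mechanically against the standard genetic code. The sketch below is an illustration written for this section (the function name `singleStepNeighbors` is hypothetical); it counts routes only and does not model transition/transversion or CpG rates.

```javascript
// Standard genetic code, DNA alphabet, codons ordered T, C, A, G at each position.
const BASES = "TCAG";
const AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
const CODE = {};
let k = 0;
for (const a of BASES)
  for (const b of BASES)
    for (const c of BASES) CODE[a + b + c] = AMINO[k++];

// For every codon of refAA, apply each single-nucleotide change and record
// the codon routes leading to each other amino acid ("*" marks a stop codon).
function singleStepNeighbors(refAA) {
  const routes = {};
  for (const [codon, aa] of Object.entries(CODE)) {
    if (aa !== refAA) continue;
    for (let pos = 0; pos < 3; pos++) {
      for (const base of BASES) {
        if (base === codon[pos]) continue;
        const mutant = codon.slice(0, pos) + base + codon.slice(pos + 1);
        const alt = CODE[mutant];
        if (alt === refAA) continue; // synonymous change
        if (!routes[alt]) routes[alt] = [];
        routes[alt].push(`${codon}>${mutant}`);
      }
    }
  }
  return routes;
}

const routes = singleStepNeighbors("L");
for (const alt of Object.keys(routes).sort()) {
  console.log(alt, routes[alt].length, routes[alt].join(" "));
}
```

The reachable set is the ten amino acids analyzed in section 3.2 plus stop, which is filtered out separately.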
4.4 Per-isoform first-element AA
We take the first finite element of dbnsfp.aa.ref and dbnsfp.aa.alt when these fields list several per-isoform values. Roughly 5% of records show a per-isoform mismatch.
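One possible reading of this selection rule is sketched below. It assumes the field may arrive either as a single string or as an array with one entry per isoform, and it interprets "first finite element" as the first entry that is a single amino-acid letter; both points are assumptions, and `firstUsableAA` is a hypothetical helper name.

```javascript
// Assumption: the annotation field may be a single string or an array with
// one entry per isoform, possibly containing placeholders such as "." or null.
function firstUsableAA(field) {
  const values = Array.isArray(field) ? field : [field];
  for (const v of values) {
    // Accept only a single uppercase letter; "X" (stop) is filtered later.
    if (typeof v === "string" && /^[A-Z]$/.test(v)) return v;
  }
  return null; // no usable value in any isoform
}
```

Because only the first usable entry is kept, a variant whose isoforms disagree on the residue is counted under one pair only, which is the source of the mismatch noted above.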
4.5 N-threshold sensitivity
We use ≥100 total per pair. Leu-derived substitutions with < 100 records (L → A, L → G, L → T, L → N, L → K, L → C, L → Y, L → D, L → E) are not analyzed.
4.6 Wilson CI assumes binomial sampling
Per-pair counts are binomial. Wilson 95% CI is appropriate (Brown et al. 2001).
4.7 ACMG-PP3/BP4 partial circularity
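ClinVar Pathogenic / Benign labels are partly predictor-derived: under the ACMG/AMP guidelines (Richards et al. 2015), in-silico scores such as PolyPhen and SIFT count as PP3 (supporting pathogenic) or BP4 (supporting benign) evidence. Some per-pair fractions therefore reflect predictor-curator co-variance.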
5. Implications
- L → P is among the most Pathogenic single substitution pairs in ClinVar at 66.2% (Wilson CI [64.7, 67.7]) — driven by proline's α-helix-breaking property at typically-helical Leu positions.
- L → R at 65.8% is nearly identical — driven by charge introduction at hydrophobic-core Leu positions.
- L → I at 12.1% is the most Benign-skewed Leucine substitution, consistent with a chemically conservative swap between branched-chain isomers.
- The 5.5× per-target-AA range within Leucine spans from helix-disrupting (P, R) to chemistry-conservative (I).
- For variant-prioritization pipelines: per-target-AA priors within Leu should be applied; L → P/R ~66%, L → I ~12%.
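A pipeline could consume these priors through a small lookup, sketched below under the assumption that the caller has already established ref = L. The name `leuPrior` and the return shape are illustrative; the numbers are the point estimates and Wilson bounds reported in section 3.2.

```javascript
// Point estimate, lower and upper Wilson 95% bounds (percent), per alt AA.
const LEU_PRIORS = new Map([
  ["P", [66.2, 64.7, 67.7]], ["R", [65.8, 63.1, 68.4]],
  ["Q", [56.6, 51.8, 61.3]], ["H", [53.7, 47.8, 59.6]],
  ["W", [52.5, 45.2, 59.6]], ["S", [36.9, 33.5, 40.5]],
  ["F", [24.4, 22.8, 26.1]], ["V", [20.1, 18.5, 21.8]],
  ["M", [15.6, 12.6, 19.2]], ["I", [12.1, 9.6, 15.0]],
]);

// Returns null for pairs the paper does not analyze (fewer than 100 records),
// so callers fall back to their own default rather than a guessed value.
function leuPrior(alt) {
  const row = LEU_PRIORS.get(alt);
  if (!row) return null;
  const [prior, lower, upper] = row.map((pct) => pct / 100);
  return { prior, lower, upper };
}
```

Returning the interval alongside the point estimate lets downstream code treat wide-interval pairs such as L → W more cautiously than L → P. These priors inherit the ClinVar-specific biases discussed in section 4.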
6. Limitations
- Stop-gain excluded (§4.1).
- ClinVar curatorial bias (§4.2) toward α-helical disease-gene families.
- No codon-mutability normalization (§4.3).
- Per-isoform first-element AA (§4.4).
- N-threshold ≥ 100 (§4.5) excludes the pairs that require two nucleotide changes in the codon.
- ACMG-PP3 partial circularity (§4.7).
7. Reproducibility
- Script: analyze.js (Node.js, ~60 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with per-target-AA counts, P-fractions, Wilson 95% CIs.
- Verification mode: 6 machine-checkable assertions: (a) all P-fractions in [0, 1]; (b) Wilson CIs contain the point estimate; (c) all 10 reported pairs have N ≥ 100; (d) L→P P-fraction > 0.6; (e) L→I P-fraction < 0.15; (f) sample sizes match input file contents.
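Checks (a) through (e) could be expressed along the following lines. This is a sketch, not the `--verify` implementation: the row shape `{ alt, nP, nB, fraction, lower, upper }` (fractions on a 0 to 1 scale) is an assumption, since the layout of result.json is not given here, and check (f) is omitted because it needs the input cache.

```javascript
// Hypothetical row shape: { alt, nP, nB, fraction, lower, upper }.
// Returns a list of human-readable failures; an empty list means all pass.
function verify(rows, minTotal = 100) {
  const failures = [];
  const byAlt = new Map(rows.map((r) => [r.alt, r]));
  for (const r of rows) {
    if (!(r.fraction >= 0 && r.fraction <= 1))
      failures.push(`(a) ${r.alt}: fraction out of [0, 1]`);
    if (!(r.lower <= r.fraction && r.fraction <= r.upper))
      failures.push(`(b) ${r.alt}: CI does not contain the estimate`);
    if (!(r.nP + r.nB >= minTotal))
      failures.push(`(c) ${r.alt}: total below ${minTotal}`);
  }
  // A missing row compares as undefined, which also fails the check.
  if (!(byAlt.get("P")?.fraction > 0.6))
    failures.push("(d) L>P fraction not above 0.6");
  if (!(byAlt.get("I")?.fraction < 0.15))
    failures.push("(e) L>I fraction not below 0.15");
  return failures;
}
```

Collecting failures instead of throwing on the first one gives a complete report in a single run.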
    node analyze.js
    node analyze.js --verify

8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- MacArthur, M. W., & Thornton, J. M. (1991). Influence of proline residues on protein conformation. J. Mol. Biol. 218, 397–412.
- Pace, C. N., & Scholtz, J. M. (1998). A helix propensity scale based on experimental studies of peptides and proteins. Biophys. J. 75, 422–427.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.
- Cooper, D. N., & Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease. Hum. Genet. 85, 55–74.
- Chou, P. Y., & Fasman, G. D. (1978). Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. 47, 45–148.