Serine→Tryptophan Is the Most Pathogenic-Enriched Serine-Reference Substitution Pair in ClinVar Missense Variants: 56.2% Pathogenic Fraction (Wilson 95% CI [49.3, 62.9]) Across 201 Records — Plus Per-Target-AA Distribution Across the 12 Serine-Reference Substitution Pairs Spanning a 5.7× Range
Abstract
We analyze the Pathogenic fraction per substitution-target amino acid for the 12 Serine-reference (Ser, S) substitution pairs with ≥100 ClinVar missense single-nucleotide variants in the dbNSFP v4 (Liu et al. 2020) annotation of 372,927 ClinVar Pathogenic+Benign records (Landrum et al. 2018) returned by MyVariant.info (Wu et al. 2021), with Wilson 95% confidence intervals (Wilson 1927). Stop-gain variants (aa.alt = X) are explicitly excluded. Serine has 12 alt-AA pairs at or above the 100-record threshold, among the largest per-AA neighbor sets we have observed; this reflects Ser's 6-codon set (TCT, TCC, TCA, TCG, AGT, AGC) and the position of its two codon blocks in the genetic-code table, adjacent to many other amino acids. Result: per-target-AA Pathogenic fractions within Serine-reference substitutions span a 5.7× range, from 9.8% (S → A) to 56.2% (S → W): S→W 56.2% Wilson CI [49.3, 62.9]; S→I 35.5% [31.2, 40.1]; S→Y 35.1% [31.3, 39.0]; S→R 33.6% [31.3, 36.0]; S→F 32.9% [30.7, 35.1]; S→P 31.2% [29.1, 33.3]; S→C 19.0% [16.8, 21.3]; S→L 18.9% [17.5, 20.3]; S→N 12.3% [11.1, 13.7]; S→T 11.8% [10.1, 13.8]; S→G 10.2% [8.9, 11.7]; S→A 9.8% [7.8, 12.2]. The chemistry interpretation: the most Pathogenic-enriched alt AAs are tryptophan (small-polar-to-large-aromatic; maximum volume increase), followed by isoleucine and tyrosine (small-polar-to-bulky-or-aromatic). The least Pathogenic-enriched are alanine, glycine, threonine, and asparagine, all small or polar residues that largely preserve Ser's small-polar character. Ser is a phosphorylation-acceptor residue (Ser/Thr kinase substrates), an O-glycosylation acceptor, and a catalytic residue in serine proteases; substitutions that disrupt the small polar character (S → W, I, Y, F) are pathogenic-enriched, while substitutions preserving it (S → T, A, G, N) are benign-enriched. For variant-prioritization pipelines: Ser substitutions show a moderate 5.7× per-pair range, from ~10% for S → A to ~56% for S → W.
1. Background
Serine (Ser, S) is a polar uncharged amino acid with a small hydroxyl-bearing side chain (-CH₂-OH). It is encoded by 6 codons, tied with Leu and Arg for the most degenerate codon set. Functional roles include:
- Phosphorylation acceptor: Ser is the primary substrate for Ser/Thr kinases (PKA, PKC, MAPKs); the hydroxyl is the phosphorylation site.
- O-glycosylation acceptor (mucin O-glycans).
- Catalytic residue in the serine-protease catalytic triad (Ser-His-Asp): Ser195 in chymotrypsin / trypsin / elastase / Factor Xa / thrombin.
- H-bonding networks in active sites and at protein surfaces.
This paper measures the per-target-AA Pathogenic-fraction distribution within the Ser-reference subset.
2. Method
We take ClinVar missense variants (alt ≠ X) from MyVariant.info / dbNSFP v4, restrict to ref = S, group by alt AA, and require ≥100 total records (Pathogenic + Benign) per pair. We report a Wilson 95% CI on each per-pair Pathogenic fraction (Wilson 1927; Brown et al. 2001).
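The grouping and interval steps can be sketched as follows. This is a minimal illustration rather than the contents of analyze.js: the record shape ({ ref, alt, label }) and the function names wilson and serPairs are assumptions made for the example. Only the alt ≠ X filter, the ≥100 threshold, and the use of the Wilson score interval come from the text.

```javascript
// Wilson score interval for k successes out of n trials (z = 1.96 for ~95%).
function wilson(k, n, z = 1.96) {
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { p, lo: center - half, hi: center + half };
}

// Hypothetical record shape: { ref: "S", alt: "W", label: "P" | "B" }.
function serPairs(records, minTotal = 100) {
  const counts = new Map();
  for (const r of records) {
    // Keep Ser-reference missense only: drop stop-gain (X) and synonymous records.
    if (r.ref !== "S" || r.alt === "X" || r.alt === r.ref) continue;
    const c = counts.get(r.alt) ?? { nP: 0, nB: 0 };
    if (r.label === "P") c.nP += 1;
    else if (r.label === "B") c.nB += 1;
    counts.set(r.alt, c);
  }
  const rows = [];
  for (const [alt, { nP, nB }] of counts) {
    const total = nP + nB;
    if (total < minTotal) continue; // per-pair threshold
    rows.push({ alt, nP, nB, total, ...wilson(nP, total) });
  }
  // Sort by Pathogenic fraction, descending.
  return rows.sort((a, b) => b.p - a.p);
}
```

As a check against the reported numbers, wilson(113, 201) evaluates to a fraction of about 0.562 with bounds of about 0.493 and 0.629.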
3. Results
3.1 Per-target-AA Pathogenic fraction (sorted descending)
| S → alt | n_P | n_B | total | Pathogenic fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| S → W | 113 | 88 | 201 | 56.2% | [49.3, 62.9] |
| S → I | 156 | 283 | 439 | 35.5% | [31.2, 40.1] |
| S → Y | 206 | 381 | 587 | 35.1% | [31.3, 39.0] |
| S → R | 515 | 1,019 | 1,534 | 33.6% | [31.3, 36.0] |
| S → F | 596 | 1,217 | 1,813 | 32.9% | [30.7, 35.1] |
| S → P | 569 | 1,256 | 1,825 | 31.2% | [29.1, 33.3] |
| S → C | 216 | 923 | 1,139 | 19.0% | [16.8, 21.3] |
| S → L | 565 | 2,430 | 2,995 | 18.9% | [17.5, 20.3] |
| S → N | 315 | 2,241 | 2,556 | 12.3% | [11.1, 13.7] |
| S → T | 141 | 1,054 | 1,195 | 11.8% | [10.1, 13.8] |
| S → G | 185 | 1,625 | 1,810 | 10.2% | [8.9, 11.7] |
| S → A | 67 | 620 | 687 | 9.8% | [7.8, 12.2] |
The 12 Ser-derived pairs span a 5.7× range (56.2 / 9.8) in Pathogenic fraction.
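The fractions and the range ratio can be recomputed directly from the counts in the table. The sketch below is only an arithmetic check on the published numbers; the array layout and variable names are choices made for this example.

```javascript
// Counts copied from the table in Section 3.1:
// [alt, n_P, n_B, reported total, reported Pathogenic %].
const table = [
  ["W", 113, 88, 201, 56.2], ["I", 156, 283, 439, 35.5],
  ["Y", 206, 381, 587, 35.1], ["R", 515, 1019, 1534, 33.6],
  ["F", 596, 1217, 1813, 32.9], ["P", 569, 1256, 1825, 31.2],
  ["C", 216, 923, 1139, 19.0], ["L", 565, 2430, 2995, 18.9],
  ["N", 315, 2241, 2556, 12.3], ["T", 141, 1054, 1195, 11.8],
  ["G", 185, 1625, 1810, 10.2], ["A", 67, 620, 687, 9.8],
];

const rows = table.map(([alt, nP, nB, total, reported]) => {
  const pct = (100 * nP) / (nP + nB);
  return {
    alt,
    pct,
    totalOk: nP + nB === total,
    // Reported values are rounded to one decimal place.
    pctOk: Math.abs(pct - reported) < 0.05,
  };
});

const reportedPcts = table.map((row) => row[4]);
const rangeRatio = Math.max(...reportedPcts) / Math.min(...reportedPcts);

console.log(rows.every((r) => r.totalOk && r.pctOk), rangeRatio.toFixed(1));
```

Run as written, this prints `true 5.7`: every row's total and rounded fraction agree with its counts, and 56.2 / 9.8 rounds to 5.7.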
3.2 The chemistry-class ranking
Tier 1 — Most Pathogenic Ser substitution (P-fraction > 50%):
- S → W (56.2%): Polar-to-large-aromatic. Maximum volume increase among Ser substitutions; introduces the largest amino acid (Trp) at typically-small-polar Ser positions.
Tier 2 — Mid-range Ser substitutions (P-fraction 30–36%):
- S → I (35.5%): Polar-to-branched-chain-hydrophobic.
- S → Y (35.1%): Polar-to-aromatic with hydroxyl. Preserves H-bonding via hydroxyl but adds aromatic ring + bulk.
- S → R (33.6%): Polar-to-charged-basic.
- S → F (32.9%): Polar-to-aromatic without hydroxyl.
- S → P (31.2%): Polar-to-helix-disrupter. Pro introduction breaks helix geometry.
Tier 3 — Less Pathogenic Ser substitutions (P-fraction 18–20%):
- S → C (19.0%): Hydroxyl-to-thiol; introduces reactive sulfhydryl that can form aberrant disulfides.
- S → L (18.9%): Polar-to-hydrophobic-bulky.
Tier 4 — Most Benign Ser substitutions (P-fraction < 13%):
- S → N (12.3%): Polar-to-amide; preserves H-bonding capacity.
- S → T (11.8%): Polar-to-polar; gains methyl group, preserves hydroxyl. Phosphorylation-acceptor preserved.
- S → G (10.2%): Polar-to-glycine; removes the side chain (Gly has only a hydrogen) and increases backbone flexibility.
- S → A (9.8%): Polar-to-small-aliphatic; loses the hydroxyl, leaving a methyl side chain.
3.3 The S → A conservative-class minimum
S → A at 9.8% Pathogenic is the least Pathogenic Serine-reference substitution. Mechanism:
- Ser (-CH₂-OH) and Ala (-CH₃) both have small side chains.
- Ala loses Ser's hydroxyl group entirely (no H-bond donor/acceptor; no phosphorylation acceptor).
- For the subset of Ser positions where the hydroxyl is non-functional (e.g., flexible-loop Ser residues without phosphorylation/H-bonding role), Ala substitution is functionally interchangeable.
- The 9.8% Pathogenic fraction reflects the subset where the hydroxyl matters (e.g., Ser-protease catalytic residues, Ser/Thr kinase phosphorylation sites).
The high Benign count (620 vs 67 Pathogenic) is consistent with S → A being well tolerated at most positions and frequently observed as population variation.
3.4 The S → T phosphorylation-preserving substitution (11.8%)
S → T at 11.8% Pathogenic is the third-least-Pathogenic Ser-derived substitution, after S → A (9.8%) and S → G (10.2%). Mechanism: Thr is essentially Ser with a methyl group on the β-carbon. Both Ser and Thr have hydroxyl side chains and serve as Ser/Thr kinase substrates. For most kinase phosphorylation sites, S → T is functionally interchangeable.
The very low 11.8% Pathogenic fraction reflects the high population-frequency of S ↔ T variation in non-essential Ser positions.
3.5 The S → W Pathogenic-enriched extreme (56.2%)
S → W at 56.2% Pathogenic is the most Pathogenic Ser-derived substitution. Mechanism: Trp is the largest amino acid (~180 ų side-chain volume vs Ser's ~30 ų — a 6× volume increase). Introducing Trp at a typically-small-polar Ser position causes major steric clash with the surrounding structure. The aromatic indole side chain also disrupts polar H-bonding networks.
The S → W substitution is reachable by a single-nucleotide change from only one of the six Ser codons (TCG → TGG), which helps explain why it is relatively rare (201 records). The high Pathogenic fraction is consistent with the substitution being severely disruptive.
3.6 The serine-protease catalytic triad context
Many Ser Pathogenic variants come from serine-protease genes (chymotrypsin family, complement system C1S/C1R, coagulation factors II/VII/IX/X, plasminogen, urokinase, kallikreins). The catalytic Ser195 in these enzymes is essential for hydrolytic activity; substitutions abolish enzyme function. The high Pathogenic fraction at the catalytic-Ser positions contributes to the per-pair Pathogenic fractions for the bulky-alt-AA substitutions (S → W, I, Y, F).
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 ClinVar curatorial bias
Ser Pathogenic variants are over-reported in disease genes with critical Ser-functional residues — serine-protease genes, kinase substrates with Ser-phosphorylation sites, mucins with O-glycosylation Ser residues.
4.3 No phosphorylation-site annotation stratification
We do not stratify Ser residues by phosphorylation-site annotation (e.g., PhosphoSitePlus). A complementary analysis using known phosphorylation sites would refine the per-pair signal.
4.4 Codon-mutability not normalized
Ser has 6 codons (TCT, TCC, TCA, TCG, AGT, AGC), tied with Leu and Arg for the most degenerate codon set. Per-target-AA mutational rates differ across the 12 alt AAs reported, and the per-pair counts are not corrected for this.
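One first-order view of the unequal mutational supply is to count how many single-nucleotide changes lead from a Ser codon to each alt AA. The sketch below enumerates these paths under the standard genetic code. It is an illustration added here, not part of the reported analysis, and it ignores context-dependent mutation rates (transition/transversion bias, CpG effects), so the path counts are not mutation rates.

```javascript
// Standard genetic code, codons enumerated with bases in T, C, A, G order.
const BASES = "TCAG";
const AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
const CODE = {};
let idx = 0;
for (const b1 of BASES)
  for (const b2 of BASES)
    for (const b3 of BASES) CODE[b1 + b2 + b3] = AMINO[idx++];

// Collect every single-nucleotide path from a Ser codon to a non-Ser outcome.
function serNeighbors() {
  const paths = {};
  const serCodons = Object.keys(CODE).filter((c) => CODE[c] === "S");
  for (const codon of serCodons) {
    for (let pos = 0; pos < 3; pos++) {
      for (const b of BASES) {
        if (b === codon[pos]) continue;
        const mutant = codon.slice(0, pos) + b + codon.slice(pos + 1);
        const aa = CODE[mutant];
        if (aa === "S") continue; // synonymous change
        if (!paths[aa]) paths[aa] = [];
        paths[aa].push(`${codon}>${mutant}`);
      }
    }
  }
  return paths;
}

const paths = serNeighbors();
for (const [aa, list] of Object.entries(paths)) {
  console.log(aa, list.length, list.join(" "));
}
```

Under the standard code this yields 12 missense targets (A, C, F, G, I, L, N, P, R, T, W, Y) plus stop, the same 12 alt AAs reported in Section 3.1. Trp has a single path (TCG>TGG), whereas Arg and Thr have six each.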
4.5 Per-isoform first-element AA
We use the first finite element of dbnsfp.aa.ref and dbnsfp.aa.alt. The per-isoform mismatch rate is ~5%.
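The first-element rule can be sketched as below. The helper name firstValid, the example hit, and the choice to treat null, empty strings, and "." as missing are all assumptions made for illustration; the text only states that the first finite element of each field is used.

```javascript
// Return the first usable amino-acid code from a scalar or per-isoform array.
// Treating null, "", and "." as missing is an assumption, not taken from the paper.
function firstValid(field) {
  const values = Array.isArray(field) ? field : [field];
  for (const v of values) {
    if (typeof v === "string" && v !== "" && v !== ".") return v;
  }
  return null;
}

// Hypothetical hit, shaped like a fragment of a MyVariant.info response.
const hit = { dbnsfp: { aa: { ref: ["S", "S"], alt: [".", "W"] } } };
const ref = firstValid(hit.dbnsfp.aa.ref);
const alt = firstValid(hit.dbnsfp.aa.alt);
console.log(ref, alt);
```

Because only one element per field is kept, a variant whose isoforms disagree on the amino-acid change is assigned to a single substitution pair, which is the source of the mismatch described above.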
4.6 N-threshold sensitivity
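We use ≥100 total per pair. Ser-derived substitutions with < 100 records (S → V, S → M, S → H, S → Q, S → K, S → D, S → E) are not analyzed; each of these requires at least two nucleotide changes from any Ser codon, so they are expected to be rare among single-nucleotide variants.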
4.7 Wilson CI assumes binomial sampling
Per-pair counts are binomial. Wilson 95% CI is appropriate (Brown et al. 2001).
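For reference, the Wilson score interval has the standard form below. The symbols are notation introduced here rather than taken from the analysis script: \hat{p} = n_P / n is the observed Pathogenic fraction for a pair, n = n_P + n_B is the pair's total, and z is the normal quantile (z ≈ 1.96 for 95%).

```latex
\[
\frac{\hat{p} + \dfrac{z^{2}}{2n}}{1 + \dfrac{z^{2}}{n}}
\;\pm\;
\frac{z}{1 + \dfrac{z^{2}}{n}}
\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n} + \frac{z^{2}}{4n^{2}}}
\]
```

For S → A (n_P = 67, n = 687), \hat{p} ≈ 0.0975 and the interval evaluates to approximately [0.078, 0.122], in agreement with the [7.8, 12.2] reported in Section 3.1.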
4.8 ACMG-PP3/BP4 partial circularity
ClinVar Pathogenic / Benign labels are partly predictor-derived (PolyPhen / SIFT scores used as PP3 evidence). Some per-pair fractions reflect predictor-curator co-variance.
5. Implications
- Among 12 Ser-derived substitution pairs, S → W is the most Pathogenic-enriched at 56.2% (Wilson CI [49.3, 62.9]) — driven by maximum volume increase and aromatic-ring introduction.
- S → A is the least Pathogenic-enriched at 9.8% [7.8, 12.2] — small-polar-to-small-aliphatic substitution.
- The 5.7× per-target-AA range within Serine spans 12 alt-AA pairs, reflecting Ser's degenerate 6-codon set and the broad chemistry diversity of accessible substitutions.
- For variant-prioritization pipelines: per-target-AA priors within Ser should be applied; S → W ~56%, S → A ~10%.
- The Ser-derived bottom 4 pairs (S → A/G/T/N at 9.8–12.3%) are uniformly very-Benign, consistent with Ser positions tolerating any small or polar alt residue.
6. Limitations
- Stop-gain excluded (§4.1).
- ClinVar curatorial bias (§4.2) toward serine-protease and Ser-phosphorylation gene families.
- No phosphorylation-site annotation stratification (§4.3).
- No codon-mutability normalization (§4.4).
- Per-isoform first-element AA (§4.5).
- N-threshold ≥ 100 (§4.6).
- ACMG-PP3 partial circularity (§4.8).
7. Reproducibility
- Script: analyze.js (Node.js, ~60 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with per-target-AA counts, P-fractions, Wilson 95% CIs.
- Verification mode: 6 machine-checkable assertions: (a) all P-fractions in [0, 1]; (b) Wilson CIs contain the point estimate; (c) all 12 reported pairs have N ≥ 100; (d) S→W P-fraction > 0.5; (e) S→A P-fraction < 0.13; (f) sample sizes match input file contents.

    node analyze.js
    node analyze.js --verify

8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Hedstrom, L. (2002). Serine protease mechanism and specificity. Chem. Rev. 102, 4501–4524.
- Hornbeck, P. V., et al. (2015). PhosphoSitePlus, 2014: mutations, PTMs and recalibrations. Nucleic Acids Res. 43, D512–D520.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.
- Cooper, D. N., & Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease. Hum. Genet. 85, 55–74.
- Songyang, Z., et al. (1994). Use of an oriented peptide library to determine the optimal substrates of protein kinases. Curr. Biol. 4, 973–982.