Tyrosine→Aspartate Is the Most Pathogenic-Enriched Tyrosine-Reference Substitution Pair in ClinVar Missense Variants: 73.4% Pathogenic Fraction (Wilson 95% CI [68.2, 78.0]) Across 308 Records — Plus Per-Target-AA Distribution Across the 6 Tyrosine-Reference Substitution Pairs
Abstract
We analyze the Pathogenic fraction per substitution-target amino acid for the 6 Tyrosine-reference (Tyr, Y) substitution pairs with ≥100 ClinVar missense single-nucleotide variants in the dbNSFP v4 (Liu et al. 2020) annotation of 372,927 ClinVar Pathogenic+Benign records (Landrum et al. 2018) returned by MyVariant.info (Wu et al. 2021), with Wilson 95% confidence intervals (Wilson 1927). Stop-gain variants (aa.alt = X) are explicitly excluded. Result: per-target-AA Pathogenic fractions span a 3.80× range, from 19.3% (Y → F) to 73.4% (Y → D), within Tyrosine-reference substitutions: Y→D 73.4% Wilson CI [68.2, 78.0]; Y→S 69.0% [63.7, 73.8]; Y→N 66.9% [60.8, 72.5]; Y→C 49.0% [47.1, 51.0]; Y→H 41.9% [39.1, 44.7]; Y→F 19.3% [15.3, 24.1]. The chemistry interpretation: the most Pathogenic-enriched alt AAs are aspartate (aromatic-to-acidic; charge introduction at a typically buried Tyr position) and serine / asparagine (aromatic-to-polar; ring removal). The least Pathogenic-enriched is phenylalanine, the chemistry-conservative aromatic-to-aromatic substitution that preserves the ring structure (Phe is Tyr without the para-hydroxyl). Y → F is the reciprocal of F → Y (which we previously analyzed at 26.0%); within ClinVar, Y → F is even more Benign-skewed at 19.3%, consistent with loss of the hydroxyl (Y → F) being less disruptive than gain of the hydroxyl (F → Y). The Tyr hydroxyl can nevertheless be a phosphorylation site for Tyr kinases, which accounts for part of the residual Y → F Pathogenic fraction. For variant-prioritization pipelines: per-target-AA priors within Tyrosine span a 3.80× range; Y → D ~73%, Y → F ~19%. Notably, Y → C at 49% Pathogenic falls in the middle: Cys introduces a thiol that can form aberrant disulfides and removes the aromatic ring; the moderate Pathogenicity reflects this dual mechanism (some Tyr positions tolerate Cys substitution, others do not).
1. Background
Tyrosine (Tyr, Y) is an aromatic amino acid with side chain (-CH₂-C₆H₄-OH; phenolic group). Tyr is one of three aromatic amino acids (with Phe and Trp); the three are biochemically related and often interchangeable in aromatic-stacking positions. Tyr is unique among the three for having a para-hydroxyl group, which provides:
- Tyrosine kinase phosphorylation acceptor: Tyr is the substrate residue for Tyr kinases (e.g., EGFR, JAK, SRC family); the para-hydroxyl is the phosphorylation site.
- Aromatic-aromatic stacking interactions (similar to Phe).
- H-bonding capability through the para-hydroxyl (unique to Tyr among the aromatics).
- Catalytic residue in some active sites (e.g., RNases, peroxidases, photosynthetic reaction centers).
This paper measures the per-target-AA Pathogenic-fraction distribution within the Tyr-reference subset.
2. Method
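We take ClinVar missense variants (alt ≠ X) from MyVariant.info with dbNSFP v4 annotation, restrict to records with reference amino acid Y, group by alternate amino acid, and require ≥100 total records (Pathogenic + Benign) per pair. For each pair we report a Wilson 95% CI on the Pathogenic fraction (Wilson 1927; Brown et al. 2001).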
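The filtering and grouping steps can be sketched as follows. This is a minimal illustration and not the analyze.js script itself: the flattened record shape `{ ref, alt, label }` and the function name `groupTyrPairs` are assumptions made for this example, and a real MyVariant.info record would first have to be reduced to that shape.

```javascript
// Hypothetical flattened record: { ref: "Y", alt: "D", label: "P" | "B" }.
// This shape is an assumption for illustration, not the MyVariant.info schema.
function groupTyrPairs(records, minTotal = 100) {
  const counts = new Map(); // alt AA -> { nP, nB }
  for (const { ref, alt, label } of records) {
    if (ref !== "Y") continue;                 // restrict to Tyr-reference records
    if (alt === "X" || alt === ref) continue;  // drop stop-gain (and any synonymous) records
    if (label !== "P" && label !== "B") continue;
    const c = counts.get(alt) ?? { nP: 0, nB: 0 };
    if (label === "P") c.nP += 1;
    else c.nB += 1;
    counts.set(alt, c);
  }
  return [...counts]
    .map(([alt, { nP, nB }]) => ({
      alt,
      nP,
      nB,
      total: nP + nB,
      pFraction: nP / (nP + nB),
    }))
    .filter((row) => row.total >= minTotal)      // N-threshold per pair
    .sort((a, b) => b.pFraction - a.pFraction);  // sorted descending
}
```

Calling `groupTyrPairs(records)` with the default threshold keeps only pairs with at least 100 records; a smaller `minTotal` is convenient when trying the function on a toy input.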
3. Results
3.1 Per-target-AA Pathogenic fraction (sorted descending)
| Y → alt | n_P | n_B | total | Pathogenic fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| Y → D | 226 | 82 | 308 | 73.4% | [68.2, 78.0] |
| Y → S | 220 | 99 | 319 | 69.0% | [63.7, 73.8] |
| Y → N | 164 | 81 | 245 | 66.9% | [60.8, 72.5] |
| Y → C | 1,268 | 1,318 | 2,586 | 49.0% | [47.1, 51.0] |
| Y → H | 485 | 673 | 1,158 | 41.9% | [39.1, 44.7] |
| Y → F | 59 | 246 | 305 | 19.3% | [15.3, 24.1] |
The 6 Tyr-derived pairs span a 3.80× range (73.4 / 19.3) in Pathogenic fraction.
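As a check on the table, the Wilson intervals and the fold range can be recomputed from the n_P and n_B counts alone. The sketch below uses the standard Wilson score formula with z = 1.959964; the function name `wilsonInterval` is chosen for this example and is not taken from analyze.js.

```javascript
// Wilson score interval for a binomial proportion (z = 1.959964 for 95%).
function wilsonInterval(successes, total, z = 1.959964) {
  const p = successes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { p, lower: center - half, upper: center + half };
}

// Counts copied from the table in Section 3.1.
const table = [
  { alt: "D", nP: 226, nB: 82 },
  { alt: "S", nP: 220, nB: 99 },
  { alt: "N", nP: 164, nB: 81 },
  { alt: "C", nP: 1268, nB: 1318 },
  { alt: "H", nP: 485, nB: 673 },
  { alt: "F", nP: 59, nB: 246 },
];

const rows = table.map(({ alt, nP, nB }) => ({
  alt,
  ...wilsonInterval(nP, nP + nB),
}));

const pct = (x) => (100 * x).toFixed(1);
for (const r of rows) {
  console.log(`Y→${r.alt}: ${pct(r.p)}% [${pct(r.lower)}, ${pct(r.upper)}]`);
}

// Fold range between the highest and lowest Pathogenic fractions (unrounded).
const fractions = rows.map((r) => r.p);
const foldRange = Math.max(...fractions) / Math.min(...fractions);
console.log(`fold range: ${foldRange.toFixed(2)}x`);
```

For Y → D and Y → F this reproduces the reported intervals [68.2, 78.0] and [15.3, 24.1]. The fold range computed from the unrounded fractions is about 3.79; the reported 3.80× is the ratio of the rounded percentages (73.4 / 19.3).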
3.2 The chemistry-class ranking
Tier 1 — Most Pathogenic Tyr substitutions (P-fraction > 65%):
- Y → D (73.4%): Aromatic-to-acidic. Maximum chemistry disruption: removes the aromatic ring and introduces a -1 charge. Where the Tyr is a tyrosine kinase substrate site, the phosphorylation acceptor is destroyed.
- Y → S (69.0%): Aromatic-to-small-polar. Removes the aromatic ring; retains a hydroxyl H-bond donor, but not at the same geometry.
- Y → N (66.9%): Aromatic-to-amide. Removes the aromatic ring; introduces amide H-bonding.
Tier 2 — Mid-range Tyr substitutions (P-fraction 40–50%):
- Y → C (49.0%): Aromatic-to-thiol. Removes aromatic ring; introduces reactive sulfhydryl that can form aberrant disulfides.
- Y → H (41.9%): Aromatic-to-aromatic-imidazole. Preserves aromatic ring (His's imidazole is also aromatic) but loses para-hydroxyl and gains partial-positive charge.
Tier 3 — Most Benign Tyr substitution (P-fraction < 25%):
- Y → F (19.3%): Aromatic-to-aromatic. The most chemistry-conservative Y-derived substitution: the ring structure is preserved and only the para-hydroxyl is lost (Phe is Tyr without the para-hydroxyl).
3.3 The Y → F conservative aromatic-class minimum
Y → F at 19.3% Pathogenic is the least Pathogenic Tyrosine-reference substitution. Mechanism:
- Both Tyr (-CH₂-C₆H₄-OH) and Phe (-CH₂-C₆H₅) carry aromatic benzene-ring side chains.
- Phe is essentially Tyr without the para-hydroxyl.
- For most aromatic-stacking positions, Y and F are functionally interchangeable.
- The hydroxyl is functionally important only at Tyr-kinase phosphorylation sites and at some catalytic / metal-coordinating positions.
The 19.3% Pathogenic fraction reflects the subset of Tyr positions where the para-hydroxyl is functionally essential (phosphorylation sites; catalytic Tyr residues; metal-coordinating Tyr residues).
The Y → F Pathogenic fraction (19.3%) is even lower than the reciprocal F → Y Pathogenic fraction (26.0% from companion analyses). Mechanistically: removing the hydroxyl (Y → F) is less disruptive than adding it (F → Y), because gaining the hydroxyl can introduce steric or H-bonding incompatibilities at positions evolved for the hydroxyl-free Phe.
3.4 The Y → D Pathogenic-enriched signal
Y → D at 73.4% Pathogenic is the most Pathogenic Tyrosine-reference substitution. Mechanism:
- Aromatic ring removed; small acidic side chain introduced.
- For tyrosine kinase substrate sites, the substitution destroys the phosphorylation acceptor.
- For aromatic-stacking interfaces, the substitution destroys the stacking interaction.
- The introduced -1 charge may also be incompatible with hydrophobic core packing.
The combined mechanisms produce the high 73.4% Pathogenic fraction.
3.5 The Y → C dual-mechanism (49.0%)
Y → C at 49.0% Pathogenic is intermediate. Mechanism: Cys removes the aromatic ring and introduces a reactive sulfhydryl group that can form aberrant disulfide bonds with nearby Cys residues. The relative weight of the two mechanisms (aromatic-stacking loss and aberrant-disulfide formation) varies by position: at some Tyr positions the aromatic-stacking loss is the dominant effect; at others the aberrant disulfide is the dominant effect. The 49.0% Pathogenic fraction reflects averaging over these positions, including those that tolerate Cys substitution.
The relatively high N (2,586 records) reflects that Y → C is a common substitution: the Tyr codon (TAY) and Cys codon (TGY) differ by one nucleotide (A → G) at the second position, a common mutational transition.
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter out records with alt = X. Reported numbers are missense-only.
4.2 ClinVar curatorial bias
Tyr Pathogenic variants are over-reported in disease genes with critical Tyr-functional residues — Tyr kinases (RTK family: EGFR, HER2, FLT3), receptor Tyr kinase substrates (insulin receptor, IGF receptor), Tyr-phosphatase substrates, melanin-synthesis tyrosinase, PKU-related phenylalanine hydroxylase. The per-pair Pathogenic fractions partly reflect curation focus on these gene families.
4.3 Codon-mutability not normalized
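Tyr has 2 codons (TAT, TAC). The per-target-AA mutational rates differ across the 6 alt AAs reported, and we do not normalize for them. All 6 are reachable by a single nucleotide change: Y → C (TAY → TGY, A>G) and Y → H (TAY → CAY, T>C) are transitions; Y → D (TAY → GAY, T>G), Y → N (TAY → AAY, T>A), Y → F (TAY → TTY, A>T), and Y → S (TAY → TCY, A>C) are transversions. The two transition-accessible pairs (Y → C, Y → H) are also the two with the largest record counts (2,586 and 1,158).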
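The codon accessibility described here can be enumerated directly from the standard genetic code. The sketch below lists every single-nucleotide neighbour of TAT and TAC and labels each change as a transition or a transversion; the helper names are chosen for this example and are not part of analyze.js.

```javascript
const BASES = "TCAG";
// Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ("*" = stop).
const AMINO =
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

function translate(codon) {
  const [i, j, k] = [...codon].map((b) => BASES.indexOf(b));
  return AMINO[16 * i + 4 * j + k];
}

// Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
function changeType(from, to) {
  const pair = [from, to].sort().join("");
  return pair === "AG" || pair === "CT" ? "transition" : "transversion";
}

// Enumerate all single-nucleotide neighbours of the given codons.
function singleNucleotideNeighbours(codons) {
  const out = [];
  for (const codon of codons) {
    for (let pos = 0; pos < 3; pos++) {
      for (const base of BASES) {
        if (base === codon[pos]) continue;
        const mutant = codon.slice(0, pos) + base + codon.slice(pos + 1);
        out.push({
          codon,
          mutant,
          position: pos + 1,
          alt: translate(mutant),
          change: `${codon[pos]}>${base}`,
          type: changeType(codon[pos], base),
        });
      }
    }
  }
  return out;
}

const tyrNeighbours = singleNucleotideNeighbours(["TAT", "TAC"]);
// Missense only: drop synonymous (Y) and stop-gain (*) neighbours.
const missense = tyrNeighbours.filter((n) => n.alt !== "Y" && n.alt !== "*");
for (const n of missense) {
  console.log(`${n.codon} -> ${n.mutant}  Y→${n.alt}  ${n.change}  ${n.type}`);
}
```

All third-position changes from TAY are either synonymous or stop-gain, so the missense neighbours come only from the first and second codon positions.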
4.4 Per-isoform first-element AA
We use the first finite element of dbnsfp.aa.ref and dbnsfp.aa.alt; approximately 5% of records show a per-isoform mismatch.
4.5 N-threshold sensitivity
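We use ≥100 total per pair. Tyr-derived substitutions with < 100 records (Y → A, Y → V, Y → L, Y → I, Y → M, Y → T, Y → Q, Y → K, Y → R, Y → G, Y → P, Y → W, Y → E) are not analyzed. None of these 13 targets is reachable from a Tyr codon (TAY) by a single nucleotide change, which is why they are rare among single-nucleotide variants.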
4.6 Wilson CI assumes binomial sampling
We treat per-pair counts as binomial. Under that assumption the Wilson 95% CI is appropriate (Brown et al. 2001).
4.7 ACMG-PP3/BP4 partial circularity
ClinVar Pathogenic / Benign labels are partly predictor-derived (PolyPhen / SIFT scores used as PP3 evidence). Some per-pair fractions reflect predictor-curator co-variance.
5. Implications
- Among 6 Tyr-derived substitution pairs, Y → D is the most Pathogenic-enriched at 73.4% (Wilson CI [68.2, 78.0]) — driven by aromatic-ring removal + charge introduction.
- Y → F is the least Pathogenic-enriched at 19.3% [15.3, 24.1] — a conservative aromatic-to-aromatic substitution.
- The 3.80× per-target-AA range within Tyrosine spans from severe disruption (Y → D) to chemistry-conservative (Y → F).
- For variant-prioritization pipelines: per-target-AA priors within Tyr should be applied; Y → D ~73%, Y → F ~19%.
- Y → C at 49.0% reflects dual mechanism: aromatic-stacking loss + aberrant-disulfide formation; the moderate Pathogenicity is the average across the two competing effects.
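One way a pipeline might consume these numbers is as a small lookup keyed by alternate amino acid. The sketch below is only an illustration: the names `TYR_PRIORS` and `tyrPrior` are hypothetical, the values are copied from the table in Section 3.1, and how such a prior is combined with other evidence is left to the pipeline.

```javascript
// Pathogenic fraction (percent) and Wilson 95% CI, copied from Section 3.1.
// Pairs below the N >= 100 threshold are deliberately absent.
const TYR_PRIORS = new Map([
  ["D", { p: 73.4, ci: [68.2, 78.0] }],
  ["S", { p: 69.0, ci: [63.7, 73.8] }],
  ["N", { p: 66.9, ci: [60.8, 72.5] }],
  ["C", { p: 49.0, ci: [47.1, 51.0] }],
  ["H", { p: 41.9, ci: [39.1, 44.7] }],
  ["F", { p: 19.3, ci: [15.3, 24.1] }],
]);

// Returns the empirical prior for a Tyr-reference substitution, or null when
// the pair was not analyzed, so callers can fall back to another prior.
function tyrPrior(ref, alt) {
  if (ref !== "Y") return null;
  return TYR_PRIORS.get(alt) ?? null;
}
```

Returning null for unanalyzed pairs (for example Y → W) keeps the lookup from implying an estimate where the data did not meet the N threshold.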
6. Limitations
- Stop-gain excluded (§4.1).
- ClinVar curatorial bias (§4.2) toward Tyr-kinase / kinase-substrate gene families.
- No codon-mutability normalization (§4.3).
- Per-isoform first-element AA (§4.4).
- N-threshold ≥ 100 (§4.5) excludes 2-step-codon-distance pairs.
- ACMG-PP3 partial circularity (§4.7).
7. Reproducibility
- Script: analyze.js (Node.js, ~60 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with per-target-AA counts, P-fractions, Wilson 95% CIs.
- Verification mode: 6 machine-checkable assertions: (a) all P-fractions in [0, 1]; (b) Wilson CIs contain the point estimate; (c) all 6 reported pairs have N ≥ 100; (d) Y→D P-fraction > 0.7; (e) Y→F P-fraction < 0.25; (f) sample sizes match input file contents.
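The six assertions could be expressed along the following lines. This is a sketch under stated assumptions, not the verification code in analyze.js: the row shape `{ alt, nP, nB, total, pFraction, lower, upper }`, the `inputCounts` object, and the function name `verifyResult` are all hypothetical.

```javascript
// Hypothetical result row: { alt, nP, nB, total, pFraction, lower, upper }.
// inputCounts is a hypothetical object: alt AA -> { nP, nB } recounted from the input.
function verifyResult(rows, inputCounts) {
  const byAlt = new Map(rows.map((r) => [r.alt, r]));
  const checks = {
    a_fractionsInUnitInterval: rows.every(
      (r) => r.pFraction >= 0 && r.pFraction <= 1
    ),
    b_ciContainsEstimate: rows.every(
      (r) => r.lower <= r.pFraction && r.pFraction <= r.upper
    ),
    c_minimumN: rows.every((r) => r.total >= 100),
    d_tyrToAspAbove70: byAlt.has("D") && byAlt.get("D").pFraction > 0.7,
    e_tyrToPheBelow25: byAlt.has("F") && byAlt.get("F").pFraction < 0.25,
    f_countsMatchInput: rows.every((r) => {
      const c = inputCounts[r.alt];
      return (
        c !== undefined &&
        c.nP === r.nP &&
        c.nB === r.nB &&
        r.total === r.nP + r.nB
      );
    }),
  };
  const failed = Object.keys(checks).filter((k) => !checks[k]);
  return { ok: failed.length === 0, failed };
}
```

Returning the list of failed check names, instead of throwing on the first failure, makes it easier to see all violated assertions in one run.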
node analyze.js
node analyze.js --verify
8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Hubbard, S. R., & Till, J. H. (2000). Protein tyrosine kinase structure and function. Annu. Rev. Biochem. 69, 373–398.
- Lemmon, M. A., & Schlessinger, J. (2010). Cell signaling by receptor tyrosine kinases. Cell 141, 1117–1134.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.
- Cooper, D. N., & Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease. Hum. Genet. 85, 55–74.
- Burley, S. K., & Petsko, G. A. (1985). Aromatic-aromatic interaction: a mechanism of protein structure stabilization. Science 229, 23–28.