Among 7 Asparagine-Reference Substitution Pairs in ClinVar Missense Variants With ≥100 Records: Asn→Ile Is the Most Pathogenic-Enriched (48.9% Pathogenic, Wilson 95% CI [44.1, 53.7]) and Asn→Ser Is the Least (12.4% [11.5, 13.3]) — A 3.95× Range Within the Polar-Amide Reference Amino Acid
Abstract
We compute the Pathogenic fraction per substitution-target amino acid for the 7 Asparagine-reference (Asn, N) substitution pairs with ≥100 ClinVar missense single-nucleotide variants in the dbNSFP v4 (Liu et al. 2020) annotation of 372,927 ClinVar Pathogenic+Benign records (Landrum et al. 2018) returned by MyVariant.info (Wu et al. 2021), with Wilson 95% confidence intervals (Wilson 1927). Stop-gain variants (aa.alt = X) are explicitly excluded. Result: per-target-AA Pathogenic fractions span a 3.95× range from 12.4% (N → S) to 48.9% (N → I): N→I 48.9% Wilson CI [44.1, 53.7]; N→Y 43.4% [37.6, 49.3]; N→K 40.9% [38.4, 43.4]; N→T 27.5% [23.4, 32.0]; N→H 26.4% [22.7, 30.6]; N→D 24.4% [22.2, 26.8]; N→S 12.4% [11.5, 13.3]. The chemistry interpretation: the most Pathogenic-enriched alt AAs are isoleucine (bulky branched-chain hydrophobic), tyrosine (large aromatic ring), and lysine (charge introduction, polar amide → basic). The least Pathogenic-enriched is serine, a smaller polar residue that preserves H-bonding capacity through its hydroxyl. At 12.4% Pathogenic, N → S is among the most Benign-leaning single-pair priors we have observed in ClinVar; it is essentially an amide-to-hydroxyl swap, a chemistry-conservative substitution. For variant-prioritization pipelines, this 3.95× spread is the broadest chemistry-driven variability we have observed in per-AA analyses. Asparagine is a polar uncharged amino acid commonly found in N-glycosylation sequons (N-X-S/T). Any substitution of the sequon Asn removes the glycan-attachment residue; N → I/Y/K additionally change the local chemistry, while N → S preserves the polar character.
1. Background
Asparagine (Asn, N) is a polar uncharged amino acid with a primary amide side chain (-CH₂-CO-NH₂). The Asn side chain is not titratable in the physiological pH range; the amide is electrically neutral but contributes H-bond donors and acceptors. Functional roles include:
- N-glycosylation sequon (N-X-S/T): Asn is the glycan-attachment residue in the canonical N-glycosylation motif (Bause 1983); the consensus N-X-S/T (with X being any AA except P) must be intact for glycosylation. Substitutions of the Asn (i.e., N → other) typically abolish glycosylation.
- Active-site catalysis: Asn participates in H-bonding networks at enzyme active sites (e.g., in many oxidoreductases).
- Asparagine deamidation: Asn spontaneously deamidates to Asp at physiological pH, and faster at elevated temperature, which complicates the interpretation of N → D variants in long-lived proteins.
This paper measures the per-target-AA Pathogenic-fraction distribution within the Asn-reference subset.
2. Method
We take ClinVar missense variants (alt ≠ X) from MyVariant.info / dbNSFP v4, restrict to ref = N, group by alt AA, and require ≥100 total records (Pathogenic + Benign) per pair. For each pair we report the Pathogenic fraction n_P / (n_P + n_B) with a Wilson 95% CI.
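The grouping and interval steps can be sketched as follows. This is a minimal illustration, not the paper's analyze.js: the record shape `{ ref, alt, label }` and the function names `wilsonInterval` and `summarizeAsnPairs` are assumptions made for the example, and z = 1.96 is used for the 95% level.

```javascript
// Wilson score interval for a binomial proportion (z = 1.96 for ~95%).
function wilsonInterval(successes, total, z = 1.96) {
  if (total <= 0) throw new RangeError("total must be positive");
  const p = successes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { p, lower: center - half, upper: center + half };
}

// Hypothetical record shape: { ref: "N", alt: "I", label: "P" | "B" }.
function summarizeAsnPairs(records, minTotal = 100) {
  const counts = new Map();
  for (const { ref, alt, label } of records) {
    // Keep Asn-reference missense only: drop stop-gain (X) and synonymous.
    if (ref !== "N" || alt === "X" || alt === ref) continue;
    const c = counts.get(alt) ?? { nP: 0, nB: 0 };
    if (label === "P") c.nP += 1;
    else if (label === "B") c.nB += 1;
    counts.set(alt, c);
  }
  return [...counts]
    .map(([alt, { nP, nB }]) => ({ alt, nP, nB, total: nP + nB }))
    .filter((row) => row.total >= minTotal) // apply the >=100 threshold first
    .map((row) => ({ ...row, ...wilsonInterval(row.nP, row.total) }))
    .sort((a, b) => b.p - a.p); // descending Pathogenic fraction
}
```

Calling `wilsonInterval(198, 405)` gives bounds that round to the [44.1, 53.7] interval reported for N → I.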
3. Results
3.1 Per-target-AA Pathogenic fraction (sorted descending)
| N → alt | n_P | n_B | total | Pathogenic fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| N → I | 198 | 207 | 405 | 48.9% | [44.1, 53.7] |
| N → Y | 118 | 154 | 272 | 43.4% | [37.6, 49.3] |
| N → K | 595 | 860 | 1,455 | 40.9% | [38.4, 43.4] |
| N → T | 114 | 301 | 415 | 27.5% | [23.4, 32.0] |
| N → H | 125 | 348 | 473 | 26.4% | [22.7, 30.6] |
| N → D | 334 | 1,034 | 1,368 | 24.4% | [22.2, 26.8] |
| N → S | 613 | 4,334 | 4,947 | 12.4% | [11.5, 13.3] |
The 7 Asn-derived pairs span a 3.95× range in Pathogenic fraction (computed from the unrounded fractions, 198/405 versus 613/4,947).
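As a quick arithmetic check, the fold range can be recomputed from the table counts. The sketch below only copies n_P and n_B from the table above; the variable names are illustrative.

```javascript
// Counts copied from the table in section 3.1.
const table = [
  { alt: "I", nP: 198, nB: 207 },
  { alt: "Y", nP: 118, nB: 154 },
  { alt: "K", nP: 595, nB: 860 },
  { alt: "T", nP: 114, nB: 301 },
  { alt: "H", nP: 125, nB: 348 },
  { alt: "D", nP: 334, nB: 1034 },
  { alt: "S", nP: 613, nB: 4334 },
];

// Pathogenic fraction per pair, then the highest and lowest pair.
const fractions = table.map(({ alt, nP, nB }) => ({ alt, p: nP / (nP + nB) }));
const maxPair = fractions.reduce((a, b) => (b.p > a.p ? b : a));
const minPair = fractions.reduce((a, b) => (b.p < a.p ? b : a));
const foldRange = maxPair.p / minPair.p;

console.log(
  `N->${maxPair.alt} ${(100 * maxPair.p).toFixed(1)}% vs ` +
    `N->${minPair.alt} ${(100 * minPair.p).toFixed(1)}%: ${foldRange.toFixed(2)}x`
);
```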
3.2 The chemistry-class ranking
Tier 1 — Most Pathogenic Asn substitutions (P-fraction > 40%):
- N → I (48.9%): Polar-to-hydrophobic + bulky branched-chain. Disrupts N-glycosylation sequons, H-bonding networks, and surface polarity.
- N → Y (43.4%): Polar-to-aromatic + large volume increase. Disrupts active-site H-bonding and adds steric bulk.
- N → K (40.9%): Charge introduction (uncharged → basic). Disrupts surface electrostatics and N-glycosylation sequons.
Tier 2 — Mid-range Asn substitutions (P-fraction 24–28%):
- N → T (27.5%): Polar-to-polar with hydroxyl. Smaller volume but loses amide H-bond donor.
- N → H (26.4%): Polar-to-aromatic-ring with imidazole. Loses amide; gains aromatic + partial-positive charge.
- N → D (24.4%): Charge introduction (uncharged → acidic). Adds a negative charge but closely preserves side-chain geometry (the same change as Asn → Asp deamidation).
Tier 3 — Least Pathogenic Asn substitution (P-fraction < 15%):
- N → S (12.4%): Smaller polar substitution preserving H-bonding through hydroxyl. Loses amide; conservative volume change. Most chemistry-conservative N-derived substitution.
3.3 The N → S conservative-class minimum
N → S at 12.4% Pathogenic is the least Pathogenic Asn-derived substitution. Mechanism:
- Both Asn (-CH₂-CO-NH₂) and Ser (-CH₂-OH) are polar uncharged residues.
- Both can H-bond as donor and acceptor.
- Side-chain volume difference is ~25 ų (Asn larger).
- The chemistry change is loss of the amide carbonyl + amide nitrogen, replaced with a single hydroxyl.
For most surface-positioned Asn residues, Ser is plausibly a functionally tolerated replacement. The high Benign count (4,334) is consistent with N → S being a common population variant.
3.4 The N → I Pathogenic-enriched signal
N → I at 48.9% Pathogenic is the most Pathogenic Asn-derived substitution. Mechanism:
- Polar amide (-CO-NH₂) replaced with a hydrophobic branched chain (-CH(CH₃)-CH₂-CH₃). Maximum chemistry disruption.
- Bulky alt residue may sterically clash in positions evolved for the smaller, polar Asn.
- N-glycosylation sequon (N-X-S/T) is destroyed: the Ile cannot accept N-linked glycans.
- For active-site Asn residues, the loss of H-bonding capability disrupts catalysis.
3.5 The N → D / Asn → Asp interpretation caveat
N → D at 24.4% Pathogenic is the substitution mimicking spontaneous deamidation of Asn → Asp (Robinson & Robinson 2001). In long-lived proteins, this deamidation occurs spontaneously at slow rates; ClinVar Pathogenic submissions for N → D variants represent the subset where the deamidation is functionally consequential (e.g., active-site residues, glycosylation sequons, structurally constrained loops). The 24.4% Pathogenic fraction is moderate, consistent with most Asn positions tolerating deamidation.
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter out alt = X. Reported numbers are missense-only.
4.2 ClinVar curatorial bias
Asn Pathogenic variants are over-reported in disease genes with critical Asn-functional residues — N-glycosylation sequons in secreted/membrane proteins (CFTR, factor IX, lysosomal hydrolases), catalytic Asn residues, and stable Asn positions in structured domains. The per-pair Pathogenic fractions partly reflect curation focus on these gene families.
4.3 Codon-mutability not normalized
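Asn has 2 codons (AAT, AAC; written AAY). The per-target-AA mutational rates differ across the 7 alt AAs reported. All 7 are reachable by a single-nucleotide substitution, but only two are transitions: N → S (AAY → AGY) and N → D (AAY → GAY), both A→G. The other five are transversions: N → K (AAY → AAR), N → H (AAY → CAY), N → I (AAY → ATY), N → Y (AAY → TAY), and N → T (AAY → ACY). Because transitions generally arise more often than transversions, N → S and N → D are expected to be over-represented by mutation supply alone. We report the raw P-fraction observed in ClinVar without normalizing for this.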
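The single-nucleotide neighbourhood of the two Asn codons can be enumerated directly from the standard genetic code. The sketch below is illustrative only (the function name `asnNeighbours` is an assumption); it lists every missense target reachable by one substitution and labels each change as a transition (purine↔purine or pyrimidine↔pyrimidine) or a transversion.

```javascript
// Standard genetic code, codons ordered by base index in "TCAG".
const BASES = "TCAG";
const AMINO =
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
const translate = (codon) =>
  AMINO[
    16 * BASES.indexOf(codon[0]) +
      4 * BASES.indexOf(codon[1]) +
      BASES.indexOf(codon[2])
  ];

const PURINES = new Set(["A", "G"]);
// A transition keeps the base class (purine or pyrimidine) unchanged.
const isTransition = (from, to) => PURINES.has(from) === PURINES.has(to);

// Map each missense target of AAT/AAC to the kinds of change that reach it.
function asnNeighbours() {
  const result = new Map(); // alt AA -> Set of "transition" | "transversion"
  for (const codon of ["AAT", "AAC"]) {
    for (let pos = 0; pos < 3; pos++) {
      for (const base of BASES) {
        if (base === codon[pos]) continue;
        const mutant = codon.slice(0, pos) + base + codon.slice(pos + 1);
        const aa = translate(mutant);
        if (aa === "N" || aa === "*") continue; // skip synonymous and stop
        const kind = isTransition(codon[pos], base)
          ? "transition"
          : "transversion";
        if (!result.has(aa)) result.set(aa, new Set());
        result.get(aa).add(kind);
      }
    }
  }
  return result;
}

for (const [aa, kinds] of asnNeighbours()) {
  console.log(`N -> ${aa}: ${[...kinds].join(", ")}`);
}
```

The enumeration yields exactly the 7 alt AAs in the table, which is why no other Asn-derived pair can arise from a single-nucleotide variant.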
4.4 Per-isoform first-element AA
We use the first finite element of dbnsfp.aa.ref and dbnsfp.aa.alt when these fields carry multiple per-isoform values, which leaves a ~5% per-isoform mismatch.
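One way to read this rule in code is shown below. It is a sketch under stated assumptions: "first finite element" is taken to mean the first entry that is a single-letter amino-acid code, the fields may be a string or an array, and the helper names `firstAminoAcid` and `refAltOf` are hypothetical rather than taken from analyze.js.

```javascript
// Return the first usable single-letter amino-acid code from a field that may
// be a string, an array (one entry per isoform), or missing.
function firstAminoAcid(field) {
  const values = Array.isArray(field) ? field : [field];
  for (const v of values) {
    if (typeof v === "string" && /^[A-Z]$/.test(v)) return v;
  }
  return null;
}

// Extract { ref, alt } from one MyVariant.info-style hit object.
function refAltOf(hit) {
  const aa = hit?.dbnsfp?.aa;
  // Assumption: if dbnsfp.aa is itself an array, use its first entry.
  const entry = Array.isArray(aa) ? aa[0] : aa;
  return {
    ref: firstAminoAcid(entry?.ref),
    alt: firstAminoAcid(entry?.alt),
  };
}
```

For example, `refAltOf({ dbnsfp: { aa: { ref: "N", alt: [".", "S", "T"] } } })` skips the placeholder and returns ref N with alt S; a variant whose isoforms disagree on the alt residue is assigned to whichever comes first, which is the source of the per-isoform mismatch.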
4.5 N-threshold sensitivity
We use ≥100 total per pair. Asn-derived substitutions with < 100 records (N → A, N → V, N → L, N → M, N → F, N → W, N → P, N → C, N → G, N → R, N → Q, N → E) are not analyzed; each of these requires at least two nucleotide changes from an Asn codon.
4.6 Wilson CI assumes binomial sampling
Per-pair counts are binomial. Wilson 95% CI is appropriate (Brown et al. 2001).
4.7 ACMG-PP3/BP4 partial circularity
ClinVar Pathogenic / Benign labels are partly predictor-derived (PolyPhen / SIFT scores used as PP3 evidence). Some per-pair fractions reflect predictor-curator co-variance.
4.8 N-glycosylation sequon disruption is gene-context-specific
Many N-derived Pathogenic variants are in N-glycosylation sequons (N-X-S/T). Loss of glycosylation may be Pathogenic or Benign depending on the specific protein's reliance on glycosylation. We do not stratify by sequon-membership; the per-pair fractions are the unconditional aggregates.
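If such a stratification were attempted, sequon membership could be flagged from the protein sequence. The sketch below is hypothetical: protein sequences are not among the inputs described in this paper, and the function name `isSequonAsn` and its 1-based position argument are assumptions for illustration. It tests only the canonical N-X-S/T pattern with X ≠ P.

```javascript
// True if the residue at 1-based position pos1 is an Asn that starts a
// canonical N-X-S/T sequon (X is any residue except Pro).
function isSequonAsn(proteinSeq, pos1) {
  const i = pos1 - 1;
  if (proteinSeq[i] !== "N") return false;
  const x = proteinSeq[i + 1];
  const third = proteinSeq[i + 2];
  return x !== undefined && x !== "P" && (third === "S" || third === "T");
}

console.log(isSequonAsn("MANGTK", 3)); // N-G-T: canonical sequon
console.log(isSequonAsn("MANPTK", 3)); // N-P-T: Pro at X breaks the motif
```

A motif match marks a potential site only; whether the Asn is actually glycosylated depends on the protein and its cellular context.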
5. Implications
- Among 7 Asn-derived substitution pairs, N → I is the most Pathogenic-enriched at 48.9% (Wilson CI [44.1, 53.7]) — driven by polar-to-hydrophobic chemistry disruption.
- N → S is the least Pathogenic-enriched at 12.4% [11.5, 13.3] — a conservative polar-to-polar substitution preserving H-bonding.
- The 3.95× per-target-AA range within Asparagine is one of the broadest we have observed in per-AA analyses, reflecting Asn's chemistry-class diversity in the substitution neighborhood.
- For variant-prioritization pipelines: per-target-AA priors within Asn should be applied; N → I ~49%, N → S ~12%.
- N-glycosylation sequon disruption is a likely contributor to the N → I/Y/K Pathogenic signal in secreted/membrane proteins.
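For pipelines that want to consume these numbers, the table in section 3.1 can be carried as a small lookup. This is a sketch only: the names `ASN_PRIORS` and `asnPrior` are illustrative, values are the reported percentages expressed as fractions, and pairs outside the 7 analyzed return null instead of a guessed prior.

```javascript
// Point estimates and Wilson 95% bounds copied from section 3.1 (as fractions).
const ASN_PRIORS = new Map([
  ["I", { p: 0.489, lower: 0.441, upper: 0.537 }],
  ["Y", { p: 0.434, lower: 0.376, upper: 0.493 }],
  ["K", { p: 0.409, lower: 0.384, upper: 0.434 }],
  ["T", { p: 0.275, lower: 0.234, upper: 0.32 }],
  ["H", { p: 0.264, lower: 0.227, upper: 0.306 }],
  ["D", { p: 0.244, lower: 0.222, upper: 0.268 }],
  ["S", { p: 0.124, lower: 0.115, upper: 0.133 }],
]);

// Return the per-pair prior for an Asn-reference substitution, or null when
// the pair was not analyzed (non-Asn reference or fewer than 100 records).
function asnPrior(ref, alt) {
  if (ref !== "N") return null;
  return ASN_PRIORS.get(alt) ?? null;
}
```

These are raw ClinVar fractions, so the confounds in section 4 (curation bias, codon mutability, predictor circularity) carry over to any score built on them.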
6. Limitations
- Stop-gain excluded (§4.1).
- ClinVar curatorial bias (§4.2) toward N-glycosylation-sequon and catalytic-Asn gene families.
- No codon-mutability normalization (§4.3).
- Per-isoform first-element AA (§4.4).
- N-threshold ≥ 100 (§4.5) excludes 2-step-codon-distance pairs.
- ACMG-PP3 partial circularity (§4.7).
- No N-glycosylation-sequon stratification (§4.8).
7. Reproducibility
- Script: analyze.js (Node.js, ~60 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with per-target-AA counts, P-fractions, Wilson 95% CIs, mean relative positions.
- Verification mode: 6 machine-checkable assertions: (a) all P-fractions in [0, 1]; (b) Wilson CIs contain the point estimate; (c) all 7 reported pairs have N ≥ 100; (d) N→I P-fraction > 0.45; (e) N→S P-fraction < 0.15; (f) sample sizes match input file contents.
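The assertions listed above could be checked along the following lines. This is a sketch, not the contents of analyze.js: the row shape `{ alt, nP, nB, total, p, lower, upper }` is an assumed layout for result.json, and check (f) is reduced to internal consistency because the input cache is not available here.

```javascript
// Hypothetical row shape: { alt, nP, nB, total, p, lower, upper }, with
// fractions expressed in [0, 1]. Returns the names of failed checks.
function verify(rows) {
  const failures = [];
  const check = (name, ok) => {
    if (!ok) failures.push(name);
  };
  const byAlt = new Map(rows.map((r) => [r.alt, r]));

  check("(a) fractions in [0, 1]", rows.every((r) => r.p >= 0 && r.p <= 1));
  check(
    "(b) CI contains point estimate",
    rows.every((r) => r.lower <= r.p && r.p <= r.upper)
  );
  check("(c) every pair has N >= 100", rows.every((r) => r.total >= 100));
  check("(d) N->I fraction > 0.45", byAlt.has("I") && byAlt.get("I").p > 0.45);
  check("(e) N->S fraction < 0.15", byAlt.has("S") && byAlt.get("S").p < 0.15);
  // Stand-in for (f): totals must equal nP + nB; the real check would
  // recount records from the input file.
  check("(f) totals equal nP + nB", rows.every((r) => r.total === r.nP + r.nB));
  return failures;
}
```

An empty return value means every check passed; a caller could exit non-zero when the array is non-empty.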
node analyze.js
node analyze.js --verify
8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Bause, E. (1983). Structural requirements of N-glycosylation of proteins. Biochem. J. 209, 331–336.
- Robinson, N. E., & Robinson, A. B. (2001). Molecular clocks: deamidation of asparaginyl and glutaminyl residues in peptides and proteins. (Asn deamidation reference.)
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.
- Cooper, D. N., & Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease. Hum. Genet. 85, 55–74.
- Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science 185, 862–864.