Codon-Position-2 Missense Single-Nucleotide Variants in ClinVar Have a 30.94% Pathogenic-Fraction (39,413 of 127,404; Wilson 95% CI [30.68, 31.19]), 4.94 Percentage Points Higher Than Codon-Position-1 Variants (26.00%, [25.75, 26.25]) and 1.14 pp Higher Than Codon-Position-3 (29.80%, [29.12, 30.48]) Across 266,198 Codon-Position-Assignable ClinVar Variants — A Genetic-Code-Structural Asymmetry Reflecting That Position-2 Nucleotide Identity Determines Amino-Acid Chemistry-Class
Abstract
We compute the per-codon-position Pathogenic-fraction of ClinVar (Landrum et al. 2018) missense single-nucleotide variants stratified by which of the 3 codon positions the substituted nucleotide occupies. For each variant we extract the HGVS-style nucleotide change from the _id field (e.g. chr4:g.1803564C>T) and the (aa.ref, aa.alt) amino-acid pair from dbNSFP v4 (Liu et al. 2020) via MyVariant.info (Wu et al. 2021), then enumerate all codons consistent with aa.ref and check which codon position(s) of those codons could produce aa.alt via the observed nucleotide change (testing both strand orientations using the canonical genetic code). Variants assignable to a unique codon position are tabulated. Stop-gain (alt = X) excluded.
| Codon position | Pathogenic | Benign | N | P-fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| 1 | 31,533 | 89,749 | 121,282 | 26.00% | [25.75, 26.25] |
| 2 | 39,413 | 87,991 | 127,404 | 30.94% | [30.68, 31.19] |
| 3 | 5,218 | 12,294 | 17,512 | 29.80% | [29.12, 30.48] |
Result: Codon-position-2 missense variants have the highest Pathogenic-fraction at 30.94%, 4.94 percentage points above codon-position-1 (26.00%) and 1.14 pp above codon-position-3 (29.80%). The Wilson 95% CIs for positions 1 and 2 are separated by ~4.4 pp; the position-3 CI [29.12, 30.48] also lies below the position-2 CI [30.68, 31.19], separated by ~0.2 pp. Total assignable variants: 266,198 (99.3% of 268,024 missense variants); 1,427 are multi-position-compatible (excluded) and 399 are unresolved (excluded).
Mechanism: the genetic code is structured so that the position-2 nucleotide identity is the primary determinant of amino-acid chemistry-class. Codons with a pyrimidine at position 2 (T or C) typically encode hydrophobic amino acids; codons with a purine at position 2 (A or G) typically encode polar or charged amino acids. Position-2 substitutions therefore always change the encoded amino acid and usually change its chemistry-class, producing more disruptive missense than position-1 substitutions (which often preserve chemistry-class) or position-3 substitutions (which are typically silent; only the missense subset is in our dataset). The 4.94-pp gap between positions 1 and 2 is the empirical magnitude of the genetic-code-structural asymmetry at the variant-curation level. The asymmetry has implications for variant-mutation-rate-based priors and for the design of synonymous-vs-missense classifiers.
1. Background
The standard genetic code maps 64 codons to 20 amino acids plus 3 stop codons. The mapping has the following well-known structural property: the nucleotide at codon position 2 is the primary determinant of the encoded amino-acid chemistry-class (Crick 1968; Woese 1965). Specifically:
- Position-2 = U (T) → hydrophobic amino acids (F, L, I, M, V).
- Position-2 = C → polar uncharged or hydrophobic amino acids (S, P, T, A).
- Position-2 = A → polar or charged amino acids (Y, H, Q, N, K, D, E, stop).
- Position-2 = G → a mixed group: cysteine, tryptophan, arginine, serine, glycine (C, W, R, S, G), plus the TGA stop codon.
The position-2 identity strongly constrains the chemistry of the encoded amino acid. Substitutions at codon position 2 therefore always change the encoded amino acid (no two sense codons that differ only at position 2 are synonymous) and usually change its chemistry-class.
Codon position 3 is the wobble position with extensive degeneracy: ~70% of position-3 changes are silent. The position-3 missense variants observed in ClinVar are a selected subset (the minority of cases where the position-3 change does change the AA); these include both chemistry-conservative substitutions (e.g., Asp↔Glu) and chemistry-class-changing ones (e.g., Asn↔Lys).
Codon position 1 changes typically produce AA changes (a few are silent, e.g., CTA→TTA and CTG→TTG, both Leu), and the changes can be either chemistry-conservative or chemistry-radical depending on context.
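The degeneracy statements above can be checked by brute force over the standard code. The following is a minimal sketch, not the paper's `analyze.js`: it assumes every single-nucleotide change out of every sense codon is weighted equally (no codon-usage or mutation-rate weighting), and the names `CODE` and `tallyByPosition` are illustrative.

```javascript
// Standard genetic code (NCBI translation table 1), bases in TCAG order.
const BASES = "TCAG";
const AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
const CODE = {};
let idx = 0;
for (const b1 of BASES)
  for (const b2 of BASES)
    for (const b3 of BASES)
      CODE[b1 + b2 + b3] = AMINO[idx++];

// For each codon position, classify every single-nucleotide change out of a
// sense codon as synonymous, missense, or nonsense (change to a stop codon).
function tallyByPosition() {
  const tally = [1, 2, 3].map(() => ({ synonymous: 0, missense: 0, nonsense: 0 }));
  for (const [codon, aa] of Object.entries(CODE)) {
    if (aa === "*") continue; // start from sense codons only
    for (let pos = 0; pos < 3; pos++) {
      for (const b of BASES) {
        if (b === codon[pos]) continue;
        const mutant = codon.slice(0, pos) + b + codon.slice(pos + 1);
        const kind = CODE[mutant] === aa ? "synonymous"
                   : CODE[mutant] === "*" ? "nonsense" : "missense";
        tally[pos][kind]++;
      }
    }
  }
  return tally;
}

const tally = tallyByPosition();
tally.forEach((t, i) => {
  const total = t.synonymous + t.missense + t.nonsense;
  console.log(`position ${i + 1}: ${(100 * t.synonymous / total).toFixed(1)}% synonymous`, t);
});
```

Under this uniform counting, 126 of the 183 position-3 changes (about 69%) are synonymous, compared with 8 of 183 at position 1 and none at position 2.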
This paper measures the empirical Pathogenic-fraction of variants at each codon position to test whether the genetic-code-structural asymmetry produces a measurable per-position Pathogenicity gradient.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info, with dbNSFP v4 annotation.
- For each variant: extract the `_id` field (HGVS-style chromosomal coordinate + nucleotide change) and parse the reference and alternate nucleotides from the `[ACGT]>[ACGT]` substring.
- Extract `dbnsfp.aa.ref` and `dbnsfp.aa.alt`. Exclude stop-gain (alt = X) and same-AA records.
After filtering: 268,024 missense SNVs.
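The parsing and filtering steps above might look roughly like the sketch below. The record shape is an assumption modelled on the field names mentioned in the text, the helper names (`parseNucleotideChange`, `first`, `toMissenseRecord`) are hypothetical, and the amino acids in the example record are placeholders rather than the real annotation of that variant.

```javascript
// Assumed record shape (illustrative only):
//   { _id: "chr4:g.1803564C>T", dbnsfp: { aa: { ref: "R", alt: "C" } } }
const SNV_PATTERN = /([ACGT])>([ACGT])/;

// Return { from, to } nucleotides parsed from an HGVS-style id, or null.
function parseNucleotideChange(id) {
  const m = SNV_PATTERN.exec(id);
  return m ? { from: m[1], to: m[2] } : null;
}

// Assumption: an annotation field may be a single value or an array of
// per-transcript values; this sketch simply takes the first entry.
function first(value) {
  return Array.isArray(value) ? value[0] : value;
}

// Keep only missense SNVs: parsable change, both amino acids present,
// not stop-gain (alt = X), and not a same-amino-acid record.
function toMissenseRecord(rec) {
  const change = parseNucleotideChange(rec._id || "");
  const aa = (rec.dbnsfp && rec.dbnsfp.aa) || {};
  const refAA = first(aa.ref);
  const altAA = first(aa.alt);
  if (!change || !refAA || !altAA) return null;
  if (altAA === "X" || refAA === altAA) return null;
  return { refAA, altAA, from: change.from, to: change.to };
}

// Placeholder amino acids, for illustration only.
const example = { _id: "chr4:g.1803564C>T", dbnsfp: { aa: { ref: "R", alt: "C" } } };
console.log(toMissenseRecord(example));
```

Taking the first array entry is a simplification; a real pipeline would need a rule for variants annotated on several transcripts.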
2.2 Codon-position assignment
For each variant with (refAA, altAA, nucleotide-change-from, nucleotide-change-to):
- Enumerate all codons that encode `refAA` per the canonical genetic code.
- For each such codon, enumerate the 3 codon positions.
- For each position and for each of the 2 strand orientations (the variant nucleotide change may be reported on the + or − strand; we test both):
  - Check if the codon nucleotide at the position matches the variant's "from" nucleotide.
  - If yes, substitute with the variant's "to" nucleotide and check if the resulting codon encodes `altAA`.
  - If yes, the variant is compatible with this codon position.
- Collect the set of compatible codon positions for the variant.
A variant is single-position-assignable if exactly one codon position is compatible.
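The enumeration described in this section can be sketched as follows. This is an illustrative reconstruction rather than the actual `analyze.js`; in particular it assumes that "testing both strand orientations" means trying the nucleotide change as reported and again with both nucleotides complemented. The function name `compatiblePositions` is hypothetical.

```javascript
// Standard genetic code (NCBI translation table 1), bases in TCAG order.
const BASES = "TCAG";
const AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
const CODE = {};
let idx = 0;
for (const b1 of BASES)
  for (const b2 of BASES)
    for (const b3 of BASES)
      CODE[b1 + b2 + b3] = AMINO[idx++];

const COMPLEMENT = { A: "T", C: "G", G: "C", T: "A" };

// Return the sorted list of codon positions (1-based) at which the
// nucleotide change from -> to can turn some refAA codon into an altAA codon.
function compatiblePositions(refAA, altAA, from, to) {
  const positions = new Set();
  const orientations = [
    [from, to],                         // change as reported
    [COMPLEMENT[from], COMPLEMENT[to]], // assumed handling of the other strand
  ];
  for (const [f, t] of orientations) {
    for (const [codon, aa] of Object.entries(CODE)) {
      if (aa !== refAA) continue;
      for (let pos = 0; pos < 3; pos++) {
        if (codon[pos] !== f) continue;
        const mutant = codon.slice(0, pos) + t + codon.slice(pos + 1);
        if (CODE[mutant] === altAA) positions.add(pos + 1);
      }
    }
  }
  return [...positions].sort((a, b) => a - b);
}

console.log(compatiblePositions("R", "H", "G", "A")); // single position
console.log(compatiblePositions("R", "S", "C", "A")); // two positions
console.log(compatiblePositions("A", "W", "C", "T")); // none
```

In this sketch Arg→His with G>A resolves to position 2 only, whereas Arg→Ser with C>A is compatible with position 1 (CGT/CGC as reported) and position 3 (AGG after complementing), so it would count as multi-position-compatible. An empty result corresponds to an unresolved variant.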
2.3 Per-position tabulation
For each codon position, count #Pathogenic and #Benign single-position-assignable variants. Compute P-fraction with Wilson 95% CI (Brown et al. 2001).
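For reference, a minimal sketch of the Wilson score interval is given below. It assumes z = 1.96 for the 95% level and no continuity correction; the paper does not state which variant was used, and `wilsonInterval` is an illustrative name.

```javascript
// Wilson score interval for a binomial proportion (no continuity correction).
// z = 1.96 is assumed for a 95% interval.
function wilsonInterval(successes, n, z = 1.96) {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { p, lower: center - half, upper: center + half };
}

// Counts for codon position 3 as reported in the table: 5,218 of 17,512.
const pos3 = wilsonInterval(5218, 17512);
console.log(
  `${(100 * pos3.p).toFixed(2)}% ` +
  `[${(100 * pos3.lower).toFixed(2)}, ${(100 * pos3.upper).toFixed(2)}]`
);
```

With the tabulated counts this should land close to the reported intervals; small differences in the last digit could arise from the exact z value used.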
3. Results
3.1 The 3-position P-fraction asymmetry
| Codon position | Pathogenic | Benign | N | P-fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| 1 | 31,533 | 89,749 | 121,282 | 26.00% | [25.75, 26.25] |
| 2 | 39,413 | 87,991 | 127,404 | 30.94% | [30.68, 31.19] |
| 3 | 5,218 | 12,294 | 17,512 | 29.80% | [29.12, 30.48] |
Position 2 has the highest P-fraction at 30.94%, 4.94 percentage points above position 1 (26.00%), and 1.14 pp above position 3 (29.80%). Wilson 95% CIs for positions 1 vs 2 are non-overlapping by ~4.4 pp.
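The separation between intervals can be read directly off the tabulated bounds. The short sketch below only does that arithmetic; `ciGap` is an illustrative helper, and a positive gap means the two intervals are disjoint by that many percentage points.

```javascript
// P-fractions and Wilson 95% bounds (in percent) copied from the table above.
const rows = {
  1: { p: 26.00, lo: 25.75, hi: 26.25 },
  2: { p: 30.94, lo: 30.68, hi: 31.19 },
  3: { p: 29.80, lo: 29.12, hi: 30.48 },
};

// Gap between two intervals: lower bound of the higher one minus upper bound
// of the lower one. Positive = disjoint, negative = overlapping.
function ciGap(a, b) {
  const [low, high] = a.p <= b.p ? [a, b] : [b, a];
  return high.lo - low.hi;
}

for (const [i, j] of [[1, 2], [3, 2], [1, 3]]) {
  console.log(`positions ${i} vs ${j}: gap ${ciGap(rows[i], rows[j]).toFixed(2)} pp`);
}
```

On the tabulated bounds all three pairs are disjoint: roughly 4.43 pp between positions 1 and 2, 2.87 pp between positions 1 and 3, and 0.20 pp between positions 3 and 2.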
3.2 The genetic-code-structural mechanism
The position-2-vs-position-1 asymmetry reflects the second-position rule of the genetic code. Codons with the same position-2 nucleotide tend to encode chemistry-similar amino acids:
- Position-2 = U (T): hydrophobic AAs (F, L, I, M, V). Substituting position 2 from U to a different nucleotide typically changes the encoded AA's chemistry-class.
- Position-2 = C: polar uncharged or small hydrophobic (S, P, T, A). Same.
- Position-2 = A: polar / charged (Y, H, Q, N, K, D, E). Same.
- Position-2 = G: a mixed group (C, W, R, S, G). Same.
Position-2 substitutions always change the encoded AA and usually its chemistry-class. They tend to produce more disruptive missense than position-1 substitutions, which can preserve the amino acid (CTA→TTA or CTG→TTG, both Leu; such silent changes are outside the missense dataset) or preserve chemistry-class (e.g., CTT Leu → TTT Phe, both hydrophobic).
3.3 The position-1 effect
Position-1 substitutions change the encoded AA chemistry-class less consistently. For example:
- CTT (Leu) → TTT (Phe): both hydrophobic, chemistry-conservative.
- CTT (Leu) → ATT (Ile): both hydrophobic, conservative.
- CCT (Pro) → TCT (Ser): chemistry-class change (helix-breaker → polar OH).
- AGT (Ser) → CGT (Arg): polar OH → positively charged, chemistry-class change.
Position-1 substitutions can be either conservative or radical depending on the specific (codon, position-1-change) pair. The position-1 P-fraction (26.00%) reflects this mixed distribution: many position-1 missense are conservative and tolerated.
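The four examples above can be checked mechanically once a chemistry-class scheme is fixed. The grouping used in this sketch is an assumption made for illustration (the paper does not define its classes formally, and other groupings are defensible); `CLASSES` and `describe` are hypothetical names.

```javascript
// Standard genetic code (NCBI translation table 1), bases in TCAG order.
const BASES = "TCAG";
const AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
const CODE = {};
let idx = 0;
for (const b1 of BASES)
  for (const b2 of BASES)
    for (const b3 of BASES)
      CODE[b1 + b2 + b3] = AMINO[idx++];

// Assumed chemistry classes, for illustration only.
const CLASSES = {
  hydrophobic: "AVLIMFW",
  polar: "STNQYC",
  positive: "KRH",
  negative: "DE",
  special: "GP",
};
const CLASS_OF = {};
for (const [name, members] of Object.entries(CLASSES))
  for (const aa of members) CLASS_OF[aa] = name;

// Translate both codons and report whether the chemistry class changes.
function describe(refCodon, altCodon) {
  const refAA = CODE[refCodon];
  const altAA = CODE[altCodon];
  return { refAA, altAA, classChange: CLASS_OF[refAA] !== CLASS_OF[altAA] };
}

for (const [ref, alt] of [["CTT", "TTT"], ["CTT", "ATT"], ["CCT", "TCT"], ["AGT", "CGT"]]) {
  console.log(ref, "->", alt, describe(ref, alt));
}
```

Under this grouping the two Leu examples stay within the hydrophobic class, while Pro→Ser and Ser→Arg cross class boundaries, matching the list above.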
3.4 The position-3 effect
Position-3 missense are a selected subset of position-3 changes. Most position-3 changes (~70% across the code, and all changes at 4-fold degenerate sites) are silent and not in our missense-only dataset. The position-3 changes that DO produce missense are typically:
- 2-fold degenerate sites: e.g., AAA↔AAG (both Lys, silent); AAA↔AAC (Lys → Asn, missense). The latter changes chemistry-class.
- Rare specific cases: e.g., TGG (Trp) → TGA (stop, excluded), TGG → TGT/TGC (Trp → Cys), AGA/AGG (Arg) → AGT/AGC (Ser, chemistry-class change).
The position-3 missense subset is small (n = 17,512, only 6.6% of assignable variants) and has P-fraction 29.80%, close to the position-2 rate (30.94%). The similarity is consistent with the position-3 missense subset being selected for chemistry-class-changing substitutions, similar to position 2.
3.5 The per-position N counts and the position-3 small-sample
Position-3 has only 17,512 variants vs 121-127k for positions 1 and 2. This reflects the genetic-code structure: most position-3 changes are silent and excluded from our missense dataset.
The position-3 Wilson CI [29.12, 30.48] is wider than positions 1 and 2 due to the smaller N, but is still substantially above position 1 (26.00%) and slightly below position 2 (30.94%).
3.6 Implications for variant-prioritization
The per-codon-position P-fraction provides a mutation-rate-independent prior on Pathogenicity that derives from genetic-code structure:
- A novel missense variant at codon position 2 has a prior P-fraction of 30.94% (Wilson 95% CI [30.68, 31.19]).
- A novel missense variant at codon position 1 has a prior P-fraction of 26.00% (Wilson 95% CI [25.75, 26.25]).
- A novel missense variant at codon position 3 has a prior P-fraction of 29.80% (Wilson 95% CI [29.12, 30.48]).
The codon-position prior is a free, deterministic, predictor-independent feature that complements per-variant predictors. It can be integrated as a meta-feature in any variant-effect ensemble.
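One way such a meta-feature could be exposed to an ensemble is as a lookup keyed by codon position, optionally on the log-odds scale. The sketch below is only an illustration of that idea: the fractions are the ClinVar-derived values from the table, while the function name and the choice of a logit transform are assumptions, not part of the paper's pipeline.

```javascript
// Pathogenic-fractions per codon position, taken from the table in 3.1.
const POSITION_PRIOR = { 1: 0.2600, 2: 0.3094, 3: 0.2980 };

// Return the prior and its log-odds for a single-position-assignable variant,
// or null when the position is unknown (multi-position or unresolved).
function codonPositionFeature(position) {
  const prior = POSITION_PRIOR[position];
  if (prior === undefined) return null;
  return { prior, logit: Math.log(prior / (1 - prior)) };
}

for (const pos of [1, 2, 3]) {
  const f = codonPositionFeature(pos);
  console.log(`position ${pos}: prior ${f.prior}, logit ${f.logit.toFixed(3)}`);
}
```

Because these fractions are computed within ClinVar, they inherit its Pathogenic/Benign mix and would need recalibration before use on a cohort with a different base rate.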
3.7 Comparison to the Ti/Tv asymmetry
The previously reported Ti/Tv asymmetry (transitions vs transversions) shows a 12.77-percentage-point P-fraction gap (Tv 37.49% vs Ti 24.72%). The codon-position asymmetry reported here (4.94 pp gap between positions 2 and 1) is smaller in magnitude but structurally distinct: Ti/Tv reflects mutation-rate asymmetry, while codon-position reflects genetic-code structure asymmetry. The two effects are partially independent — Ti and Tv variants distribute across codon positions roughly proportionally.
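To check how transitions and transversions distribute across codon positions, the two labels can be cross-tabulated. This is a hedged sketch: `substitutionType` and `crossTabulate` are illustrative names, and the input is assumed to be a list of single-position-assignable variants with `from`, `to`, and `position` fields.

```javascript
const PURINES = new Set(["A", "G"]);

// Transition (Ti): purine<->purine or pyrimidine<->pyrimidine.
// Transversion (Tv): purine<->pyrimidine. Complementing both nucleotides
// (other strand) does not change the label.
function substitutionType(from, to) {
  if (from === to) throw new Error("not a substitution");
  return PURINES.has(from) === PURINES.has(to) ? "Ti" : "Tv";
}

// Count variants by substitution type and codon position (1..3).
function crossTabulate(variants) {
  const table = { Ti: [0, 0, 0], Tv: [0, 0, 0] };
  for (const v of variants) {
    table[substitutionType(v.from, v.to)][v.position - 1]++;
  }
  return table;
}

// Toy input, for illustration only.
const toy = [
  { from: "C", to: "T", position: 1 },
  { from: "G", to: "A", position: 2 },
  { from: "C", to: "A", position: 1 },
];
console.log(crossTabulate(toy));
```

Comparing the row proportions of the resulting table across the three columns would show whether the two asymmetries are in fact close to independent.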
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 The codon-position assignment is sequence-derived, not curator-derived
The codon position of a variant is a deterministic property of the (refAA, altAA, nucleotide-change) triple computed from the canonical genetic code. It is independent of ClinVar curator labels and not derivable from any ACMG criterion. The per-position P-fraction is therefore a non-circular measurement of the genetic-code-structural-asymmetry effect.
4.3 Multi-position-compatible variants are excluded
1,427 of 268,024 variants (0.5%) are compatible with multiple codon positions and excluded from the per-position tabulation. These are mostly position-1↔position-3 ambiguity cases for short-range AA pairs reachable from multiple codons.
4.4 Strand-orientation handling
We test both strand orientations for each variant (the variant nucleotide change may be reported on the + or − strand). Strand-aware enumeration recovers 99.3% of variants as single-position-assignable.
4.5 The position-3 sample is a selected subset
Position-3 missense variants are not representative of all position-3 changes; they are the rare subset that DO change AA. The 29.80% P-fraction for position-3 missense reflects the chemistry-class-changing cases only.
4.6 ClinVar curator labels are not gold-standard
Some labels are wrong. The reported Wilson 95% CIs capture only binomial sampling variability; they do not account for errors in curator-assigned labels.
4.7 The codon-position asymmetry is small in magnitude
The 4.94-pp gap between positions 1 and 2 is statistically significant (non-overlapping Wilson 95% CIs) but small in effect size. The codon-position prior is a useful baseline meta-feature but is dominated by per-variant predictor scores in any reasonable ensemble.
5. Implications
- Codon-position-2 missense variants in ClinVar have a 30.94% Pathogenic-fraction, 4.94 pp higher than codon-position-1 (26.00%) and 1.14 pp higher than codon-position-3 (29.80%).
- The mechanism is the genetic-code's second-position rule: position-2 nucleotide identity largely determines amino-acid chemistry-class, so position-2 substitutions always change the amino acid and usually its chemistry-class.
- The codon-position prior is a free, deterministic, predictor-independent meta-feature that can be integrated into variant-effect ensembles.
- The position-3 missense subset is small (n = 17,512) but has elevated P-fraction (29.80%) consistent with the selection for chemistry-class-changing substitutions.
- The codon-position asymmetry is mutation-rate-independent, complementary to the Ti/Tv asymmetry which is mutation-rate-driven.
6. Limitations
- Stop-gain excluded (§4.1).
- Codon-position assignment is sequence-derived, not curator-derived (§4.2) — non-circular by construction.
- Multi-position-compatible variants excluded (0.5%) (§4.3).
- Strand-aware lookup recovers 99.3% (§4.4).
- Position-3 sample is selected for chemistry-changing missense (§4.5).
- ClinVar labels not gold-standard (§4.6).
- Codon-position asymmetry is small in magnitude (4.94 pp gap) but statistically significant (§4.7).
7. Reproducibility
- Script: `analyze.js` (Node.js, ~50 LOC, zero deps; embeds the canonical genetic code).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: `result.json` with per-codon-position counts, P-fractions, Wilson 95% CIs, and the assignable / multi-position / unresolved counts.
- Verification mode: 5 machine-checkable assertions: (a) position-2 P-fraction > position-1; (b) Wilson CIs for pos-1 vs pos-2 non-overlapping; (c) total assignable > 250,000; (d) unresolved < 1,000; (e) all P-fractions in [0.20, 0.35].
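The five assertions could be expressed along the following lines. The layout of `result.json` is not specified in the text, so the object shape below is a hypothetical stand-in populated with the counts and interval bounds reported in this paper; `verify` is an illustrative name, not the script's actual function.

```javascript
// Hypothetical result layout, filled with the values reported in the paper.
const result = {
  positions: {
    1: { pathogenic: 31533, benign: 89749, ci: [0.2575, 0.2625] },
    2: { pathogenic: 39413, benign: 87991, ci: [0.3068, 0.3119] },
    3: { pathogenic: 5218, benign: 12294, ci: [0.2912, 0.3048] },
  },
  multiPosition: 1427,
  unresolved: 399,
};

// Evaluate the five checks (a) to (e) and return each outcome by name.
function verify(res) {
  const frac = (r) => r.pathogenic / (r.pathogenic + r.benign);
  const pos = res.positions;
  const rows = Object.values(pos);
  const assignable = rows.reduce((s, r) => s + r.pathogenic + r.benign, 0);
  return {
    a_pos2_above_pos1: frac(pos[2]) > frac(pos[1]),
    b_ci_1_vs_2_disjoint: pos[2].ci[0] > pos[1].ci[1],
    c_assignable_over_250k: assignable > 250000,
    d_unresolved_under_1000: res.unresolved < 1000,
    e_fractions_in_range: rows.every((r) => frac(r) >= 0.20 && frac(r) <= 0.35),
  };
}

const checks = verify(result);
console.log(checks);
if (!Object.values(checks).every(Boolean)) process.exitCode = 1;
```

Setting a non-zero exit code on failure lets a verification run be used directly in a shell pipeline or CI job.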
node analyze.js
node analyze.js --verify
8. References
- Crick, F. H. C. (1968). The origin of the genetic code. J. Mol. Biol. 38, 367–379.
- Woese, C. R. (1965). On the evolution of the genetic code. Proc. Natl. Acad. Sci. USA 54, 1546–1552.
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Cooper, D. N., & Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease. Hum. Genet. 85, 55–74.
- Lynch, M. (2010). Rate, molecular spectrum, and consequences of human mutation. Proc. Natl. Acad. Sci. USA 107, 961–968.
- Karczewski, K. J., et al. (2020). gnomAD constraint spectrum. Nature 581, 434–443.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.