Seven Amino-Acid-Substitution Pairs Are Reachable Only Via CpG-Mutational-Pathway Source Codons (RC, RH, RL, RP, RQ, SW, TM): These 7 of 150 Single-Nucleotide-Reachable AA Pairs (4.67%) Are 2.92× Over-Represented in ClinVar Missense Variants (13.64% of 267,625 Variants) and Have a 24.08% Pathogenic-Fraction Vs 30.03% for Non-CpG-Only Pairs — Documenting the Genetic-Code-Architecture × CpG-Deamination Mutation-Rate Joint Quantification
Abstract
We enumerate the genetic-code-imposed CpG-mutational-pathway requirement for all 380 ordered amino-acid-substitution pairs (refAA, altAA). For each pair, we identify all (refAA-codon, altAA-codon) pairs differing in exactly one nucleotide position, and check whether the variant nucleotide position is part of a CG dinucleotide in the source codon (i.e., a CpG-context that mutates ~10× faster than non-CpG positions due to spontaneous deamination of 5-methylcytosine; Cooper & Krawczak 1990). Result: of 150 single-nucleotide-reachable AA-pairs:
- 7 pairs are CpG-required (reachable ONLY via CG-context source codons): RC, RH, RL, RP, RQ, SW, TM.
- 131 pairs are non-CpG-only (no CG-context paths).
- 12 pairs are mixed (both CpG and non-CpG paths).
The 7 CpG-required pairs comprise only 4.67% of reachable pairs, but account for 36,499 of 267,625 variants (13.64%) in ClinVar missense single-nucleotide variants (dbNSFP v4 (Liu et al. 2020) via MyVariant.info (Wu et al. 2021); stop-gain alt = X excluded). Over-representation ratio: 2.92× — CpG-required AA-pairs are observed in ClinVar at nearly 3× their expected frequency from genetic-code structure alone.
| Class | Pairs | ClinVar Variants | P-fraction | Wilson 95% CI |
|---|---|---|---|---|
| CpG-required | 7 | 36,499 | 24.08% | [23.64, 24.52] |
| Mixed (both paths) | 12 | 33,865 | 25.84% | [25.37, 26.31] |
| Non-CpG-only | 131 | 197,261 | 30.03% | [29.83, 30.23] |
Mechanism: the 2.92× over-representation is consistent with CpG-deamination mutation-rate amplification (Cooper & Krawczak 1990; Lynch 2010). Methylated cytosines in CG dinucleotides spontaneously deaminate to thymine at ~10× the background nucleotide-substitution rate, producing C>T transitions (and G>A on the opposite strand). The 7 CpG-required pairs are ordered substitutions (R→C, R→H, R→L, R→P, R→Q, S→W, T→M); the reverse substitutions, such as C→R, start from codons without CG and are not in the set. Arginine is prevalent because four of its six codons are CGN and contain CpG. These pairs inherit the mutation-rate amplification. The 24.08% Pathogenic-fraction in CpG-required pairs (vs 30.03% in non-CpG-only) reflects the inverse relationship: the higher mutation rate produces more recurrent variants that accumulate as Benign in population databases, depressing the per-class Pathogenic-fraction.
For variant-prioritization: a novel variant in a CpG-required pair has a 1.25× lower Pathogenicity prior (24.08% vs 30.03%) than a non-CpG-only variant, not because it is intrinsically less disease-causing, but because the higher mutation rate has accumulated more Benign instances in population data. The classification is non-circular (a genetic-code-derived structural property of each AA pair) and provides actionable per-variant metadata.
1. Background
CpG-deamination mutation-rate amplification (Cooper & Krawczak 1990; Lynch 2010): methylated cytosines (5-methylcytosine) in CG dinucleotides spontaneously deaminate to thymine at ~10× the background nucleotide-substitution rate. The resulting C>T transition (G>A on the opposite strand) is the dominant CpG-mediated mutation. Roughly one third of the single base-pair substitutions causing human genetic disease surveyed by Cooper & Krawczak (1990) were transitions of this kind at CpG sites.
The genetic code assigns codons to amino acids in a fixed mapping. For some ordered amino-acid substitutions, at least one single-nucleotide path from a refAA codon to an altAA codon changes a base that lies within a CG dinucleotide of the source codon; for others, no such path exists. The CpG-required AA pairs are those for which every single-nucleotide path starts from a CG-context position, so these pairs inherit the mutation-rate amplification.
This paper enumerates the CpG-required AA-pair set, quantifies its empirical over-representation in ClinVar, and documents the per-class Pathogenicity-fraction asymmetry.
2. Method
2.1 Enumerate single-nucleotide-reachable pairs
For each of the 380 ordered (refAA, altAA) pairs with refAA ≠ altAA:
- Enumerate all codons encoding refAA and all codons encoding altAA.
- For each (refAA-codon, altAA-codon) pair with Hamming distance = 1, identify the changed position p (0, 1, or 2).
- Classify the path as CG-context if codon position p is part of a CG dinucleotide within the source codon (positions 0-1 or 1-2). CG dinucleotides that span a codon boundary are not considered.
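The enumeration steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analyze.js: the names `CODE`, `inCG`, and `enumeratePaths` are hypothetical, the standard genetic code is assumed, and stop codons are excluded as both source and target.

```javascript
// Standard genetic code, codons ordered T, C, A, G at each position; '*' = stop.
const BASES = "TCAG";
const AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
const CODE = {};
let k = 0;
for (const a of BASES) for (const b of BASES) for (const c of BASES) CODE[a + b + c] = AAS[k++];

// True if 0-based position p is the C or the G of a CG dinucleotide inside the codon.
function inCG(codon, p) {
  return (codon[p] === "C" && codon[p + 1] === "G") ||
         (codon[p] === "G" && codon[p - 1] === "C");
}

// Every missense single-nucleotide path: source codon, target codon, changed position, CG flag.
function enumeratePaths() {
  const paths = [];
  for (const src of Object.keys(CODE)) {
    const ref = CODE[src];
    if (ref === "*") continue; // stop codons are not a refAA
    for (let p = 0; p < 3; p++) {
      for (const base of BASES) {
        if (base === src[p]) continue;
        const dst = src.slice(0, p) + base + src.slice(p + 1);
        const alt = CODE[dst];
        if (alt === "*" || alt === ref) continue; // skip stop-gain and synonymous changes
        paths.push({ ref, alt, src, dst, p, cg: inCG(src, p) });
      }
    }
  }
  return paths;
}

const paths = enumeratePaths();
const reachablePairs = new Set(paths.map((x) => x.ref + x.alt));
console.log(reachablePairs.size); // number of ordered reachable pairs; the text reports 150
console.log(paths.filter((x) => x.ref === "R" && x.alt === "C")); // the R -> C paths
```

Each path records the source codon, so the CG flag is a property of the direction of the substitution: R→C starts from a CGN codon, whereas C→R starts from TGT or TGC.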
2.2 Classify per-pair CpG dependency
A pair is CpG-required if all single-nucleotide-mutation paths require CG-context source codons. A pair is non-CpG-only if no path requires CG-context. A pair is mixed if both CG and non-CG paths exist.
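A sketch of this three-way classification, assuming the standard genetic code and the within-codon CG rule from Section 2.1. The helper names (`inCG`, `classifyPairs`) are illustrative and are not taken from analyze.js.

```javascript
const BASES = "TCAG";
const AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
const CODE = {};
let k = 0;
for (const a of BASES) for (const b of BASES) for (const c of BASES) CODE[a + b + c] = AAS[k++];

// True if 0-based position p lies in a CG dinucleotide within the codon.
function inCG(codon, p) {
  return (codon[p] === "C" && codon[p + 1] === "G") ||
         (codon[p] === "G" && codon[p - 1] === "C");
}

// Map each ordered pair (refAA + altAA) to "CpG-required", "mixed", or "non-CpG-only".
function classifyPairs() {
  const seen = {}; // pair -> which kinds of path exist
  for (const src of Object.keys(CODE)) {
    const ref = CODE[src];
    if (ref === "*") continue;
    for (let p = 0; p < 3; p++) {
      for (const base of BASES) {
        const dst = src.slice(0, p) + base + src.slice(p + 1);
        const alt = CODE[dst];
        if (alt === "*" || alt === ref) continue; // also skips the unchanged codon
        const key = ref + alt;
        if (!seen[key]) seen[key] = { cg: false, nonCg: false };
        if (inCG(src, p)) seen[key].cg = true;
        else seen[key].nonCg = true;
      }
    }
  }
  const classes = {};
  for (const [pair, s] of Object.entries(seen)) {
    classes[pair] = s.cg && s.nonCg ? "mixed" : s.cg ? "CpG-required" : "non-CpG-only";
  }
  return classes;
}

const classes = classifyPairs();
const required = Object.keys(classes).filter((p) => classes[p] === "CpG-required").sort();
console.log(required.join(" ")); // RC RH RL RP RQ SW TM

const tally = {};
for (const cls of Object.values(classes)) tally[cls] = (tally[cls] || 0) + 1;
console.log(tally); // per-class pair counts
```

Tracking two booleans per pair is enough, because the class depends only on whether CG-context and non-CG-context paths exist, not on how many there are.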
2.3 ClinVar variant tabulation
For each ClinVar missense single-nucleotide variant (dbNSFP v4 via MyVariant.info; stop-gain alt = X excluded), classify into the per-pair CpG class. Tabulate Pathogenic and Benign counts. Compute Pathogenic-fraction with Wilson 95% CI (Brown et al. 2001).
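For reference, the Wilson score interval can be computed as in the sketch below. The function name `wilson` and the helper `pct` are illustrative, and z = 1.96 is assumed for the 95% level; the example input is the CpG-required row of Section 3.3.

```javascript
// Wilson score interval for k successes out of n trials (z = 1.96 for 95%).
function wilson(k, n, z = 1.96) {
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { p, lo: center - half, hi: center + half };
}

// Format a proportion as a percentage with two decimals.
const pct = (x) => (100 * x).toFixed(2);

// Pathogenic count and N for the CpG-required class.
const r = wilson(8788, 36499);
console.log(`${pct(r.p)}% [${pct(r.lo)}, ${pct(r.hi)}]`); // 24.08% [23.64, 24.52]
```

At these sample sizes the Wilson interval is close to the simple normal approximation, but it stays inside [0, 1] and behaves better for small classes.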
2.4 Over-representation analysis
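For each class, compare its share of ClinVar variants (observed) with its share of the 150 reachable pairs (expected if variants were spread uniformly across pairs). The over-representation ratio is the observed share divided by the expected share.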
3. Results
3.1 The 7 CpG-required AA-pairs
| Pair | Source codon path (1-based codon position) | Mechanism |
|---|---|---|
| R → C | CGT/CGC → TGT/TGC (position 1) | Arg-Cys via CGT→TGT, CGC→TGC |
| R → H | CGT/CGC → CAT/CAC (position 2) | Arg-His via CGT→CAT, CGC→CAC |
| R → L | CGN → CTN (position 2) | Arg-Leu via CGN→CTN |
| R → P | CGN → CCN (position 2) | Arg-Pro via CGN→CCN |
| R → Q | CGA/CGG → CAA/CAG (position 2) | Arg-Gln via CGA→CAA, CGG→CAG |
| S → W | TCG → TGG (position 2) | Ser-Trp via Ser-TCG codon → Trp-TGG |
| T → M | ACG → ATG (position 2) | Thr-Met via Thr-ACG codon → Met-ATG |
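The list is dominated by arginine-involving substitutions (5 of 7), because four of arginine's six codons are CGN and contain CpG at codon positions 1-2 (the other two, AGA and AGG, offer no single-nucleotide path to C, H, L, P, or Q). The remaining 2 (S→W, T→M) involve the specific Ser-TCG and Thr-ACG codons, which contain CpG at positions 2-3. Four of the seven substitutions (R→C, R→H, R→Q, T→M) are the C>T or G>A transitions expected from 5-methylcytosine deamination; the other three (R→L, R→P, S→W) are transversions at the same CG sites.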
3.2 The 4-class distribution
| Class | Pairs | ClinVar Variants | Variant fraction | Pair fraction | Over-rep |
|---|---|---|---|---|---|
| CpG-required | 7 | 36,499 | 13.64% | 4.67% | 2.92× |
| Mixed | 12 | 33,865 | 12.65% | 8.00% | 1.58× |
| Non-CpG-only | 131 | 197,261 | 73.71% | 87.33% | 0.84× |
| (Unreachable) | 230 | — | — | — | — |
The 7 CpG-required pairs are 2.92× over-represented in ClinVar variants relative to their share of reachable pair space (4.67% of pairs but 13.64% of variants). The 131 non-CpG-only pairs are slightly under-represented (87.33% of pairs but 73.71% of variants).
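The ratios in the table above can be recomputed from the per-class counts. The sketch below copies the pair and variant counts from that table; the variable names are illustrative.

```javascript
// Pair and variant counts copied from the table in Section 3.2.
const classes = [
  { name: "CpG-required", pairs: 7, variants: 36499 },
  { name: "Mixed", pairs: 12, variants: 33865 },
  { name: "Non-CpG-only", pairs: 131, variants: 197261 },
];
const totalPairs = classes.reduce((s, c) => s + c.pairs, 0);
const totalVariants = classes.reduce((s, c) => s + c.variants, 0);

// Over-representation = (share of variants) / (share of reachable pairs).
const overRep = classes.map((c) => ({
  name: c.name,
  ratio: (c.variants / totalVariants) / (c.pairs / totalPairs),
}));

for (const { name, ratio } of overRep) {
  console.log(name, ratio.toFixed(2) + "×");
}
// CpG-required 2.92×
// Mixed 1.58×
// Non-CpG-only 0.84×
```

The expected share here assumes every reachable pair is equally likely to appear, which is the uniform per-pair baseline described in Section 2.4.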
3.3 The Pathogenic-fraction asymmetry
| Class | Pathogenic | Benign | N | P-fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| CpG-required | 8,788 | 27,711 | 36,499 | 24.08% | [23.64, 24.52] |
| Mixed | 8,750 | 25,115 | 33,865 | 25.84% | [25.37, 26.31] |
| Non-CpG-only | 59,235 | 138,026 | 197,261 | 30.03% | [29.83, 30.23] |
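The CpG-required pairs have a lower P-fraction (24.08%) than non-CpG-only pairs (30.03%) by 5.95 percentage points, a 1.25× relative depression (30.03 / 24.08). The Wilson 95% CIs do not overlap: the upper bound for CpG-required (24.52%) lies 5.31 pp below the lower bound for non-CpG-only (29.83%).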
3.4 The mechanism: mutation-rate vs Pathogenic-fraction inverse relationship
The CpG-required pairs have higher mutation rate (~10× CpG amplification) but lower Pathogenic-fraction:
- High mutation rate → more variants observed.
- More variants observed → more recurrent variants in healthy populations.
- Recurrent variants in healthy populations → more Benign curations in ClinVar.
- More Benign curations → lower per-class Pathogenic-fraction.
Taken together, the 2.92× over-representation and the 1.25× Pathogenic-fraction depression describe the combined effect of the CpG-amplified mutation rate on ClinVar curation patterns. Both effects are consistent with the underlying CpG-deamination mechanism.
3.5 The Pathogenic count is still high in absolute terms
Despite the lower Pathogenic-fraction, the 7 CpG-required pairs account for 8,788 Pathogenic variants in ClinVar — substantial in absolute numbers due to the high mutation rate.
3.6 The 131 non-CpG-only pairs have higher P-fraction
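Non-CpG-only AA pairs (e.g., V→A, F→Y, K→N) have a lower mutation rate, fewer recurrent variants, and a higher Pathogenic-fraction (30.03%), consistent with a larger share of their observed variants being non-recurrent and disease-relevant. Direction matters: V→A is non-CpG-only, whereas A→V is mixed because GCG→GTG is a CG-context path.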
3.7 Implications for variant-prioritization
The CpG-required AA-pair classification is a non-circular metadata feature that captures mutation-rate effects on per-pair ClinVar curation:
- A novel variant in a CpG-required pair (RC, RH, RL, RP, RQ, SW, TM): prior P-fraction 24.08% — slightly Benign-leaning relative to non-CpG variants.
- A novel variant in a non-CpG-only pair: prior P-fraction 30.03% — slightly Pathogenic-leaning.
The 5.95-pp Pathogenic-fraction gap reflects the mutation-rate-driven curation asymmetry, not intrinsic biological severity. Variant-prioritization pipelines should incorporate this prior to avoid systematic mis-calibration on the 36,499 ClinVar variants in CpG-required AA pairs.
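One way a pipeline might attach this prior is a simple lookup on the ordered (refAA, altAA) pair, sketched below. The function name `cpgPrior` is hypothetical. The CpG-required set and the class-level fractions are taken from the text. The list of 12 mixed pairs is not given in the text; it is an assumption derived here by applying the Section 2 rules to the standard genetic code.

```javascript
// CpG-required pairs as listed in the text (ordered: refAA then altAA).
const CPG_REQUIRED = new Set(["RC", "RH", "RL", "RP", "RQ", "SW", "TM"]);
// Mixed pairs: ASSUMPTION, derived from the standard genetic code, not listed in the text.
const MIXED = new Set(["RG", "RS", "RW", "SL", "PL", "PQ", "PR", "TK", "TR", "AV", "AE", "AG"]);
// Class-level Pathogenic fractions from Section 3.3.
const PRIOR = { "CpG-required": 0.2408, "mixed": 0.2584, "non-CpG-only": 0.3003 };

// Returns the CpG class and class-level prior for a missense change.
// Assumes the pair is single-nucleotide reachable; unreachable pairs are not detected.
function cpgPrior(refAA, altAA) {
  const pair = refAA + altAA;
  const cls = CPG_REQUIRED.has(pair)
    ? "CpG-required"
    : MIXED.has(pair)
    ? "mixed"
    : "non-CpG-only";
  return { pair, cls, prior: PRIOR[cls] };
}

console.log(cpgPrior("R", "H")); // CpG-required, prior 0.2408
console.log(cpgPrior("H", "R")); // non-CpG-only, prior 0.3003 (source codons CAT/CAC have no CG)
```

The lookup is direction-sensitive by design, since the classification depends on the source codon. These are class-level base rates, so they are best treated as one calibration input alongside gene-level and variant-level evidence.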
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 The CpG-required classification is sequence-derived
The classification depends only on the genetic code structure and the AA-pair identity. Non-circular: independent of ClinVar curation, predictor scores, or any modern annotation.
4.3 The 10× mutation-rate amplification is well-established
CpG-deamination at ~10× background is documented in Cooper & Krawczak 1990, Lynch 2010, and many subsequent studies.
4.4 ClinVar curator labels are not gold-standard
Some ClinVar labels are wrong. The reported Wilson 95% CIs reflect sampling variability only and do not account for label error.
4.5 The 7-pair CpG-required set is small
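The 7 pairs are arginine-dominated (5 of 7). Arginine involvement creates a per-AA confound that is structurally inseparable from the CpG-required classification: arginine is the only amino acid with a full four-codon CGN family, so CG occupies codon positions 1-2 in four of its six codons (the other two are AGA and AGG).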
4.6 The mixed-paths pairs (12 pairs) are intermediate
Mixed pairs (which have both CpG and non-CpG paths) have intermediate Pathogenic-fraction (25.84%) — between CpG-required (24.08%) and non-CpG-only (30.03%). Consistent with partial mutation-rate amplification.
4.7 The over-representation reflects observed variants, not biological frequency
The 2.92× over-representation is measured on observed ClinVar variants and therefore reflects mutation rate combined with ascertainment. We have not separately estimated the underlying biological event rate, which we expect to be of similar magnitude.
5. Implications
- 7 of 150 single-nucleotide-reachable AA-pairs (4.67%) are CpG-required (reachable only via CG-context source codons): RC, RH, RL, RP, RQ, SW, TM.
- These 7 pairs are 2.92× over-represented in ClinVar missense variants (13.64% of observed vs 4.67% of reachable pairs).
- CpG-required pairs have a 24.08% Pathogenic-fraction vs 30.03% for non-CpG-only pairs — 1.25× relative depression.
- The mechanism is CpG-deamination mutation-rate amplification (~10×): higher mutation rate produces more recurrent variants curated as Benign in population databases.
- For variant-prioritization: the CpG-required AA-pair flag is a precomputable non-circular metadata feature with 5.95-pp Pathogenic-fraction depression that should be applied as a calibration correction.
6. Limitations
- Stop-gain excluded (§4.1).
- CpG-required classification is sequence-derived, non-circular (§4.2).
- 10× CpG amplification is well-established (§4.3).
- ClinVar labels not gold-standard (§4.4).
- CpG-required set is small (7 pairs) and arginine-dominated (§4.5).
- Mixed-paths pairs are intermediate (§4.6).
- Over-representation reflects observed variants (§4.7).
7. Reproducibility
- Script: analyze.js (Node.js, ~70 LOC; embeds genetic code and CpG-context detection; zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with the 4-class structural counts, ClinVar variant counts per class, Wilson 95% CIs, and over-representation ratio.
- Verification mode: 5 machine-checkable assertions: (a) 7 CpG-required pairs identified; (b) over-representation ratio > 2.5×; (c) CpG-required P-fraction < 26%; (d) non-CpG-only P-fraction > 29%; (e) all 7 CpG-required pairs are R-involving except SW and TM.
    node analyze.js
    node analyze.js --verify
8. References
- Cooper, D. N., & Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease. Hum. Genet. 85, 55–74.
- Lynch, M. (2010). Rate, molecular spectrum, and consequences of human mutation. Proc. Natl. Acad. Sci. USA 107, 961–968.
- Crick, F. H. C. (1968). The origin of the genetic code. J. Mol. Biol. 38, 367–379.
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Bird, A. P. (1980). DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res. 8, 1499–1504.
- Karczewski, K. J., et al. (2020). gnomAD constraint spectrum. Nature 581, 434–443.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.