The Standard Genetic Code Limits ClinVar Single-Nucleotide Missense Variants to 150 of 380 Possible Amino-Acid Substitutions (39.5% Reachable; Hamming-Distance-1), With Conservative Substitutions (Grantham < 50) 1.84× Over-Represented Among Reachable Pairs Vs Radical Pairs (59.38% of 64 Conservative Pairs Reachable Vs 32.26% of 62 Radical Pairs) — A Codon-Architecture-Imposed Limit on Single-Nucleotide-Variant Pathogenicity Diversity
Abstract
We enumerate the codon-Hamming-distance-1 reachability of all 380 ordered amino-acid-substitution pairs (refAA, altAA) with refAA ≠ altAA under the standard human genetic code. For each pair, we compute the minimum Hamming distance between any codon encoding refAA and any codon encoding altAA. Single-nucleotide-reachable pairs are those with min-Hamming = 1 — i.e., a single base substitution can convert a refAA codon to an altAA codon. Result:
- 150 of 380 ordered pairs (39.47%) are reachable from a single nucleotide change.
- 230 of 380 (60.53%) are unreachable and require ≥ 2 nucleotide changes.
- Of the 230 unreachable pairs: 202 require min 2 changes; 28 require min 3 changes (the latter is the maximum possible for any 3-nucleotide codon).
Pairs at the maximum codon distance (exactly 3 nucleotide changes), listed as the 14 unordered pairs that give the 28 ordered pairs: C↔E, C↔K, C↔M, C↔Q, D↔M, D↔W, E↔F, F↔K, F↔Q, H↔M, H↔W, I↔W, M↔Y, N↔W. Pairs involving Trp, Cys, or Met dominate this subset because these AAs have few codons (W = TGG only; C = TGT/TGC; M = ATG only).
Grantham-distance-bin enrichment among reachable pairs:
| Grantham bin | Reachable pairs | All pairs | Reachable-fraction-of-bin |
|---|---|---|---|
| Conservative (G < 50) | 38 | 64 | 59.38% |
| Mod-Conservative (50–99) | 58 | 140 | 41.43% |
| Mod-Radical (100–149) | 34 | 114 | 29.82% |
| Radical (G ≥ 150) | 20 | 62 | 32.26% |
The Conservative bin's reachable-fraction is 1.84× that of the Radical bin (59.38% vs 32.26%) and 1.50× the all-pairs reachable-fraction (59.38% vs 39.47%). This is consistent with the classical genetic-code error-minimization property (Freeland & Hurst 1998): the standard genetic code is structured such that single-nucleotide changes tend to produce chemistry-conservative amino-acid substitutions, which reduces the fitness impact of point mutations. Within the ClinVar single-nucleotide variant cache (267,625 variants in reachable pairs), the per-bin Pathogenic-fractions are 18.62% (Conservative), 27.01% (Mod-Conservative), 44.42% (Mod-Radical), and 49.83% (Radical): a monotonic gradient consistent with chemistry-distance predicting Pathogenicity. For variant interpretation: the 230 unreachable AA-pair substitutions cannot arise from a single-nucleotide variant and are observed only when two or three nucleotides of a codon change together, as in multi-nucleotide variants or two single-nucleotide variants in cis within the same codon.
1. Background
The standard genetic code maps 64 codons to 20 amino acids plus 3 stop codons. Each amino acid has between 1 (M, W) and 6 (L, R, S) codons assigned to it. The code is highly degenerate at codon position 3 (wobble) and chemistry-class-organized at codon position 2 (Crick 1968).
A consequence of the degeneracy and chemistry-class-organization: single-nucleotide changes can produce only a subset of all 380 possible amino-acid substitutions. For a substitution (refAA, altAA) to be reachable from a single nucleotide change, there must exist a refAA-codon and an altAA-codon differing in exactly one nucleotide position. AA pairs whose codons all differ at ≥ 2 positions are unreachable in single nucleotide variation.
The error-minimization hypothesis (Freeland & Hurst 1998; Higgs 2009) posits that the standard genetic code is structured such that single-nucleotide errors tend to produce chemistry-conservative substitutions. The hypothesis has been quantified by comparing the standard code to randomly permuted alternatives: Freeland & Hurst (1998) estimated that only about one in a million random codes minimizes chemistry error better than the standard code once mutational and translational biases are taken into account.
This paper provides the direct enumeration of single-nucleotide-reachable AA-pair substitutions and characterizes their chemistry-distance distribution, with implications for the ClinVar single-nucleotide variant cache.
2. Method
2.1 Genetic code
The standard nuclear genetic code (NCBI translation table 1; the vertebrate mitochondrial code differs, see §4.4): 61 sense codons + 3 stop codons. Each of the 20 standard amino acids has between 1 and 6 sense codons.
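As an illustration of the code table described in this subsection, the following minimal sketch builds the standard codon table from the conventional T, C, A, G ordering and tallies sense codons, stop codons, and per-amino-acid degeneracy. The names `BASES`, `AMINO`, `codonsByAA`, and `degeneracy` are chosen here for illustration and are not taken from analyze.js.

```javascript
// Standard nuclear genetic code, written in the conventional T, C, A, G order
// (the layout used by NCBI translation table 1). "*" marks a stop codon.
const BASES = "TCAG";
const AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

// Group the sense codons by the amino acid they encode.
const codonsByAA = {};
let stopCount = 0;
for (let i = 0; i < 64; i++) {
  const codon = BASES[i >> 4] + BASES[(i >> 2) & 3] + BASES[i & 3];
  const aa = AMINO[i];
  if (aa === "*") {
    stopCount++;
    continue;
  }
  if (!codonsByAA[aa]) codonsByAA[aa] = [];
  codonsByAA[aa].push(codon);
}

// Number of sense codons overall and per amino acid.
const senseCount = 64 - stopCount;
const degeneracy = Object.fromEntries(
  Object.entries(codonsByAA).map(([aa, codons]) => [aa, codons.length])
);
console.log({ senseCount, stopCount, degeneracy });
```

The degeneracy map makes the extremes visible at a glance: W and M have one codon each, while L, R, and S have six.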
2.2 Single-nucleotide reachability
For each of the 380 ordered (refAA, altAA) pairs with refAA ≠ altAA:
- Enumerate all codons encoding refAA and all codons encoding altAA.
- For each (refAA-codon, altAA-codon) pair, compute the Hamming distance (number of differing nucleotide positions).
- Take the minimum Hamming distance over all codon-pair combinations as the min-Hamming-distance of the AA pair.
- A pair is single-nucleotide-reachable if min-Hamming = 1.
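The steps above can be sketched as follows. This is a minimal illustration, assuming the standard code in T, C, A, G order, and is not the authors' analyze.js; `hamming`, `minHamming`, and `tally` are hypothetical names. Run as written, it should reproduce the 150 / 202 / 28 split reported in this paper.

```javascript
// Standard nuclear genetic code in T, C, A, G order; "*" marks a stop codon.
const BASES = "TCAG";
const AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

// Group sense codons by amino acid (stop codons are skipped).
const codonsByAA = {};
for (let i = 0; i < 64; i++) {
  const aa = AMINO[i];
  if (aa === "*") continue;
  const codon = BASES[i >> 4] + BASES[(i >> 2) & 3] + BASES[i & 3];
  if (!codonsByAA[aa]) codonsByAA[aa] = [];
  codonsByAA[aa].push(codon);
}

// Number of positions at which two codons differ.
function hamming(a, b) {
  let d = 0;
  for (let k = 0; k < 3; k++) if (a[k] !== b[k]) d++;
  return d;
}

// Minimum Hamming distance over all (refAA-codon, altAA-codon) combinations.
function minHamming(refAA, altAA) {
  let best = 3;
  for (const r of codonsByAA[refAA]) {
    for (const a of codonsByAA[altAA]) best = Math.min(best, hamming(r, a));
  }
  return best;
}

// Tally the 380 ordered pairs (refAA != altAA) by min-Hamming distance.
const aas = Object.keys(codonsByAA);
const tally = { 1: 0, 2: 0, 3: 0 };
for (const ref of aas) {
  for (const alt of aas) {
    if (ref !== alt) tally[minHamming(ref, alt)]++;
  }
}
console.log(tally);
```

Because the Hamming distance is symmetric, each unordered pair contributes two ordered pairs, so every count is even.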
2.3 Grantham-distance binning
Standard Li-1984 Grantham bins (Grantham 1974; Li et al. 1984):
- Conservative: G < 50
- Mod-Conservative: 50 ≤ G < 100
- Mod-Radical: 100 ≤ G < 150
- Radical: G ≥ 150
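The binning rule can be written as a small helper. This sketch assumes the thresholds listed above; `granthamBin` is a hypothetical name, and the two example distances (Leu/Ile = 5, Cys/Trp = 215) are the smallest and largest entries of the Grantham (1974) matrix.

```javascript
// Map a Grantham distance to one of the four bins used in this paper.
function granthamBin(g) {
  if (g < 50) return "Conservative";
  if (g < 100) return "Mod-Conservative";
  if (g < 150) return "Mod-Radical";
  return "Radical";
}

// Example: the extremes of the Grantham (1974) matrix.
console.log(granthamBin(5));   // Leu/Ile
console.log(granthamBin(215)); // Cys/Trp
```

The lower bound of each bin is inclusive, so G = 50 falls in Mod-Conservative and G = 150 in Radical.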
2.4 ClinVar P-fraction within reachable pairs
For each of the 150 reachable pairs, count the ClinVar single-nucleotide missense variants labeled Pathogenic (P) or Benign (B) in the dbNSFP v4 (Liu et al. 2020) cache via MyVariant.info (Wu et al. 2021); stop-gain (alt = X) is excluded. Compute the per-Grantham-bin P-fraction, P / (P + B), with a Wilson 95% CI (Brown et al. 2001).
After filtering: 267,625 ClinVar single-nucleotide variants in the 150 reachable pairs.
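For reference, a Wilson score interval for a binomial proportion can be computed as in the sketch below. It uses the standard formula (see Brown et al. 2001) with z = 1.959964 for a 95% interval; the function name `wilsonInterval` is illustrative, and the example counts are the Conservative-bin counts reported in §3.4.

```javascript
// Wilson score interval for a binomial proportion (default: 95% two-sided).
function wilsonInterval(successes, n, z = 1.959964) {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  // Interval center is pulled slightly toward 0.5 relative to p.
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { p, lower: center - half, upper: center + half };
}

// Conservative bin: 17,830 Pathogenic out of 95,770 variants.
const conservative = wilsonInterval(17830, 95770);
console.log(
  (100 * conservative.p).toFixed(2),
  (100 * conservative.lower).toFixed(2),
  (100 * conservative.upper).toFixed(2)
);
```

At these sample sizes the Wilson interval is nearly symmetric around the point estimate, which is why the reported intervals are narrow.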
3. Results
3.1 The 150-of-380 reachable pair count
- Single-nucleotide-reachable pairs (min-Hamming = 1): 150 of 380 = 39.47%.
- Unreachable (min-Hamming ≥ 2): 230 of 380 = 60.53%.
- Of the 230 unreachable: min-Hamming-2: 202 pairs (require 2 changes); min-Hamming-3: 28 pairs (require 3 changes — codon-distance-maximal).
3.2 The 28 maximally-distant pairs (min-Hamming = 3)
The 28 ordered pairs that require all 3 codon positions to change to convert refAA to altAA are:
- C → E, C → K, C → M, C → Q (Cys to Glu, Lys, Met, or Gln)
- D → M, D → W (Asp to Met or Trp)
- E → C, E → F (Glu to Cys or Phe)
- F → E, F → K, F → Q (Phe to Glu, Lys, or Gln)
- H → M, H → W (His to Met or Trp)
- I → W (Ile to Trp)
- K → C, K → F (Lys to Cys or Phe)
- M → C, M → D, M → H, M → Y (Met to Cys, Asp, His, or Tyr)
- N → W (Asn to Trp)
- Q → C, Q → F (Gln to Cys or Phe)
- W → D, W → H, W → I, W → N (Trp to Asp, His, Ile, or Asn)
- Y → M (Tyr to Met)
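These pairs cluster around Trp, Cys, and Met: 22 of the 28 ordered pairs involve at least one of them. The common factor is that W, C, and M each have only 1-2 codons, so the codon-distance to other AAs' codons is large. The remaining six pairs (E↔F, F↔K, F↔Q in both directions) involve Phe, which also has only two codons (TTT, TTC).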
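The maximally-distant pairs can be enumerated directly from the codon table. The sketch below is a minimal illustration under the same assumptions as the method (standard nuclear code, minimum Hamming distance over all codon combinations); `maxDistPairs` and `withWCM` are hypothetical names. It lists unordered pairs, so each entry stands for two ordered pairs.

```javascript
// Standard nuclear genetic code in T, C, A, G order; "*" marks a stop codon.
const BASES = "TCAG";
const AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

const codonsByAA = {};
for (let i = 0; i < 64; i++) {
  const aa = AMINO[i];
  if (aa === "*") continue;
  const codon = BASES[i >> 4] + BASES[(i >> 2) & 3] + BASES[i & 3];
  if (!codonsByAA[aa]) codonsByAA[aa] = [];
  codonsByAA[aa].push(codon);
}

// Minimum number of differing positions between any codon of x and any codon of y.
function minHamming(x, y) {
  let best = 3;
  for (const a of codonsByAA[x]) {
    for (const b of codonsByAA[y]) {
      let d = 0;
      for (let k = 0; k < 3; k++) if (a[k] !== b[k]) d++;
      best = Math.min(best, d);
    }
  }
  return best;
}

// Collect unordered amino-acid pairs whose codons always differ at all 3 positions.
const aas = Object.keys(codonsByAA).sort();
const maxDistPairs = [];
for (let i = 0; i < aas.length; i++) {
  for (let j = i + 1; j < aas.length; j++) {
    if (minHamming(aas[i], aas[j]) === 3) maxDistPairs.push(aas[i] + aas[j]);
  }
}

// Pairs with at least one of W, C, M.
const withWCM = maxDistPairs.filter((pair) => /[WCM]/.test(pair));
console.log(maxDistPairs.join(" "));
console.log(withWCM.length + " of " + maxDistPairs.length + " involve W, C, or M");
```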
3.3 The Grantham-bin enrichment of reachable pairs
| Grantham bin | All pairs | Reachable | Reachable-fraction |
|---|---|---|---|
| Conservative (G < 50) | 64 | 38 | 59.38% |
| Mod-Conservative (50-99) | 140 | 58 | 41.43% |
| Mod-Radical (100-149) | 114 | 34 | 29.82% |
| Radical (G ≥ 150) | 62 | 20 | 32.26% |
| Total | 380 | 150 | 39.47% |
The Conservative bin has 1.84× the reachable-fraction of the Radical bin (59.38% / 32.26%). The intermediate bins fall in between (41.43% Mod-Conservative; 29.82% Mod-Radical). The pattern is monotonic-decreasing in chemistry-distance up to Mod-Radical, with a slight uptick at Radical.
This is the error-minimization property of the standard genetic code at the per-pair-bin level: the code is structured such that the AA-pair substitutions reachable by single nucleotide changes are biased toward chemistry-conservative substitutions.
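The reachable-fractions and the 1.84× ratio follow directly from the counts in the table above. The short sketch below recomputes them; the `bins` array simply transcribes the table, and the variable names are illustrative.

```javascript
// Counts transcribed from the Grantham-bin table (ordered AA pairs).
const bins = [
  { name: "Conservative", all: 64, reachable: 38 },
  { name: "Mod-Conservative", all: 140, reachable: 58 },
  { name: "Mod-Radical", all: 114, reachable: 34 },
  { name: "Radical", all: 62, reachable: 20 },
];

// Reachable-fraction per bin, plus column totals as a consistency check.
const fraction = Object.fromEntries(bins.map((b) => [b.name, b.reachable / b.all]));
const totalAll = bins.reduce((sum, b) => sum + b.all, 0);
const totalReachable = bins.reduce((sum, b) => sum + b.reachable, 0);

// Conservative-to-Radical ratio of reachable-fractions.
const ratio = fraction["Conservative"] / fraction["Radical"];

for (const b of bins) {
  console.log(b.name, (100 * fraction[b.name]).toFixed(2) + "%");
}
console.log("total", totalReachable + " / " + totalAll, "ratio", ratio.toFixed(2));
```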
3.4 The ClinVar empirical P-fraction within reachable pairs
Restricted to the 150 reachable pairs, the per-Grantham-bin ClinVar P-fraction:
| Grantham bin | Pathogenic | Benign | N | P-fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| Conservative (< 50) | 17,830 | 77,940 | 95,770 | 18.62% | [18.37, 18.87] |
| Mod-Conservative (50-99) | 28,909 | 78,125 | 107,034 | 27.01% | [26.74, 27.28] |
| Mod-Radical (100-149) | 18,599 | 23,276 | 41,875 | 44.42% | [43.94, 44.89] |
| Radical (≥ 150) | 11,435 | 11,511 | 22,946 | 49.83% | [49.19, 50.48] |
Within reachable pairs, the per-bin P-fraction is monotonic in Grantham distance from 18.62% (Conservative) to 49.83% (Radical) — a 2.68× ratio. The pattern is consistent with the Grantham distance carrying predictive signal for variant Pathogenicity.
3.5 The combined picture
The 150 reachable pairs are enriched for chemistry-conservative substitutions (Conservative bin reachable-fraction 59.4% vs Radical 32.3%). This means the AA substitutions that ClinVar single-nucleotide variants can produce are biased toward functionally-tolerable substitutions by genetic-code architecture.
Of the variants that DO occur in reachable pairs:
- Those in the Conservative bin (over-represented) have low P-fraction (18.62%).
- Those in the Radical bin (under-represented but not absent) have high P-fraction (49.83%).
The combination produces the global ClinVar single-nucleotide missense P-fraction of 28.7% observed in our cache: a weighted average of the per-bin P-fractions, in which the Mod-Conservative bin (40.0% of variants) and the Conservative bin (35.8%) carry the largest weights.
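The pooled figure can be checked from the per-bin counts in §3.4. The sketch below recomputes the overall P-fraction and each bin's share of the variants; `clinvarBins` transcribes the table, and the names are illustrative.

```javascript
// Per-bin Pathogenic and Benign counts transcribed from the table in §3.4.
const clinvarBins = [
  { name: "Conservative", pathogenic: 17830, benign: 77940 },
  { name: "Mod-Conservative", pathogenic: 28909, benign: 78125 },
  { name: "Mod-Radical", pathogenic: 18599, benign: 23276 },
  { name: "Radical", pathogenic: 11435, benign: 11511 },
];

const totalP = clinvarBins.reduce((sum, b) => sum + b.pathogenic, 0);
const totalN = clinvarBins.reduce((sum, b) => sum + b.pathogenic + b.benign, 0);

// Pooled P-fraction equals the N-weighted average of the per-bin P-fractions.
const pooled = totalP / totalN;

// Share of all variants contributed by each bin (the weights in that average).
const weights = clinvarBins.map((b) => ({
  name: b.name,
  weight: (b.pathogenic + b.benign) / totalN,
}));

console.log("pooled P-fraction:", (100 * pooled).toFixed(1) + "%");
for (const w of weights) {
  console.log(w.name, (100 * w.weight).toFixed(1) + "%");
}
```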
3.6 Implications for variant interpretation
The 230 unreachable AA-pair substitutions cannot occur as ClinVar single-nucleotide variants. Variants in these AA-pair classes are observed only in:
- Multi-nucleotide variants (MNVs): rare events where 2-3 adjacent nucleotides change simultaneously.
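- In-cis combinations: two single-nucleotide variants in the same codon on the same allele. (Compound heterozygosity, with the variants on opposite alleles, does not change the encoded amino acid in this way.)
- Deletion-insertion (delins) events: these can replace a codon with an arbitrary one but are not single-nucleotide substitutions.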
The error-minimization structure implies that the ClinVar single-nucleotide P-fraction is likely lower than it would be if all 380 AA substitutions were equally accessible: the genetic-code-imposed bias toward Conservative substitutions depresses the Pathogenic-fraction.
For variant prioritization: variants in the reachable-but-Radical subset (G ≥ 150; e.g., C↔W, C↔Y, V↔D) occupy the top of the chemistry-distance distribution despite being single-nucleotide-reachable, and have a ~50% P-fraction. They are therefore high-value targets for clinical interpretation.
3.7 The W, C, M concentration of unreachable pairs
The 28 maximally-distant (min-Hamming = 3) pairs are concentrated in W, C, M-involving substitutions because these AAs have few codons:
- W has only 1 codon (TGG).
- M has only 1 codon (ATG).
- C has 2 codons (TGT, TGC).
With a single codon each, W and M have only 9 single-nucleotide neighbor codons in total, so few other AAs lie within one change of them. This is a structural property of the genetic code.
For ClinVar interpretation: substitutions involving W, M, or C are under-represented in single-nucleotide variant data because of codon-distance constraints, even when these substitutions would be highly Pathogenic if observed.
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 The reachability metric is purely structural
The min-Hamming-distance metric depends only on the genetic code structure and the AA-pair identity. It is independent of ClinVar curator labels and any predictor scores. The 150-of-380 finding is a deterministic property of the standard genetic code.
4.3 The Grantham scores are external
Grantham distances are from the original Grantham (1974) scale. The chemistry-bin-enrichment finding is independent of any modern predictor.
4.4 The standard genetic code is human-applicable
We use the standard nuclear genetic code (applies to most human genes). Mitochondrial genes use slight code variations (e.g., AGA = stop in mtDNA, not Arg). Mitochondrial variants in ClinVar (smaller subset) may have slightly different reachability properties.
4.5 ClinVar curator labels are not gold-standard
Some labels are wrong. The reported P-fractions reflect curator-assigned data.
4.6 The reachable-pair coverage in ClinVar may be sparse for rare pairs
Some reachable pairs have small ClinVar variant counts (e.g., W-involving pairs have low N because W is the rarest AA). The per-bin P-fractions are well-supported (smallest bin n = 22,946) but per-pair P-fractions for rare pairs may have wider CIs.
4.7 Multi-nucleotide variants are not in our dataset
ClinVar contains multi-nucleotide variants (MNVs) that can produce unreachable AA-pair substitutions. Our dataset is single-nucleotide variants only; substitutions among the 230 unreachable pairs could appear only if MNVs were included.
5. Implications
- The standard genetic code limits ClinVar single-nucleotide missense variants to 150 of 380 possible AA substitutions (39.47% reachable; 60.53% unreachable).
- Conservative pairs (Grantham < 50) are 1.84× as likely to be single-nucleotide-reachable as Radical pairs (59.38% vs 32.26%), consistent with the classical genetic-code error-minimization property.
- 28 AA pairs are maximally-distant (min-Hamming = 3), concentrated in W, C, M-involving substitutions because these AAs have only one or two codons.
- Within reachable pairs, the ClinVar P-fraction monotonically increases with Grantham distance from 18.62% (Conservative) to 49.83% (Radical).
- For variant interpretation: the 230 unreachable AA-pair substitutions appear only in multi-nucleotide variants, in-cis combinations of single-nucleotide variants within one codon, or deletion-insertion contexts, not in single-nucleotide variant data.
6. Limitations
- Stop-gain excluded (§4.1).
- Reachability metric is structural (§4.2) — non-circular by construction.
- Grantham scores are external to ClinVar (§4.3).
- Standard genetic code applies to most human genes; mitochondrial variants slightly differ (§4.4).
- ClinVar labels not gold-standard (§4.5).
- Some reachable pairs have small N (§4.6).
- Multi-nucleotide variants not in dataset (§4.7) — would extend the analyzable subset.
7. Reproducibility
- Script: analyze.js (Node.js, ~80 LOC; embeds the canonical genetic code and Grantham matrix).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with reachable / unreachable counts, per-Grantham-bin reachable-fraction, and ClinVar per-bin P-fractions with Wilson 95% CIs.
- Verification mode: 5 machine-checkable assertions: (a) 150 reachable pairs; (b) 230 unreachable; (c) Conservative bin reachable-fraction > 50%; (d) Radical bin reachable-fraction < 40%; (e) Conservative-bin / Radical-bin ratio > 1.5×.
node analyze.js
node analyze.js --verify
8. References
- Crick, F. H. C. (1968). The origin of the genetic code. J. Mol. Biol. 38, 367–379.
- Woese, C. R. (1965). On the evolution of the genetic code. Proc. Natl. Acad. Sci. USA 54, 1546–1552.
- Freeland, S. J., & Hurst, L. D. (1998). The genetic code is one in a million. J. Mol. Evol. 47, 238–248.
- Higgs, P. G. (2009). A four-column theory for the origin of the genetic code: tracing the evolutionary pathways that gave rise to an optimized code. Biol. Direct 4, 16.
- Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science 185, 862–864.
- Li, W. H., Wu, C. I., & Luo, C. C. (1984). Nonrandomness of point mutation as reflected in nucleotide substitutions in pseudogenes and its evolutionary implications. J. Mol. Evol. 21, 58–71.
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.