Among 7 Histidine-Reference Substitution Pairs in ClinVar Missense Variants With ≥100 Records: His→Pro Is the Most Pathogenic-Enriched (54.0% Pathogenic, Wilson 95% CI [49.7, 58.2]) and His→Gln Is the Least (22.5% [20.1, 25.1]) — A 2.4× Range Within the Same Reference Amino Acid
Abstract
We compute the Pathogenic fraction for each of the 7 Histidine-reference (His, H) substitution pairs with ≥100 ClinVar missense single-nucleotide variants in the dbNSFP v4 (Liu et al. 2020) annotation of 372,927 ClinVar Pathogenic+Benign records (Landrum et al. 2018) returned by MyVariant.info (Wu et al. 2021), with Wilson 95% confidence intervals (Wilson 1927) on each per-pair fraction. Stop-gain (aa.alt = X) is explicitly excluded. Result: per-target-AA Pathogenic fractions span a 2.4× range from 22.5% (H → Q) to 54.0% (H → P): H→P 54.0% Wilson CI [49.7, 58.2]; H→D 44.4% [39.5, 49.4]; H→L 41.8% [36.9, 47.0]; H→R 27.4% [25.5, 29.3]; H→Y 26.6% [24.5, 28.7]; H→N 24.5% [20.3, 29.1]; H→Q 22.5% [20.1, 25.1]. Chemistry interpretation: the most Pathogenic-enriched alt AAs are proline (helix-breaker; disrupts secondary structure regardless of the His position context), aspartate (charge inversion: replaces the partially positive His side chain with the fully negative Asp side chain), and leucine (charge loss plus introduction of a bulky hydrophobic residue). The least Pathogenic-enriched are glutamine (polar but uncharged; little chemistry change beyond loss of the partial positive charge), asparagine (similar to Gln; a smaller polar substitution), tyrosine (aromatic with hydroxyl; preserves a ring but loses charge), and arginine (basic-to-basic conservative substitution). H → R at 27.4% Pathogenic is the one His substitution in this set that preserves basic character, analogous to R → K being the most-Benign Arg substitution observed in independent per-AA analyses. For variant-prioritization pipelines: an observed H → P substitution carries a 54% Pathogenic prior and H → Q only 22.5%, a 2.4× difference in prior within the same reference AA. In this dataset, Histidine pathogenicity is dominated by introduction of proline (helix-breaker) or charge inversion to aspartate; the conservative replacements (R, Y, N, Q) are comparatively well tolerated.
1. Background
Histidine (His, H) is a partially positively charged amino acid (side-chain pK_a ≈ 6.0–6.5; by the Henderson-Hasselbalch relation, roughly 4–10% protonated at physiological pH 7.4). His residues are unique among the 20 standard amino acids in their proton-buffering role near physiological pH and play essential roles in:
- Acid-base catalysis at enzyme active sites (e.g., the catalytic His in serine-protease catalytic triads; the proton shuttle His in carbonic anhydrase).
- Metal coordination: the imidazole ring is a strong ligand for Zn²⁺, Cu²⁺, Fe²⁺, Mg²⁺ (e.g., Zn²⁺ in carbonic anhydrase, Fe²⁺ in heme proteins).
- pH-sensitive structural switches (e.g., His-mediated subunit dissociation in influenza HA at low pH).
His is one of two "ambiguous-class" amino acids (with Cys) whose chemistry partly overlaps multiple categories (basic + polar + aromatic-ring-containing). This paper measures the per-target-AA Pathogenic-fraction distribution within the His-reference subset.
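The protonated fraction quoted for the His side chain can be checked with the Henderson-Hasselbalch relation. The sketch below is illustrative only: the pK_a values in the loop are assumptions, since the imidazole pK_a in a folded protein shifts with the local environment.

```javascript
// Fraction of a basic ionizable group that is protonated at a given pH,
// from the Henderson-Hasselbalch relation: f = 1 / (1 + 10^(pH - pKa)).
function protonatedFraction(pKa, pH) {
  return 1 / (1 + Math.pow(10, pH - pKa));
}

// Illustrative (assumed) pKa values for the imidazole side chain.
for (const pKa of [6.0, 6.5, 7.0]) {
  const f = protonatedFraction(pKa, 7.4);
  console.log(`pKa ${pKa.toFixed(1)}: ${(100 * f).toFixed(1)}% protonated at pH 7.4`);
}
```

The mean side-chain charge at a given pH equals this fraction in units of e, which is why the charge estimate depends strongly on the pK_a assumed.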
2. Method
We take ClinVar missense variants (alt ≠ X) from MyVariant.info / dbNSFP v4, restrict to dbnsfp.aa.ref = H, group by alt AA, and require ≥100 total records (Pathogenic + Benign) per pair. We compute a Wilson 95% CI on each per-pair Pathogenic fraction. The "Mean rel pos" column reported in result.json is the per-pair mean of aa.pos / protein_length across all Pathogenic variants in the pair, a summary of where along the protein these substitutions occur. We report it for completeness; for the His-reference set, all per-pair mean relative positions fall between 0.47 and 0.53 (close to the protein midpoint).
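A minimal sketch of the filtering and grouping step, assuming each cached record has already been flattened to the hypothetical shape { ref, alt, label, relPos }. The function name and record layout are illustrative; the actual analyze.js and the raw MyVariant.info field layout may differ.

```javascript
// Hypothetical flattened record: { ref, alt, label, relPos }
//   ref/alt : one-letter amino acids (alt "X" = stop-gain)
//   label   : "P" (Pathogenic) or "B" (Benign)
//   relPos  : aa.pos / protein_length
function summarizeHisPairs(records, minTotal = 100) {
  const groups = new Map();
  for (const r of records) {
    if (r.ref !== "H") continue;                     // His-reference only
    if (r.alt === "X" || r.alt === r.ref) continue;  // drop stop-gain and synonymous
    let g = groups.get(r.alt);
    if (!g) {
      g = { alt: r.alt, nP: 0, nB: 0, relPosSum: 0 };
      groups.set(r.alt, g);
    }
    if (r.label === "P") {
      g.nP += 1;
      g.relPosSum += r.relPos;                       // mean rel pos uses Pathogenic only
    } else if (r.label === "B") {
      g.nB += 1;
    }
  }
  return [...groups.values()]
    .map((g) => {
      const total = g.nP + g.nB;
      return {
        alt: g.alt,
        nP: g.nP,
        nB: g.nB,
        total,
        pFraction: total > 0 ? g.nP / total : NaN,
        meanRelPos: g.nP > 0 ? g.relPosSum / g.nP : NaN,
      };
    })
    .filter((g) => g.total >= minTotal)              // N-threshold per pair
    .sort((a, b) => b.pFraction - a.pFraction);      // sorted descending
}
```

Passing a smaller minTotal makes the function convenient for checking how sensitive the reported set of pairs is to the ≥100 threshold.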
3. Results
3.1 Per-target-AA Pathogenic fraction (sorted descending)
| H → alt | n_P | n_B | total | Pathogenic fraction | Wilson 95% CI | Mean rel pos |
|---|---|---|---|---|---|---|
| H → P | 278 | 237 | 515 | 54.0% | [49.7, 58.2] | 0.491 |
| H → D | 170 | 213 | 383 | 44.4% | [39.5, 49.4] | 0.530 |
| H → L | 151 | 210 | 361 | 41.8% | [36.9, 47.0] | 0.515 |
| H → R | 599 | 1,590 | 2,189 | 27.4% | [25.5, 29.3] | 0.526 |
| H → Y | 457 | 1,264 | 1,721 | 26.6% | [24.5, 28.7] | 0.533 |
| H → N | 90 | 278 | 368 | 24.5% | [20.3, 29.1] | 0.473 |
| H → Q | 246 | 846 | 1,092 | 22.5% | [20.1, 25.1] | 0.497 |
The 7 His-derived pairs span a 2.4× range (54.0 / 22.5 = 2.4×) in Pathogenic fraction.
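The intervals in the table can be recomputed from the counts alone. The sketch below implements the standard Wilson score interval with z = 1.96; the function name is illustrative and this is not necessarily how analyze.js is written.

```javascript
// Wilson score interval for a binomial proportion (z = 1.96 for ~95%).
function wilsonInterval(successes, total, z = 1.96) {
  if (total <= 0) throw new RangeError("total must be positive");
  const p = successes / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return { p, lower: center - half, upper: center + half };
}

// Counts copied from the table in §3.1 (nP = Pathogenic, total = nP + nB).
const rows = [
  { alt: "P", nP: 278, total: 515 },
  { alt: "D", nP: 170, total: 383 },
  { alt: "L", nP: 151, total: 361 },
  { alt: "R", nP: 599, total: 2189 },
  { alt: "Y", nP: 457, total: 1721 },
  { alt: "N", nP: 90, total: 368 },
  { alt: "Q", nP: 246, total: 1092 },
];

const pct = (x) => (100 * x).toFixed(1);
for (const { alt, nP, total } of rows) {
  const { p, lower, upper } = wilsonInterval(nP, total);
  console.log(`H->${alt}: ${pct(p)}% [${pct(lower)}, ${pct(upper)}]`);
}
```

For H → P (278 of 515) this gives 54.0% with an interval of approximately [49.7, 58.2], in agreement with the table.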
3.2 The chemistry-class ranking
Tier 1 — Most Pathogenic His substitutions (P-fraction > 40%):
- H → P (54.0%): Proline introduction is a helix-breaker; disrupts secondary structure regardless of His's pre-substitution chemistry context.
- H → D (44.4%): Charge inversion. Replaces the partially positive imidazole (mean charge below about +0.1 e at pH 7.4) with a fully negative carboxylate (-1.0 e). The largest electrostatic change within the H-derived set.
- H → L (41.8%): Charge loss + introduction of bulky hydrophobic residue. Disrupts His's polar / partial-charge character.
Tier 2 — Less-Pathogenic His substitutions (P-fraction 22–28%):
- H → R (27.4%): Basic-to-basic conservative substitution. Retains basic character, although Arg (pK_a ≈ 12) is fully protonated at physiological pH whereas His is only partially so. The only His-derived substitution in this set that keeps a basic side chain.
- H → Y (26.6%): Aromatic-to-aromatic substitution preserving the ring structure (Tyr's phenol ring vs His's imidazole). Loses charge and metal-coordination capability but preserves geometry.
- H → N (24.5%): Polar amide; minimal chemistry change beyond loss of partial positive charge and aromatic-ring character.
- H → Q (22.5%): Polar amide (one CH₂ longer than Asn); chemistry-conservative substitution within the polar-uncharged class. The most-Benign His-derived substitution.
3.3 The H → Q most-Benign signal
H → Q at 22.5% Pathogenic is the most-Benign His-derived substitution. Mechanism:
- Both His and Gln have polar side chains capable of H-bonding.
- Gln's amide group can substitute for His's imidazole H-bond donor function in many positions.
- The chemistry change is small (loss of partial positive charge; loss of aromatic ring; gain of one amide).
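- Functional consequences are therefore expected to be limited in many contexts, although a 22.5% Pathogenic fraction shows the substitution is far from neutral.
The high Benign count (846) likely reflects standing population variation, with H → Q appearing as a common variant in many genes; this is not tested here.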
3.4 The H → P most-Pathogenic signal
H → P at 54.0% Pathogenic is the most-Pathogenic His-derived substitution. Mechanism:
- Proline's pyrrolidine ring constrains the backbone φ angle and removes the backbone amide hydrogen (MacArthur & Thornton 1991), which disrupts α-helix and β-sheet geometry.
- The pre-substitution chemistry (His's imidazole / partial positive charge) is irrelevant; the disruption is structural, not electrostatic.
- The H → P pair is a single-nucleotide A>C transversion at the second codon position (CAY → CCY, where Y = T/C); the mutational rate is moderate, neither CpG-elevated nor extremely rare.
The 54% Pathogenic fraction is in the same range as other "X → P" substitutions in companion per-target-AA analyses (e.g., Arg → Pro at 63.1%), consistent with proline introduction being Pathogenic-enriched across reference AAs.
3.5 Mean relative positions are similar across pairs
All 7 His-derived pairs have a mean relative position between 0.47 and 0.53 (close to the midpoint, 0.50), so no pair shows a strong position bias among its Pathogenic variants. A mean near 0.50 does not by itself establish a uniform distribution along the protein, but it suggests that the per-pair Pathogenic-fraction differences are not driven by position.
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter out variants with alt = X (stop-gain). Reported numbers are missense-only.
4.2 ClinVar curatorial bias
His Pathogenic variants are over-reported in well-studied disease genes that contain catalytic or metal-coordinating His residues (e.g., metallopeptidases, carbonic anhydrases, kinases with His regulatory residues). The per-pair Pathogenic fractions therefore partly reflect curation focus on these gene families rather than a generic His-pathogenicity rule across all genes.
4.3 Codon-mutability not normalized
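His has 2 codons (CAT, CAC), written CAY. All 7 reported alt AAs are reachable by a single-nucleotide change from CAY, but the mutational rates differ. H → R (CAY → CGY, A>G) and H → Y (CAY → TAY, C>T) are transitions, which typically occur at higher rates than transversions. H → Q (CAY → CAA/CAG), H → N (CAY → AAY), H → P (CAY → CCY), H → D (CAY → GAY), and H → L (CAY → CTY) are transversions; H → Q is reachable by two different third-position transversions per codon, the others by one. The high-N pairs (H → R, H → Y, H → Q) therefore reflect both biological tolerance and mutational accessibility.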
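The codon accessibility of each pair can be checked by enumerating every single-nucleotide neighbour of the two His codons under the standard genetic code. This is an illustrative sketch; the helper names are hypothetical and not part of analyze.js.

```javascript
// Standard genetic code, indexed with bases in the order T, C, A, G.
const BASES = "TCAG";
const AMINO =
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

function translate(codon) {
  const [i, j, k] = [...codon].map((b) => BASES.indexOf(b));
  return AMINO[16 * i + 4 * j + k];
}

// Purine<->purine or pyrimidine<->pyrimidine is a transition; otherwise transversion.
const PURINES = new Set(["A", "G"]);
function substitutionClass(from, to) {
  return PURINES.has(from) === PURINES.has(to) ? "transition" : "transversion";
}

function singleNucleotideNeighbours(codon) {
  const out = [];
  for (let pos = 0; pos < 3; pos++) {
    for (const b of BASES) {
      if (b === codon[pos]) continue;
      const mutant = codon.slice(0, pos) + b + codon.slice(pos + 1);
      out.push({
        codon,
        mutant,
        position: pos + 1,
        change: `${codon[pos]}>${b}`,
        kind: substitutionClass(codon[pos], b),
        aa: translate(mutant),
      });
    }
  }
  return out;
}

const hisNeighbours = ["CAT", "CAC"].flatMap(singleNucleotideNeighbours);
const missense = hisNeighbours.filter((n) => n.aa !== "H" && n.aa !== "*");
const missenseTargets = [...new Set(missense.map((n) => n.aa))].sort();

console.log(`SNV-accessible missense targets from His: ${missenseTargets.join(", ")}`);
for (const n of missense) {
  console.log(`${n.codon} -> ${n.mutant} (${n.change}, pos ${n.position}, ${n.kind}): H -> ${n.aa}`);
}
```

The enumeration yields exactly seven missense targets (D, L, N, P, Q, R, Y), the same seven pairs that pass the N ≥ 100 threshold, with only H → R and H → Y arising from transitions.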
4.4 Per-isoform first-element AA
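For each variant we use the first finite element of dbnsfp.aa.ref and dbnsfp.aa.alt, which can list one value per transcript isoform. For roughly 5% of variants the amino-acid assignment differs between isoforms, so the first element may not correspond to the clinically relevant transcript.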
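A sketch of the first-element rule, assuming the field may arrive either as a single string or as an array with one entry per isoform, possibly containing placeholders such as ".". Both function names are hypothetical, and the exact field shape should be checked against the cached records.

```javascript
// Return the first usable one-letter amino-acid code from a field that may be
// a single string or an array with one entry per transcript isoform.
function firstAminoAcid(value) {
  const items = Array.isArray(value) ? value : [value];
  for (const item of items) {
    if (typeof item === "string" && /^[A-Z]$/.test(item)) return item;
  }
  return null; // nothing usable (missing field or only placeholders)
}

// Flag variants whose isoforms disagree, so the mismatch rate can be measured
// rather than assumed.
function isoformsDisagree(value) {
  if (!Array.isArray(value)) return false;
  const letters = new Set(
    value.filter((v) => typeof v === "string" && /^[A-Z]$/.test(v))
  );
  return letters.size > 1;
}

console.log(firstAminoAcid([".", "H", "R"])); // first usable entry
console.log(isoformsDisagree(["H", "R"]));    // isoforms give different residues
```

Note that "X" passes the one-letter check, so stop-gain records still need to be removed by the alt ≠ X filter.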
4.5 N-threshold sensitivity
We use ≥100 total per pair. His-derived substitutions with < 100 records (H → A, H → S, H → V, H → I, H → T, H → F, H → C, H → G, H → K, H → M, H → W) are not analyzed. Each requires at least two nucleotide changes from a His codon (CAY), so these pairs are rare among single-nucleotide variants.
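The codon distance of the excluded pairs can be checked by computing, for each target AA, the minimum number of nucleotide differences between any of its codons and either His codon. This is an illustrative sketch using the standard genetic code; the names are hypothetical.

```javascript
// Standard genetic code, indexed with bases in the order T, C, A, G.
const BASES = "TCAG";
const AMINO =
  "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
const HIS_CODONS = ["CAT", "CAC"];

// Build amino acid -> list of codons.
const codonsByAminoAcid = new Map();
for (let i = 0; i < 64; i++) {
  const codon = BASES[i >> 4] + BASES[(i >> 2) & 3] + BASES[i & 3];
  const aa = AMINO[i];
  if (!codonsByAminoAcid.has(aa)) codonsByAminoAcid.set(aa, []);
  codonsByAminoAcid.get(aa).push(codon);
}

// Number of positions at which two codons differ.
function hamming(a, b) {
  let d = 0;
  for (let i = 0; i < a.length; i++) if (a[i] !== b[i]) d++;
  return d;
}

function minChangesFromHis(aa) {
  let best = Infinity;
  for (const target of codonsByAminoAcid.get(aa)) {
    for (const his of HIS_CODONS) best = Math.min(best, hamming(his, target));
  }
  return best;
}

// Target AAs listed as excluded in this section.
const excluded = [..."ASVITFCGKMW"];
for (const aa of excluded) {
  console.log(`H -> ${aa}: minimum ${minChangesFromHis(aa)} nucleotide changes`);
}
```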
4.6 Wilson CI assumes binomial sampling
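Per-pair counts are treated as binomial, i.e., each record is an independent Pathogenic/Benign draw. Under that assumption the Wilson 95% CI is appropriate (Brown et al. 2001); clustering of variants within genes would make the intervals somewhat too narrow.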
4.7 ACMG-PP3/BP4 partial circularity
ClinVar Pathogenic / Benign labels are partly predictor-derived (PolyPhen / SIFT scores used as PP3 evidence). Some per-pair fractions reflect predictor-curator co-variance.
5. Implications
- Among 7 His-derived substitution pairs, H → P is the most Pathogenic-enriched at 54.0% (Wilson CI [49.7, 58.2]) — driven by proline's helix-breaking property.
- H → Q is the least Pathogenic-enriched at 22.5% [20.1, 25.1] — a chemistry-conservative polar-amide substitution.
- The 2.4× per-target-AA range within His-reference indicates substantial variation in pathogenicity priors that tracks side-chain chemistry, subject to the confounds in §4.
- For variant-prioritization pipelines: per-target-AA priors within His should be applied; H → P ~54%, H → Q ~22.5%.
- The H → R basic-to-basic conservative substitution at 27.4% Pathogenic is consistent with the broader pattern that within-chemistry-class substitutions are well-tolerated.
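One way a prioritization pipeline might consume these numbers is a small lookup keyed on the alt AA. The sketch below copies the fractions and Wilson bounds from the §3.1 table; the function name and return shape are hypothetical, and the values are ClinVar label fractions subject to the confounds in §4, not calibrated clinical probabilities.

```javascript
// Per-target Pathogenic fractions and Wilson 95% bounds, copied from §3.1.
const HIS_PRIORS = Object.freeze({
  P: { p: 0.540, lower: 0.497, upper: 0.582 },
  D: { p: 0.444, lower: 0.395, upper: 0.494 },
  L: { p: 0.418, lower: 0.369, upper: 0.470 },
  R: { p: 0.274, lower: 0.255, upper: 0.293 },
  Y: { p: 0.266, lower: 0.245, upper: 0.287 },
  N: { p: 0.245, lower: 0.203, upper: 0.291 },
  Q: { p: 0.225, lower: 0.201, upper: 0.251 },
});

// Returns the prior for an H -> alt substitution, or null when the pair was
// not analyzed (fewer than 100 records), rather than extrapolating.
function hisSubstitutionPrior(alt) {
  return Object.prototype.hasOwnProperty.call(HIS_PRIORS, alt)
    ? HIS_PRIORS[alt]
    : null;
}

const hp = hisSubstitutionPrior("P");
const hq = hisSubstitutionPrior("Q");
console.log(`H->P / H->Q prior ratio: ${(hp.p / hq.p).toFixed(1)}x`);
console.log(hisSubstitutionPrior("K")); // unanalyzed pair
```

Returning null for unanalyzed pairs keeps the caller from silently applying a prior the data do not support.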
6. Limitations
- Stop-gain excluded (§4.1).
- ClinVar curatorial bias (§4.2) toward catalytic / metal-coordinating gene families.
- No codon-mutability normalization (§4.3).
- Per-isoform first-element AA (§4.4).
- N-threshold ≥ 100 (§4.5) excludes pairs that need two or more nucleotide changes.
- ACMG-PP3 partial circularity (§4.7).
7. Reproducibility
- Script: analyze.js (Node.js, ~60 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with per-target-AA counts, P-fractions, Wilson 95% CIs, mean relative positions.
- Verification mode: 6 machine-checkable assertions: (a) all P-fractions in [0, 1]; (b) Wilson CIs contain the point estimate; (c) all 7 reported pairs have N ≥ 100; (d) H→P P-fraction > 0.5; (e) H→Q P-fraction < 0.3; (f) sample sizes match input file contents.
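The verification assertions could look roughly like the sketch below. The result layout ({ pairs: [{ alt, nP, nB, total, pFraction, ciLower, ciUpper }] }) and the function names are assumptions, not the actual result.json schema. Check (f) needs the input cache, so here it is replaced by an internal-consistency check that nP + nB equals total.

```javascript
// Throw with a readable message when a check fails.
function check(condition, message) {
  if (!condition) throw new Error(`verification failed: ${message}`);
}

// Assumed layout: { pairs: [{ alt, nP, nB, total, pFraction, ciLower, ciUpper }] }
function verifyResult(result) {
  const byAlt = new Map(result.pairs.map((r) => [r.alt, r]));
  for (const r of result.pairs) {
    const tag = `H->${r.alt}`;
    check(r.pFraction >= 0 && r.pFraction <= 1, `${tag}: P-fraction in [0, 1]`);      // (a)
    check(r.ciLower <= r.pFraction && r.pFraction <= r.ciUpper, `${tag}: CI contains estimate`); // (b)
    check(r.total >= 100, `${tag}: N >= 100`);                                        // (c)
    check(r.nP + r.nB === r.total, `${tag}: nP + nB = total`);                        // stand-in for (f)
  }
  check(byAlt.has("P") && byAlt.get("P").pFraction > 0.5, "H->P P-fraction > 0.5");   // (d)
  check(byAlt.has("Q") && byAlt.get("Q").pFraction < 0.3, "H->Q P-fraction < 0.3");   // (e)
  return true;
}

// Two rows from §3.1 as a usage example.
const example = {
  pairs: [
    { alt: "P", nP: 278, nB: 237, total: 515, pFraction: 278 / 515, ciLower: 0.497, ciUpper: 0.582 },
    { alt: "Q", nP: 246, nB: 846, total: 1092, pFraction: 246 / 1092, ciLower: 0.201, ciUpper: 0.251 },
  ],
};
console.log(verifyResult(example));
```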
node analyze.js
node analyze.js --verify
8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- MacArthur, J. W., & Thornton, J. M. (1991). Influence of proline residues on protein conformation. J. Mol. Biol. 218, 397–412.
- Hodgkin, D. C. (1949). The X-ray crystallographic study of compounds of biochemical interest. Annu. Rev. Biochem. 18, 295–322. (Histidine metal-coordination structural reference.)
- Stryer, L., Berg, J. M., & Tymoczko, J. L. (2002). Biochemistry. 5th edition. (Histidine catalytic-triad reference.)
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.
- Cooper, D. N., & Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease. Hum. Genet. 85, 55–74.