Among 6 Cysteine-Reference Substitution Pairs in ClinVar Missense Variants With ≥100 Records: Cys→Trp Is the Most Pathogenic-Enriched (75.1% Pathogenic, Wilson 95% CI [70.4, 79.2]) and Cys→Ser Is the Least (57.9% [54.6, 61.1]) — A 1.30× Range Within a Uniformly High-Pathogenicity Cysteine Substitution Set
Abstract
We compute the Pathogenic fraction per substitution-target amino acid for the 6 cysteine-reference (Cys, C) substitution pairs with ≥100 ClinVar missense single-nucleotide variants in the dbNSFP v4 (Liu et al. 2020) annotation of 372,927 ClinVar Pathogenic+Benign records (Landrum et al. 2018) returned by MyVariant.info (Wu et al. 2021), with Wilson 95% confidence intervals (Wilson 1927). Stop-gain (aa.alt = X) is explicitly excluded. Cysteine has the smallest per-target-AA Pathogenic-fraction range among the reference AAs we have analyzed: only 6 (C → other) pairs meet the ≥100-record threshold, and all 6 have Pathogenic fraction ≥ 0.57, so Cys substitutions are uniformly Pathogenic-enriched relative to the corpus-level baseline of ~28% Pathogenic. Result: per-target-AA Pathogenic fractions span a 1.30× range from 57.9% (C → S) to 75.1% (C → W): C→W 75.1% Wilson CI [70.4, 79.2]; C→F 69.3% [65.7, 72.6]; C→R 68.1% [65.8, 70.4]; C→G 64.6% [60.3, 68.7]; C→Y 63.8% [61.6, 66.0]; C→S 57.9% [54.6, 61.1]. The narrow per-pair range and uniformly high Pathogenic fractions reflect the near-universal disruption caused by losing a cysteine: in folded proteins, Cys residues participate in disulfide bonds (Sevier & Kaiser 2002), metal coordination (Zn²⁺, Fe²⁺), or thiol-redox active sites, and substitution by any other amino acid removes these functional roles. The chemistry of the alt AA modulates the magnitude (W > F > R > G > Y > S), but the range is tighter than for less functionally constrained reference AAs (Arg 4.2× range, Gly 2.2× range from companion analyses). The C → S substitution at 57.9% Pathogenic (the most-Benign Cys-derived pair) reflects that Ser is the closest amino acid to Cys in chemistry (sulfhydryl → hydroxyl), preserving partial H-bonding capacity but losing disulfide formation. For variant-prioritization pipelines: each of the 6 analyzed C → alt missense substitutions carries a > 50% Pathogenic prior (the lowest, C → S, is still at 57.9%), substantially above the corpus-baseline 28%.
1. Background
Cysteine (Cys, C) is a sulfur-containing amino acid with a unique reactive side chain (-CH₂-SH). In folded human proteins, Cys residues participate in three primary functional roles:
- Disulfide bond formation (-S-S-): the most common Cys role in extracellular and secreted proteins (Sevier & Kaiser 2002). Disulfides are stabilizing covalent crosslinks essential for tertiary fold.
- Metal coordination: Cys-rich motifs (CXXC, CCCC, zinc fingers) coordinate Zn²⁺, Fe²⁺, Cu²⁺.
- Active-site thiol chemistry: catalytic Cys residues in cysteine proteases (caspases, papain), protein tyrosine phosphatases, and thiol oxidoreductases.
All three roles require the specific Cys -SH side chain; substitution by any other amino acid removes the function. This paper measures the per-target-AA Pathogenic-fraction distribution for Cys-reference substitutions.
2. Method
Identical to companion per-AA-substitution-pair analysis: ClinVar missense (alt ≠ X) variants from MyVariant.info / dbNSFP v4. Restrict to ref = C; group by alt AA; require ≥100 total per pair. Wilson 95% CI on the per-pair Pathogenic fraction.
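The filtering and grouping step can be sketched as follows. This is a minimal illustration, not the code in analyze.js: the record shape `{ ref, alt, label }`, the function name `groupCysPairs`, and the toy records are assumptions made for the example.

```javascript
// Hypothetical record shape: { ref: "C", alt: "W", label: "P" | "B" }.
// Field names are illustrative and may differ from those used by analyze.js.
function groupCysPairs(records, minTotal = 100) {
  const counts = new Map(); // alt AA -> { nP, nB }
  for (const { ref, alt, label } of records) {
    if (ref !== "C") continue;                 // Cys-reference only
    if (alt === "X" || alt === ref) continue;  // drop stop-gain and synonymous
    if (label !== "P" && label !== "B") continue;
    const c = counts.get(alt) ?? { nP: 0, nB: 0 };
    if (label === "P") c.nP += 1;
    else c.nB += 1;
    counts.set(alt, c);
  }
  const kept = [];
  for (const [alt, { nP, nB }] of counts) {
    const total = nP + nB;
    // Keep only pairs that meet the record threshold.
    if (total >= minTotal) kept.push({ alt, nP, nB, total, fraction: nP / total });
  }
  // Sort descending by Pathogenic fraction.
  return kept.sort((a, b) => b.fraction - a.fraction);
}

// Toy input with a lowered threshold so the example produces output.
const toy = [
  { ref: "C", alt: "W", label: "P" },
  { ref: "C", alt: "W", label: "P" },
  { ref: "C", alt: "W", label: "B" },
  { ref: "C", alt: "S", label: "P" },
  { ref: "C", alt: "S", label: "B" },
  { ref: "C", alt: "X", label: "P" }, // stop-gain, excluded
  { ref: "R", alt: "W", label: "P" }, // not Cys-reference, excluded
];
console.log(groupCysPairs(toy, 2));
```

With the real corpus the threshold would stay at its default of 100 records per pair.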
The "Mean rel pos" column reported in result.json is the per-pair mean of aa.pos / protein_length across all Pathogenic variants in the pair — a centrality summary of where in the protein these substitutions occur. We report it for completeness; for the Cys-reference set, all per-pair mean relative positions cluster around 0.43–0.48 (slight N-terminal skew), with no per-pair difference larger than 0.05. The mean-position values do not drive the per-pair P-fraction differences.
3. Results
3.1 Per-target-AA Pathogenic fraction (sorted descending)
| C → alt | n_P | n_B | total | Pathogenic fraction | Wilson 95% CI | Mean rel pos |
|---|---|---|---|---|---|---|
| C → W | 280 | 93 | 373 | 75.1% | [70.4, 79.2] | 0.451 |
| C → F | 469 | 208 | 677 | 69.3% | [65.7, 72.6] | 0.445 |
| C → R | 1,042 | 487 | 1,529 | 68.1% | [65.8, 70.4] | 0.476 |
| C → G | 314 | 172 | 486 | 64.6% | [60.3, 68.7] | 0.433 |
| C → Y | 1,190 | 675 | 1,865 | 63.8% | [61.6, 66.0] | 0.473 |
| C → S | 502 | 365 | 867 | 57.9% | [54.6, 61.1] | 0.443 |
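The intervals in the table can be recomputed from the n_P and n_B columns with the standard Wilson score formula. The sketch below assumes z = 1.96 for the 95% level; the function and variable names are illustrative and not taken from analyze.js.

```javascript
// Wilson score interval for k successes out of n trials (z = 1.96 for 95%).
function wilsonInterval(k, n, z = 1.96) {
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { fraction: p, lower: center - half, upper: center + half };
}

// Counts copied from the table: [alt, n_P, n_B].
const tableCounts = [
  ["W", 280, 93], ["F", 469, 208], ["R", 1042, 487],
  ["G", 314, 172], ["Y", 1190, 675], ["S", 502, 365],
];

const pct = (x) => (100 * x).toFixed(1);
for (const [alt, nP, nB] of tableCounts) {
  const { fraction, lower, upper } = wilsonInterval(nP, nP + nB);
  console.log(`C → ${alt}: ${pct(fraction)}% [${pct(lower)}, ${pct(upper)}]`);
}
```

Unlike the normal-approximation (Wald) interval, the Wilson interval is centered slightly toward 0.5 and stays inside [0, 1], which matters most for the smaller pairs such as C → W (n = 373).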
3.2 The narrow per-pair range and uniformly-high Pathogenicity
Cys-reference Pathogenic fractions are uniformly above 57%, with the highest (C → W) reaching 75.1%. The 1.30× range (75.1 / 57.9) is narrower than for Arg-reference (4.2× range, 15.0% to 63.1% from companion) and Gly-reference (2.2× range, 28.9% to 63.7% from companion).
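As a quick arithmetic check, the three range ratios follow directly from the quoted maximum and minimum percentages. The object layout below is only for illustration.

```javascript
// Max and min Pathogenic fractions (in percent) as quoted in the text.
const ranges = {
  Cys: [75.1, 57.9],
  Arg: [63.1, 15.0],
  Gly: [63.7, 28.9],
};

for (const [aa, [max, min]] of Object.entries(ranges)) {
  console.log(`${aa}: ${(max / min).toFixed(2)}×`);
}
// Cys: 1.30×
// Arg: 4.21×
// Gly: 2.20×
```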
The mechanism: Cys is functionally constrained at most positions where it appears (disulfide bonds, metal coordination, active sites). Substitution by any other amino acid removes the disulfide / metal-coordination / catalytic role, with Pathogenic consequence. The chemistry of the alt AA modulates the magnitude but does not determine whether the substitution is consequential.
3.3 The chemistry-class ranking
Tier 1 — Most-Pathogenic Cys substitutions (P-fraction > 65%):
- C → W (75.1%): introduces large bulky aromatic side chain in place of small reactive thiol. Maximum volume increase among Cys-derived pairs.
- C → F (69.3%): introduces aromatic side chain. Less volume-disruptive than W but no thiol functional replacement.
- C → R (68.1%): introduces large basic side chain. Charge introduction at typically uncharged Cys positions.
Tier 2 — Mid-range Cys substitutions (P-fraction 60–65%):
- C → G (64.6%): introduces conformational flexibility (Gly is the smallest AA). Disrupts structural roles by removing rigidity.
- C → Y (63.8%): introduces aromatic ring with hydroxyl (intermediate between W and F in volume; preserves hydroxyl H-bonding).
Tier 3 — Most-Benign Cys substitution (P-fraction < 60%):
- C → S (57.9%): replaces sulfhydryl (-SH) with hydroxyl (-OH). Preserves H-bonding capacity, polar character, and approximate side-chain volume. The chemistry is the closest replacement for Cys among the 19 alternatives. Yet still 57.9% Pathogenic — Cys substitution is rarely benign even with the chemically-closest alt AA.
3.4 The C → S "least-Pathogenic" caveat
C → S at 57.9% Pathogenic is the most-Benign Cys-derived substitution but is still substantially above the corpus-baseline 28% Pathogenic fraction. Every analyzed Cys-reference pair is enriched for Pathogenic labels: the question is not "is this substitution likely Pathogenic" but "how strongly Pathogenic is it."
The -SH → -OH substitution preserves polarity and H-bonding but removes:
- Disulfide bond formation capability.
- Metal coordination (Cys-S binds Zn²⁺ ~10⁵× more strongly than Ser-O).
- Thiol-based catalysis.
For the ~58% of C → S variants that are Pathogenic, the position likely required one of these three functions. For the ~42% that are Benign, the Cys was likely a non-functional surface residue.
3.5 Mean relative positions are similar across pairs
All 6 Cys-derived pairs have mean relative position 0.43–0.48 (slight N-terminal skew). There is no per-pair position bias for Cys-reference Pathogenic variants. The N-terminal skew (mean relative positions 0.02–0.07 below the uniform expectation of 0.50) is consistent with the disulfide-bond-rich extracellular and secreted protein subset, which often has signal peptides or N-terminal domains containing the first disulfide bridges.
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter out records with alt = X. Reported numbers are missense-only.
4.2 ClinVar curatorial bias and gene-level overrepresentation
Cys-rich functional motifs are well-studied in disease genes: zinc fingers (KLF, GLI, ZNF families), EGF-like repeats (NOTCH1, fibrillin/FBN1), kringle domains, immunoglobulin-fold domains, and laminin-like domains. Many ClinVar Cys Pathogenic variants come from these gene families. The uniformly-high Cys-reference Pathogenic fractions therefore partly reflect the functional-motif curation focus rather than a generic Cys-pathogenicity rule across all genes. A complementary analysis stratified by gene-family (e.g., disulfide-bond-rich extracellular vs cytoplasmic) would refine the per-Cys-position interpretation.
4.3 Codon-mutability not normalized
Cys has 2 codons (TGT, TGC), and per-target-AA mutation rates differ across the 6 alt AAs reported. C → R arises from TGT → CGT or TGC → CGC, a T→C transition at the first codon position and a common class of change; C → W requires a third-position transversion (TGT → TGG or TGC → TGG). We report the raw P-fraction observed in ClinVar, without normalizing for these rate differences.
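To make the codon argument concrete, the sketch below enumerates every single-nucleotide change to the two Cys codons under the standard genetic code and labels each as a transition or transversion. The names `translate` and `snvNeighbors` are illustrative helpers, and stop codons are written as `*` here rather than the `X` used in the dbNSFP fields.

```javascript
// Standard genetic code, codons ordered with bases T, C, A, G at each position.
const BASES = "TCAG";
const AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
const translate = (codon) =>
  AA[16 * BASES.indexOf(codon[0]) + 4 * BASES.indexOf(codon[1]) + BASES.indexOf(codon[2])];

// A transition stays within purines (A, G) or within pyrimidines (C, T).
const PURINES = "AG";
const isTransition = (a, b) => PURINES.includes(a) === PURINES.includes(b);

// All 9 single-nucleotide neighbors of a codon.
function snvNeighbors(codon) {
  const out = [];
  for (let pos = 0; pos < 3; pos++) {
    for (const b of BASES) {
      if (b === codon[pos]) continue;
      const mutant = codon.slice(0, pos) + b + codon.slice(pos + 1);
      out.push({
        from: codon,
        to: mutant,
        aa: translate(mutant),
        kind: isTransition(codon[pos], b) ? "transition" : "transversion",
      });
    }
  }
  return out;
}

const reachable = new Set();
for (const codon of ["TGT", "TGC"]) {
  for (const n of snvNeighbors(codon)) {
    // Skip synonymous (C) and stop-gain (*) outcomes.
    if (n.aa !== "C" && n.aa !== "*") reachable.add(n.aa);
    console.log(`${n.from} → ${n.to}  ${n.aa}  ${n.kind}`);
  }
}
console.log([...reachable].sort().join(", ")); // F, G, R, S, W, Y
```

The reachable missense set is exactly the six alt AAs in the results table, so the ≥100-record threshold coincides with single-nucleotide accessibility from a Cys codon.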
4.4 Per-isoform first-element AA
We take the first non-missing element of dbnsfp.aa.ref and dbnsfp.aa.alt when these fields list multiple isoforms. Roughly 5% of records show a per-isoform mismatch.
4.5 Wilson CI assumes binomial sampling
Per-pair counts are treated as binomial. Under that assumption the Wilson 95% CI is appropriate (Brown et al. 2001).
4.6 ACMG-PP3/BP4 partial circularity
ClinVar Pathogenic / Benign labels are partly predictor-derived: PolyPhen / SIFT scores are used as PP3 or BP4 evidence under the ACMG/AMP guidelines (Richards et al. 2015). Some per-pair fractions therefore reflect predictor-curator co-variance.
4.7 N-threshold sensitivity
We use ≥100 total per pair. Cys-derived substitutions with < 100 records (C → A, C → V, C → L, C → I, etc.) are not analyzed. These substitutions require at least two nucleotide changes within the codon, so they cannot arise from a single-nucleotide variant and are infrequent.
5. Implications
- Cys-reference Pathogenic fractions are uniformly above 57% across the 6 analyzed pairs, much higher than the corpus-baseline ~28%.
- C → W is the most Pathogenic Cys substitution at 75.1% (Wilson CI [70.4, 79.2]).
- C → S is the least Pathogenic Cys substitution at 57.9% [54.6, 61.1], but still roughly 2× the corpus baseline.
- The 1.30× per-target-AA range within Cys-reference is narrower than Arg (4.2×) and Gly (2.2×) — Cys-reference is uniformly functionally-constrained.
- For variant-prioritization pipelines: each of the 6 analyzed C → alt missense substitutions should carry a > 50% Pathogenic prior; the per-target-AA chemistry modulates the magnitude (W > F > R > G > Y > S).
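One way a pipeline might consume these numbers is as a small lookup table of empirical priors. The sketch below is an assumption about usage rather than part of the published script: `cysMissensePrior` and the constant names are hypothetical, and the values are the point estimates from the results table.

```javascript
// Observed Pathogenic fractions from the results table, used as illustrative priors.
const CYS_PRIOR = new Map([
  ["W", 0.751], ["F", 0.693], ["R", 0.681],
  ["G", 0.646], ["Y", 0.638], ["S", 0.579],
]);
const CORPUS_BASELINE = 0.28; // approximate corpus-level Pathogenic fraction

// Returns the empirical prior for a Cys-reference missense change, or null when
// the reference is not Cys or the pair was not analyzed (fewer than 100 records).
function cysMissensePrior(ref, alt) {
  if (ref !== "C") return null;
  return CYS_PRIOR.get(alt) ?? null;
}

const prior = cysMissensePrior("C", "S");
console.log(prior, (prior / CORPUS_BASELINE).toFixed(2)); // 0.579 "2.07"
```

Returning null for unanalyzed pairs keeps the caller from silently applying a Cys prior where the data do not support one.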
6. Limitations
- Stop-gain excluded (§4.1).
- Cys-rich functional-motif curation focus (§4.2) — uniform high Pathogenicity partly reflects research-bias toward disulfide-rich genes.
- No codon-mutability normalization (§4.3).
- Per-isoform first-element AA (§4.4).
- N-threshold ≥ 100 (§4.7) excludes 2-step-codon-distance pairs.
- ACMG-PP3 partial circularity (§4.6).
7. Reproducibility
- Script: analyze.js (Node.js, ~60 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with per-target-AA counts, P-fractions, Wilson 95% CIs, mean relative positions.
- Verification mode: 6 machine-checkable assertions: (a) all P-fractions in [0, 1]; (b) Wilson CIs contain the point estimate; (c) all 6 reported pairs have N ≥ 100; (d) C→W P-fraction > 0.7; (e) C→S P-fraction > 0.5; (f) sample sizes match input.
node analyze.js
node analyze.js --verify
8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Sevier, C. S., & Kaiser, C. A. (2002). Formation and transfer of disulphide bonds in living cells. Nat. Rev. Mol. Cell Biol. 3, 836–847.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.
- Cooper, D. N., & Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease. Hum. Genet. 85, 55–74.
- Berg, J. M., & Shi, Y. (1996). The galvanization of biology: a growing appreciation for the roles of zinc. Science 271, 1081–1085. (Cys-Zn coordination reference.)
- Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science 185, 862–864.