Directional Asymmetry in ClinVar Pathogenicity of Reverse Amino-Acid Substitutions: 66.7% of 75 Well-Sampled Reciprocal AA-Pairs Show Non-Overlapping Wilson 95% CIs Between Forward and Reverse Direction (Median 12.68-Percentage-Point Gap; Maximum 47-pp for Met↔Arg) — L→P 66.23% vs P→L 20.14% Pathogenic Across 12,046 Variants Demonstrates the Helix-Breaker Loss-of-Function Asymmetry
Abstract
We test whether the Pathogenic fraction (P-fraction) of an amino-acid substitution depends on the direction of substitution, i.e., whether P-fraction(A→B) ≈ P-fraction(B→A), on the full ClinVar P + B missense subset (76,994 Pathogenic + 191,030 Benign single-nucleotide variants with valid amino-acid annotation in dbNSFP v4 (Liu et al. 2020) via MyVariant.info (Wu et al. 2021); stop-gain alt = X excluded). For each of the 190 unordered AA-pairs {A, B} with A ≠ B, we compute the per-direction P-fraction with its Wilson 95% CI and test whether the forward and reverse CIs overlap. Result: directional asymmetry is the rule, not the exception. Of the 75 unordered AA-pairs with n ≥ 100 variants in both directions, 50 pairs (66.7%) have non-overlapping forward-vs-reverse Wilson 95% CIs. The median absolute P-fraction gap between directions is 12.68 percentage points; the maximum is 47 pp (M→R 77.31% vs R→M 30.25%). The largest gaps consistently follow the loss-of-function asymmetry: introducing a structurally disruptive residue (Pro, a helix-breaker) or removing a structurally important one (Cys, losing a disulfide) is far more Pathogenic than the reverse direction. L→P 66.23% vs P→L 20.14% (12,046 total variants) is the canonical helix-breaker example. C→S 57.90% vs S→C 18.96% is the disulfide-loss example. For variant prioritization, per-pair P-fraction priors must be computed per direction, not as unordered-pair averages. Aggregated AA-pair statistics (e.g., {L, P} ≈ 35.10%) average across two very different directional cells (L→P 66.23% and P→L 20.14%) and substantially mislead per-variant priors.
1. Background
A common simplification in variant-effect summary statistics is to treat amino-acid substitutions as unordered pairs: the substitution L↔P or {L, P} is reported as a single statistic, averaging over both directions L→P and P→L. The implicit assumption is that the substitution direction does not strongly affect the functional consequence — i.e., that P-fraction(A→B) ≈ P-fraction(B→A).
This assumption is biologically suspect for several reasons:
- Loss-of-function asymmetry: introducing a "structurally disruptive" AA (e.g., Pro as a helix-breaker, glycine as a flexibility-introducer) is more functionally disruptive than the reverse direction (which removes the disruption).
- Functional-class-specific roles: cysteine residues participate in disulfide bridges and metal-coordination sites; losing a Cys (C→X) disrupts these; gaining a Cys (X→C) typically does not establish them de novo (because the partner Cys is also needed).
- Initiation-codon and termination-codon asymmetry: M→X may abrogate translation initiation; X→M may not establish initiation de novo.
This paper measures the magnitude of the directional asymmetry directly on the full ClinVar P + B missense subset.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info, with dbNSFP v4 annotation.
- For each variant: extract `dbnsfp.aa.ref` and `dbnsfp.aa.alt` (max-isoform if multiple).
- Exclude stop-gain (`alt = X`) and same-AA records.
After filtering: 268,024 missense SNVs (76,994 Pathogenic + 191,030 Benign) with valid AA annotation.
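The exclusion rules above can be sketched as a simple predicate. This is a minimal illustration, not the authors' `analyze.js`: the flattened record shape `{ ref, alt, label }` and the function name `isMissense` are assumptions introduced here, and the max-isoform selection step is not modeled.

```javascript
// Hypothetical flattened record shape (an assumption, not the MyVariant.info schema):
//   { ref: "L", alt: "P", label: "Pathogenic" | "Benign" }
const VALID_AA = /^[ACDEFGHIKLMNPQRSTVWY]$/;

// Keep only records whose ref and alt are two different standard amino acids.
// Stop-gain (alt = "X") fails the VALID_AA test; same-AA records fail ref !== alt.
function isMissense(rec) {
  return VALID_AA.test(rec.ref) && VALID_AA.test(rec.alt) && rec.ref !== rec.alt;
}

const sample = [
  { ref: "L", alt: "P", label: "Pathogenic" },
  { ref: "R", alt: "X", label: "Pathogenic" }, // stop-gain: excluded
  { ref: "A", alt: "A", label: "Benign" },     // same-AA: excluded
];
const kept = sample.filter(isMissense);
console.log(kept.length); // 1
```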
2.2 Per-direction cell tabulation
For each ordered AA-pair (ref, alt) with ref ≠ alt, count #Pathogenic and #Benign. Compute P-fraction = #P / (#P + #B). Compute Wilson 95% CI per cell (Brown et al. 2001).
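As a sketch of the tabulation described above, the following computes per-cell counts and a Wilson score interval. The function names and the `{ ref, alt, label }` record shape are illustrative assumptions, not taken from `analyze.js`; the Pathogenic count for L→P (2,589) is inferred by rounding 66.23% of 3,909.

```javascript
// Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%).
function wilsonInterval(successes, n, z = 1.96) {
  if (n === 0) return { low: 0, high: 1 };
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { low: center - half, high: center + half };
}

// Count Pathogenic (P) and Benign (B) per ordered (ref, alt) cell.
// Records are assumed to look like { ref, alt, label } (hypothetical shape).
function tabulate(records) {
  const cells = new Map();
  for (const { ref, alt, label } of records) {
    const key = `${ref}>${alt}`;
    const cell = cells.get(key) ?? { P: 0, B: 0 };
    if (label === "Pathogenic") cell.P += 1;
    else cell.B += 1;
    cells.set(key, cell);
  }
  return cells;
}

// P-fraction and Wilson bounds for one cell.
function cellStats({ P, B }) {
  const n = P + B;
  return { n, pFraction: P / n, ...wilsonInterval(P, n) };
}

// L→P cell: 3,909 variants, about 2,589 Pathogenic (count inferred by rounding).
const lp = cellStats({ P: 2589, B: 3909 - 2589 });
console.log(lp.low.toFixed(3), lp.high.toFixed(3)); // 0.647 0.677
```

The printed bounds agree with the [64.7, 67.7] interval reported for L→P in the results table.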
2.3 Forward-vs-reverse comparison
For each unordered pair {A, B} with A ≠ B, compare the (A→B) cell to the (B→A) cell. Restrict to pairs with both directions n ≥ 100 variants to ensure adequate power for the CI overlap test.
For each compared pair: compute the CI overlap = min(CI_high_fwd, CI_high_rev) − max(CI_low_fwd, CI_low_rev). If overlap < 0, the two CIs are non-overlapping.
Tabulate the fraction of pairs with non-overlapping CIs, the median absolute P-fraction gap, and the top 15 largest-gap pairs.
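The forward-vs-reverse comparison can be sketched as follows. This is a hedged illustration rather than the authors' script: the cell shape `{ n, p, low, high }`, the `"A>B"` key format, and the function names are assumptions; the example cells reuse values from the results table.

```javascript
const AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY".split("");

// Overlap width of two intervals; a negative value means they do not overlap.
function ciOverlap(fwd, rev) {
  return Math.min(fwd.high, rev.high) - Math.max(fwd.low, rev.low);
}

// Compare A>B against B>A for every unordered pair with both cells at n >= minN.
// `cells` maps "A>B" to { n, p, low, high } (hypothetical shape, fractions in [0, 1]).
function comparePairs(cells, minN = 100) {
  const out = [];
  for (let i = 0; i < AMINO_ACIDS.length; i++) {
    for (let j = i + 1; j < AMINO_ACIDS.length; j++) {
      const a = AMINO_ACIDS[i];
      const b = AMINO_ACIDS[j];
      const fwd = cells.get(`${a}>${b}`);
      const rev = cells.get(`${b}>${a}`);
      if (!fwd || !rev || fwd.n < minN || rev.n < minN) continue;
      out.push({
        pair: `${a}/${b}`,
        gap: fwd.p - rev.p,
        nonOverlapping: ciOverlap(fwd, rev) < 0,
      });
    }
  }
  return out;
}

function median(xs) {
  const s = [...xs].sort((x, y) => x - y);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Two pairs taken from the results table.
const cells = new Map([
  ["M>R", { n: 551, p: 0.7731, low: 0.736, high: 0.806 }],
  ["R>M", { n: 119, p: 0.3025, low: 0.227, high: 0.390 }],
  ["L>P", { n: 3909, p: 0.6623, low: 0.647, high: 0.677 }],
  ["P>L", { n: 8137, p: 0.2014, low: 0.193, high: 0.210 }],
]);
const results = comparePairs(cells);
const fractionNonOverlapping =
  results.filter((r) => r.nonOverlapping).length / results.length;
const medianAbsGap = median(results.map((r) => Math.abs(r.gap)));
console.log(results, fractionNonOverlapping, medianAbsGap);
```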
3. Results
3.1 Aggregate directional asymmetry
- Total unordered AA-pairs analyzed (both directions n ≥ 100): 75.
- Pairs with non-overlapping forward-vs-reverse Wilson 95% CIs: 50 / 75 = 66.7%.
- Median absolute P-fraction gap across directions: 12.68 percentage points.
- Mean absolute P-fraction gap: 14.48 pp.
The aggregate finding: two-thirds of well-sampled AA-pairs exhibit statistically-significant directional asymmetry at the Wilson 95% level. The asymmetry is not a tail-of-distribution effect but a typical pattern.
3.2 Top 15 largest-gap pairs
| Forward (fwd) | Fwd N | Fwd P-fraction (CI) | Reverse (rev) | Rev N | Rev P-fraction (CI) | Gap (fwd − rev) |
|---|---|---|---|---|---|---|
| M→R | 551 | 77.31% [73.6, 80.6] | R→M | 119 | 30.25% [22.7, 39.0] | +47.06 pp |
| L→P | 3,909 | 66.23% [64.7, 67.7] | P→L | 8,137 | 20.14% [19.3, 21.0] | +46.09 pp |
| C→S | 867 | 57.90% [54.6, 61.1] | S→C | 1,139 | 18.96% [16.8, 21.3] | +38.94 pp |
| C→R | 1,529 | 68.15% [65.8, 70.4] | R→C | 7,175 | 32.53% [31.5, 33.6] | +35.62 pp |
| K→M | 168 | 32.74% [26.1, 40.2] | M→K | 454 | 68.06% [63.6, 72.2] | −35.32 pp |
| L→Q | 408 | 56.62% [51.8, 61.3] | Q→L | 373 | 21.98% [18.1, 26.5] | +34.63 pp |
| S→Y | 587 | 35.09% [31.3, 39.0] | Y→S | 319 | 68.97% [63.7, 73.8] | −33.87 pp |
| P→R | 1,763 | 31.93% [29.8, 34.1] | R→P | 1,641 | 63.07% [60.7, 65.4] | −31.14 pp |
| I→K | 100 | 64.00% [54.2, 72.7] | K→I | 121 | 33.88% [26.1, 42.7] | +30.12 pp |
| A→P | 1,397 | 44.31% [41.7, 46.9] | P→A | 1,728 | 15.80% [14.2, 17.6] | +28.51 pp |
| R→W | 5,691 | 35.27% [34.0, 36.5] | W→R | 948 | 62.66% [59.5, 65.7] | −27.39 pp |
| H→P | 515 | 53.98% [49.7, 58.2] | P→H | 691 | 26.92% [23.7, 30.3] | +27.06 pp |
| E→V | 504 | 40.48% [36.3, 44.8] | V→E | 358 | 65.36% [60.3, 70.1] | −24.89 pp |
| L→M | 468 | 15.60% [12.6, 19.2] | M→L | 1,022 | 40.31% [37.3, 43.4] | −24.71 pp |
| F→S | 958 | 57.41% [54.3, 60.5] | S→F | 1,813 | 32.87% [30.7, 35.1] | +24.54 pp |
All 15 listed pairs have non-overlapping Wilson 95% CIs between fwd and rev direction.
3.3 The helix-breaker asymmetry: L→P vs P→L
The largest asymmetry among heavily sampled pairs (12,046 variants in total) is L→P vs P→L:
- L→P (Leu → Pro, introducing the helix-breaker): 66.23% Pathogenic across 3,909 variants.
- P→L (Pro → Leu, removing the helix-breaker): 20.14% Pathogenic across 8,137 variants.
- Gap: +46.09 pp.
Mechanism: proline lacks the backbone amide hydrogen needed for α-helix hydrogen-bonding and has a constrained backbone dihedral. Introducing Pro into a hydrophobic α-helical position breaks the helix and disrupts protein folding. Removing Pro in the reverse direction restores normal helical capacity and is typically tolerated.
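The same pattern is seen for A→P vs P→A (44.31% vs 15.80%; +28.51 pp gap) and H→P vs P→H (53.98% vs 26.92%; +27.06 pp gap), and for R→P vs P→R (63.07% vs 31.93%), which appears in the table with Pro as the forward source. In every Pro-containing pair in the top 15, the introduce-Pro direction is the more Pathogenic one and the remove-Pro direction is comparatively tolerated.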
3.4 The disulfide-loss asymmetry: C→S vs S→C
The cysteine-loss asymmetry is the second cleanest case:
- C→S (Cys → Ser, losing the disulfide-bond capacity): 57.90% Pathogenic across 867 variants.
- S→C (Ser → Cys, gaining a Cys but typically without partner): 18.96% Pathogenic across 1,139 variants.
- Gap: +38.94 pp.
Mechanism: cysteine residues participate in disulfide bridges that stabilize tertiary protein structure; losing a Cys breaks the disulfide and destabilizes the protein. Gaining a Cys is typically tolerated because the new Cys lacks a partner Cys to form a bridge.
A related case: C→R vs R→C (68.15% vs 32.53%; +35.62 pp gap). Cys → Arg both loses the disulfide capacity and introduces a charged side-chain, which is highly disruptive. Arg → Cys removes the charge and adds a potentially reactive thiol, which is less disruptive on average.
3.5 The asymmetry is bidirectional
Of the 50 non-overlapping pairs:
- 27 pairs have positive gap (forward direction A→B is more Pathogenic than reverse).
- 23 pairs have negative gap (forward is less Pathogenic than reverse).
The asymmetry is bidirectional: there is no universal pattern in which the alphabetically-first AA is always the more Pathogenic source. The direction of asymmetry is chemistry-class-driven: substitutions that introduce a structurally disruptive residue (e.g., Pro, or a charged residue replacing a hydrophobic one, as in I→K and V→E) or remove a structurally important one (Cys) are more Pathogenic than their reverses.
3.6 Implications for variant-prioritization priors
Aggregated unordered-pair statistics substantially mislead per-variant priors. For example: the unordered pair {L, P} has an aggregate P-fraction of:
- (3,909 × 0.6623 + 8,137 × 0.2014) / (3,909 + 8,137) = 4,227.72 / 12,046 = 0.3510 = 35.10%.
But the forward direction L→P has P-fraction 66.23% — nearly 2× the aggregate. The reverse direction P→L has P-fraction 20.14% — barely over half the aggregate.
A per-variant prior derived from the aggregate {L, P} statistic would substantially under-call L→P variants (true prior 66%, aggregate prior 35%, a factor of about 1.9) and over-call P→L variants (true prior 20%, aggregate prior 35%, a factor of about 1.7).
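The arithmetic in this section can be reproduced from the table's counts and fractions alone. The sketch below is illustrative; the object and function names are introduced here and are not part of the original analysis.

```javascript
// Directional cells for {L, P}, taken from the results table.
const lToP = { n: 3909, pFraction: 0.6623 };
const pToL = { n: 8137, pFraction: 0.2014 };

// Count-weighted average of two directional P-fractions,
// i.e. the P-fraction of the pooled unordered pair.
function pooledFraction(a, b) {
  return (a.n * a.pFraction + b.n * b.pFraction) / (a.n + b.n);
}

const pooled = pooledFraction(lToP, pToL);
const underCallFactor = lToP.pFraction / pooled; // how far the pooled prior undershoots L→P
const overCallFactor = pooled / pToL.pFraction;  // how far the pooled prior overshoots P→L

console.log((100 * pooled).toFixed(2) + "%");
console.log(underCallFactor.toFixed(2), overCallFactor.toFixed(2));
```

Because P→L has roughly twice as many variants as L→P, the pooled value of about 35% sits closer to the P→L fraction, which is why the under-call for L→P is the larger of the two errors.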
For variant-prioritization pipelines: per-pair P-fraction priors should always be computed per-direction, not as unordered-pair averages.
3.7 Modern predictors implicitly handle directionality
AlphaMissense and REVEL are per-variant predictors that capture directionality through their input features (the per-variant context includes the ref AA, alt AA, and protein position). The directional-asymmetry signal we report here is therefore implicitly captured by modern machine-learning predictors. The point of this paper is that simple unordered-pair summary statistics, which are still common in variant-effect literature, mislead by averaging over very different per-direction P-fractions.
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 The codon-reachability asymmetry
Forward and reverse AA substitutions are not equally reachable at the codon level. Single-nucleotide reachability itself is symmetric: L → P arises via CTN → CCN (a single position-2 change) and P → L via CCN → CTN (the same position, reversed), and both are T↔C transitions.
What differs by direction is the share of source codons that can make the change: all four Pro codons (CCN) can reach Leu, but only four of the six Leu codons (CTN, not TTA or TTG) can reach Pro. The codon-reachability asymmetry is therefore partial; we have not computed a full per-codon-degeneracy correction here.
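One way to tabulate codon-level reachability is sketched below, assuming the standard genetic code (written as the usual 64-character string in TCAG order). The function names are illustrative and this is not the per-codon-degeneracy correction itself, only the counting step it would start from.

```javascript
const BASES = "TCAG";
// Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
const AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

const CODON_TABLE = new Map();
for (let i = 0; i < 4; i++)
  for (let j = 0; j < 4; j++)
    for (let k = 0; k < 4; k++)
      CODON_TABLE.set(BASES[i] + BASES[j] + BASES[k], AA[16 * i + 4 * j + k]);

// All nine codons that differ from `codon` at exactly one position.
function neighbors(codon) {
  const out = [];
  for (let pos = 0; pos < 3; pos++)
    for (const b of BASES)
      if (b !== codon[pos]) out.push(codon.slice(0, pos) + b + codon.slice(pos + 1));
  return out;
}

// How many codons of `from` can reach some codon of `to` in one change.
function sourceCodons(from, to) {
  const codons = [...CODON_TABLE].filter(([, aa]) => aa === from).map(([c]) => c);
  const reaching = codons.filter((c) =>
    neighbors(c).some((m) => CODON_TABLE.get(m) === to)
  );
  return { reaching: reaching.length, total: codons.length };
}

// Unordered amino-acid pairs connected by at least one single-nucleotide change.
function reachablePairs() {
  const pairs = new Set();
  for (const [codon, aa] of CODON_TABLE) {
    if (aa === "*") continue;
    for (const m of neighbors(codon)) {
      const alt = CODON_TABLE.get(m);
      if (alt !== "*" && alt !== aa) pairs.add([aa, alt].sort().join(""));
    }
  }
  return pairs;
}

console.log(sourceCodons("L", "P")); // { reaching: 4, total: 6 }
console.log(sourceCodons("P", "L")); // { reaching: 4, total: 4 }
console.log(reachablePairs().size);  // 75
```

Under this code, any pair reachable in one direction is reachable in the other; the directional difference lies in the share of source codons that can make the change, and in the mutation rates of the underlying nucleotide changes, which this sketch does not model.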
4.3 The CpG-hotspot effect biases R-involving pairs
Four of the six arginine codons are CGN (CpG-containing), so R positions are mutational hotspots due to CpG deamination. R→X substitutions at these sites (R→C, R→W, R→Q, R→H) recur often enough to be observed in population databases, which inflates their Benign counts; the reverse X→R substitutions are rarer because the source codons are not CpG-rich. This contributes to the asymmetry observed in pairs like R→C (32.53%) vs C→R (68.15%).
The CpG-hotspot effect does not undermine the asymmetry finding; it explains the mechanism for one subset of pairs.
4.4 ClinVar curator labels are not gold-standard
Some labels are wrong. The reported asymmetries reflect curator-assignment patterns and may include some classification noise.
4.5 The n ≥ 100 threshold for both directions is conservative
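We require n ≥ 100 in both forward and reverse cells to ensure adequate Wilson CI precision. Lower thresholds (e.g., n ≥ 30) could include more pairs but with wider CIs. Of the 190 unordered pairs, 75 satisfy the n ≥ 100 threshold; in the remaining 115, one or both directions has < 100 variants. The count of 75 coincides with the number of amino-acid pairs interconvertible by a single-nucleotide change under the standard genetic code, so most of the excluded pairs are likely sparse because a missense SNV cannot produce them, not merely because they are rare.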
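4.6 Wilson CI assumes independent draws
The Wilson 95% CI is accurate at our cell sizes (smallest n = 100; largest n > 8,000), so the large-sample conditions are satisfied. Independence is a separate assumption: the interval treats each variant as an independent draw, whereas variants clustered in the same gene or domain may be correlated. We do not model such correlation here.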
4.7 The ascertainment-vs-mechanism distinction
The directional asymmetry combines two distinct mechanisms: (a) biological mechanism (introducing a disruptive AA is more deleterious than removing it); (b) ascertainment bias (CpG-hotspot-related differential variant-frequency in population databases). We do not separate these mechanisms quantitatively here; the reported asymmetries reflect the combination.
5. Implications
- Two-thirds (66.7%) of well-sampled AA-pair substitutions exhibit statistically-significant directional asymmetry in ClinVar Pathogenic-fraction (Wilson 95% CI overlap test).
- The median absolute directional gap is 12.68 percentage points; the maximum is 47 pp (M→R vs R→M).
- The largest gaps follow the loss-of-function chemistry pattern: introducing Pro (helix-breaker), losing Cys (disulfide-loss), and other structurally-disruptive substitutions are more Pathogenic than the reverse direction.
- For variant-prioritization pipelines: per-pair P-fraction priors must be computed per-direction, not as unordered-pair averages, because the aggregate misleads by factor 1.5–2× in either direction.
- Modern per-variant predictors (e.g., AlphaMissense, REVEL) implicitly handle directionality; the warning here is for simple unordered-pair summary statistics that are still common in variant-effect literature.
6. Limitations
- Stop-gain excluded (§4.1).
- Codon-reachability asymmetry is partial (§4.2).
- CpG-hotspot effect confounds R-involving pairs (§4.3).
- ClinVar curator labels are not gold-standard (§4.4).
- n ≥ 100 threshold restricts to 75 of 190 pairs (§4.5).
- Wilson CI assumes independent draws (§4.6); cell sizes are adequate for the interval, but independence between variants is not verified.
- Ascertainment vs mechanism not separated (§4.7).
7. Reproducibility
- Script: `analyze.js` (Node.js, ~50 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: `result.json` with per-direction cell counts, Wilson 95% CIs, fwd-rev CI-overlap test results, and the top-15 largest-gap pairs.
- Verification mode: 5 machine-checkable assertions: (a) 75 pairs analyzed; (b) >60% pairs non-overlapping; (c) L→P P-fraction > 60%; (d) P→L P-fraction < 25%; (e) median abs gap > 10 pp.

    node analyze.js
    node analyze.js --verify

8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
- Cooper, D. N., & Krawczak, M. (1990). The mutational spectrum of single base-pair substitutions causing human genetic disease. Hum. Genet. 85, 55–74.
- MacArthur, D. G., & Tyler-Smith, C. (2010). Loss-of-function variants in the genomes of healthy humans. Hum. Mol. Genet. 19, R125–R130.
- Karczewski, K. J., et al. (2020). gnomAD constraint spectrum. Nature 581, 434–443.
- Cheng, J., et al. (2023). AlphaMissense. Science 381, eadg7492.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.