57.45% of ClinVar Pathogenic Variants Receive Zero AlphaMissense Scores in dbNSFP Annotation Versus Only 2.44% of Benign — A 23.56× Pathogenic-to-Benign Share Ratio for the No-AM-Score Subset That Quantifies AlphaMissense's Missense-Only Scoring Boundary
Abstract
We compute the per-variant AlphaMissense (AM; Cheng et al. 2023) score-array length distribution for ClinVar Pathogenic and Benign single-nucleotide variants (Landrum et al. 2018), with Wilson 95% confidence intervals (Wilson 1927) on the per-class shares. Method: for each of 178,509 Pathogenic and 194,418 Benign variants annotated by dbNSFP v4 (Liu et al. 2020) via MyVariant.info (Wu et al. 2021), count the number of valid (non-null) AM scores in the dbnsfp.alphamissense.score field per variant. Variants with no AM score (null or empty array) get N_AM = 0; variants with AM scores from one or more transcript isoforms get N_AM ≥ 1. Result: 102,557 of 178,509 Pathogenic variants (57.45%) receive zero AM scores, versus only 4,741 of 194,418 Benign variants (2.44%), a Pathogenic-to-Benign share ratio of 23.56× (Wilson 95% CIs on per-class shares: P 57.22–57.68%; B 2.37–2.51%; the intervals are separated by ~55 percentage points). The mechanism is well established: AlphaMissense is a missense-specific predictor and does not score stop-gain (aa.alt = X), splice-region, intron, or non-coding variants. ClinVar Pathogenic variants are heavily enriched in stop-gain (~36% of Pathogenic records in our cache carry aa.alt = X; the remaining unscored Pathogenic variants may be splice, intron, or non-coding submissions), whereas ClinVar Benign variants are predominantly population-derived missense substitutions, which AM scores. The 23.56× share ratio quantifies the size of AlphaMissense's missense-only scoring boundary for any pipeline that aggregates ClinVar variants by class. The methodological consequence: any benchmark of AlphaMissense on a "ClinVar Pathogenic vs Benign" set must filter to the AM-scored subset (N_AM ≥ 1). The unfiltered Pathogenic set is 2.4× larger than the AM-scored Pathogenic set (178,509 vs 75,952), so an AUC computed without this filter depends on how the missing scores are handled and would be biased.
For variant-effect-predictor evaluation, we recommend reporting the per-variant N_AM explicitly; benchmarks should also report the AM-scoring coverage rate per class as a methodological audit number.
1. Background
AlphaMissense (Cheng et al. 2023) is a deep-learning predictor of missense pathogenicity, designed to score single-amino-acid substitutions in proteins. The predictor outputs a score in [0, 1] for each missense variant in each transcript isoform of the target gene; for a single genomic variant, the dbNSFP v4 (Liu et al. 2020) annotation aggregates AM scores across all isoforms in which the variant is missense.
AM does not score:
- Stop-gain (`aa.alt = X`): by design (AM is missense-specific).
- Splice-region, intron, and non-coding variants: these are not missense by definition.
- Variants without a valid transcript-isoform missense interpretation, e.g., a SNV that is missense in 0 isoforms because the SNV falls in a UTR.
The per-variant N_AM (number of AM scores) field therefore directly reflects whether AM evaluates the variant. This paper measures the per-class N_AM distribution and quantifies the no-AM-score subset.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info, with dbNSFP v4 annotation.
- For each variant: extract `dbnsfp.alphamissense.score`. N_AM = `Array.isArray(score) ? score.filter(x => x != null).length : (score != null ? 1 : 0)`.
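The N_AM rule above can be sketched as a small helper. This is an illustrative version, not the contents of `analyze.js`: the function name `countAmScores` and the example records are hypothetical, and the record layout is assumed to mirror the `dbnsfp.alphamissense.score` path named in the text.

```javascript
// Count valid (non-null) AlphaMissense scores for one variant record.
// Assumption: the score sits at record.dbnsfp.alphamissense.score and may be
// a scalar, an array (one entry per transcript isoform), or absent.
function countAmScores(record) {
  const score = record?.dbnsfp?.alphamissense?.score;
  if (Array.isArray(score)) {
    // Null or undefined entries are placeholders, not scores.
    return score.filter((x) => x != null).length;
  }
  return score != null ? 1 : 0;
}

// Hypothetical records, for illustration only.
const examples = [
  { dbnsfp: { alphamissense: { score: [0.91, null, 0.88] } } }, // two valid scores
  { dbnsfp: { alphamissense: { score: 0.12 } } },               // scalar score
  { dbnsfp: { alphamissense: { score: [] } } },                 // empty array
  { dbnsfp: {} },                                               // field absent
];

console.log(examples.map(countAmScores)); // [ 2, 1, 0, 0 ]
```

Optional chaining keeps records with a missing `dbnsfp` or `alphamissense` object in the N_AM = 0 bin instead of throwing.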
2.2 Per-class N_AM distribution
Bin variants by N_AM ∈ {0, 1, 2, ..., 50}. Per bin:
- `n_P`, `n_B` = per-class count.
- `P_share = n_P / total_P`, `B_share = n_B / total_B` (share within class).
- `P/B share ratio = P_share / B_share`.
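A minimal sketch of the binning step follows. The input shape (`{ label, nAm }` with `label` equal to `"P"` or `"B"`), the function name `binByNam`, and the toy data are assumptions made for illustration. The sketch bins by the exact N_AM value and returns `null` for the ratio when the Benign share is zero; how the original script handles that case is not stated.

```javascript
// Tally per-class counts by N_AM, then derive within-class shares and the
// P/B share ratio for each bin.
function binByNam(variants) {
  const totals = { P: 0, B: 0 };
  const bins = new Map(); // N_AM -> { nP, nB }
  for (const { label, nAm } of variants) {
    if (label !== "P" && label !== "B") continue; // ignore other classes
    if (!bins.has(nAm)) bins.set(nAm, { nP: 0, nB: 0 });
    const bin = bins.get(nAm);
    if (label === "P") bin.nP += 1;
    else bin.nB += 1;
    totals[label] += 1;
  }
  return [...bins.entries()]
    .sort(([a], [b]) => a - b)
    .map(([nAm, { nP, nB }]) => {
      const pShare = nP / totals.P; // share within the Pathogenic class
      const bShare = nB / totals.B; // share within the Benign class
      const ratio = bShare > 0 ? pShare / bShare : null;
      return { nAm, nP, nB, pShare, bShare, ratio };
    });
}

// Toy input: 4 Pathogenic and 4 Benign variants (hypothetical).
const toy = [
  { label: "P", nAm: 0 }, { label: "P", nAm: 0 },
  { label: "P", nAm: 1 }, { label: "P", nAm: 2 },
  { label: "B", nAm: 0 }, { label: "B", nAm: 1 },
  { label: "B", nAm: 1 }, { label: "B", nAm: 1 },
];
const rows = binByNam(toy);
console.log(rows[0]); // N_AM = 0: pShare 0.5, bShare 0.25, ratio 2
```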
2.3 Wilson 95% CI
Per-class share p̂ = k/n, Wilson 95% CI (Wilson 1927; Brown et al. 2001):
`CI = (p̂ + z²/(2n) ± z·√(p̂(1−p̂)/n + z²/(4n²))) / (1 + z²/n)`, with z = 1.96.
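The interval formula can be checked against the headline counts with a short function. The name `wilsonCI` and the formatting helper `pct` are illustrative choices, not taken from the original script.

```javascript
// Wilson score interval for a binomial proportion k/n.
function wilsonCI(k, n, z = 1.96) {
  const p = k / n;
  const z2 = z * z;
  const center = p + z2 / (2 * n);
  const half = z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  const denom = 1 + z2 / n;
  return { p, lower: (center - half) / denom, upper: (center + half) / denom };
}

const pct = (x) => (100 * x).toFixed(2);

// Counts of N_AM = 0 variants per class, as reported in the abstract.
const pathogenic = wilsonCI(102557, 178509);
const benign = wilsonCI(4741, 194418);

console.log(pct(pathogenic.p), pct(pathogenic.lower), pct(pathogenic.upper)); // 57.45 57.22 57.68
console.log(pct(benign.p), pct(benign.lower), pct(benign.upper));             // 2.44 2.37 2.51
console.log((pathogenic.p / benign.p).toFixed(2));                            // 23.56
```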
3. Results
3.1 The headline N_AM = 0 subset
| Metric | Pathogenic | Benign |
|---|---|---|
| Variants with N_AM = 0 (no AM score) | 102,557 | 4,741 |
| Per-class share (Wilson 95% CI) | 57.45% [57.22, 57.68] | 2.44% [2.37, 2.51] |
| Pathogenic-to-Benign share ratio | — | 23.56× |
The Wilson 95% CIs on the per-class shares are separated by ~55 percentage points (lower bound 57.22% for Pathogenic vs upper bound 2.51% for Benign). With more than 178,000 variants per class, sampling noise is negligible relative to the 23.56× share ratio.
3.2 The full N_AM distribution (selected bins)
| N_AM | n_P | n_B | %P | %B | P/B share ratio |
|---|---|---|---|---|---|
| 0 | 102,557 | 4,741 | 57.45% | 2.44% | 23.56× |
| 1 | 20,045 | 68,435 | 11.23% | 35.20% | 0.32× |
| 2 | 19,691 | 50,404 | 11.03% | 25.93% | 0.43× |
| 3 | 11,454 | 27,879 | 6.42% | 14.34% | 0.45× |
| 4 | 8,847 | 17,346 | 4.96% | 8.92% | 0.56× |
| 5 | 4,530 | 8,934 | 2.54% | 4.60% | 0.55× |
| 6 | 3,342 | 4,767 | 1.87% | 2.45% | 0.76× |
| 7 | 2,650 | 3,253 | 1.48% | 1.67% | 0.89× |
| 8–11 | (smaller bins) | (smaller bins) | (subpercent) | (subpercent) | (0.5–1.0×) |
| 12 | 473 | 369 | 0.26% | 0.19% | 1.40× |
| 16 | 248 | 219 | 0.14% | 0.11% | 1.23× |
| 33 | 18 | 9 | 0.01% | 0.005% | 2.18× |
3.3 The N_AM ≥ 1 subset
For variants with at least one AM score (N_AM ≥ 1):
- 75,952 Pathogenic variants (42.55% of all Pathogenic) — the AM-scoreable Pathogenic subset.
- 189,677 Benign variants (97.56% of all Benign) — the AM-scoreable Benign subset.
- The class-balance shifts from 0.92:1 (P:B for the full corpus) to 0.40:1 (P:B for the AM-scored subset).
This 2.3× class-balance shift has implications for AUC benchmarking: Mann-Whitney U AUC is invariant to class proportions, as is recall (which is computed within the positive class), but precision and F1 are not. Any reported AUC for AlphaMissense on "ClinVar P vs B" must specify whether the denominator is the full ClinVar set or the AM-scored subset.
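The proportion-invariance of AUC can be illustrated with toy scores. The sketch below uses hypothetical numbers, not AlphaMissense output, and a brute-force pairwise AUC that is only practical for small inputs. It isolates the effect of class proportions alone; filtering to the AM-scored subset also changes which variants are present, which this example does not model.

```javascript
// AUC as the Mann-Whitney probability that a random positive outscores a
// random negative (ties count one half).
function auc(pos, neg) {
  let wins = 0;
  for (const p of pos) {
    for (const q of neg) {
      wins += p > q ? 1 : p === q ? 0.5 : 0;
    }
  }
  return wins / (pos.length * neg.length);
}

// Precision at a fixed score threshold: TP / (TP + FP).
function precisionAt(pos, neg, threshold) {
  const tp = pos.filter((s) => s >= threshold).length;
  const fp = neg.filter((s) => s >= threshold).length;
  return tp / (tp + fp);
}

// Hypothetical scores for 4 positives and 4 negatives.
const pos = [0.9, 0.8, 0.6, 0.4];
const neg = [0.7, 0.3, 0.2, 0.1];
// Same negative score distribution, three times as many negatives.
const negTripled = [...neg, ...neg, ...neg];

console.log(auc(pos, neg), auc(pos, negTripled));                        // 0.875 0.875
console.log(precisionAt(pos, neg, 0.5), precisionAt(pos, negTripled, 0.5)); // 0.75 0.5
```

Tripling the negatives leaves AUC at 0.875 while precision at the 0.5 threshold drops from 0.75 to 0.5.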
3.4 The mechanism
The N_AM = 0 subset for Pathogenic is heavily dominated by stop-gain variants (aa.alt = X). The dbNSFP v4 convention: stop-gain records receive an aa.alt = X annotation but no AlphaMissense score, because AM is missense-specific.
In our independent-substitution-class analyses (companion-internal counting), 36.4% of Pathogenic variants in our cache carry aa.alt = X. Subtracting this from the 57.45% N_AM = 0 share leaves ~21% of all Pathogenic variants (57.45% minus 36.4%) that are unscored for other reasons. These are likely:
- Splice-region or intron variants flagged as ClinVar-Pathogenic submissions (occasionally mis-classified as missense by upstream annotation).
- Variants in non-canonical isoforms not present in AM's training-set transcripts.
- Variants in genes that AM does not cover (small fraction).
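The residual can be put in absolute terms with a few lines of arithmetic. This is a rough estimate that assumes every `aa.alt = X` Pathogenic record falls in the N_AM = 0 bin, as the dbNSFP convention described above implies, and that the 36.4% figure applies to the same 178,509-variant denominator. The variable names are illustrative.

```javascript
const totalPathogenic = 178509;  // all Pathogenic variants in the cache
const unscoredPathogenic = 102557; // Pathogenic variants with N_AM = 0
const stopGainShare = 0.364;     // share of Pathogenic carrying aa.alt = X

// Approximate stop-gain count (assumption: all of them have N_AM = 0).
const stopGainCount = Math.round(stopGainShare * totalPathogenic);
// Unscored Pathogenic variants not explained by stop-gain.
const residualCount = unscoredPathogenic - stopGainCount;
const residualShare = residualCount / totalPathogenic;

console.log(stopGainCount, residualCount, (100 * residualShare).toFixed(2));
// 64977 37580 21.05
```

Under these assumptions roughly 37,600 Pathogenic variants are unscored for reasons other than stop-gain.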
3.5 The N_AM ≥ 1 distribution shape
For the N_AM ≥ 1 subset, the P/B share ratio rises gradually from 0.32 at N_AM = 1 to ~0.89 at N_AM = 7, then fluctuates between roughly 0.5 and 1.4 in the tabulated tail bins (N_AM = 8 to 16); very sparse bins are noisier (2.18× at N_AM = 33, from 18 Pathogenic vs 9 Benign variants). The qualitative pattern: variants annotated to many transcript isoforms tend to be in well-curated genes (where both Pathogenic and Benign submissions are abundant), so the per-class P/B ratio approaches 1 in the high-N_AM bins.
4. Confound analysis
4.1 N_AM = 0 includes multiple variant-mechanism types
We do not distinguish stop-gain from splice-region from intron-non-coding within the N_AM = 0 subset. The 57.45% Pathogenic N_AM = 0 share is the joint contribution of all such variant types. Disambiguating them would require additional annotation (Sequence Ontology consequence terms), which is out of scope.
4.2 ClinVar curatorial bias
Pathogenic submissions are over-represented in ClinVar for stop-gain variants (ACMG-PVS1 evidence; Richards et al. 2015). The 23.56× P/B share-ratio for N_AM = 0 partly reflects this curator-encoded preference for stop-gain Pathogenic classification.
4.3 dbNSFP version dependency
The AM score-array per variant depends on the dbNSFP version. Different releases may include different transcript isoforms. The reported N_AM distribution is from the current MyVariant.info / dbNSFP cache.
4.4 Wilson CI assumes binomial sampling
We treat the per-class N_AM = 0 counts as binomial (each variant independently scored or unscored), for which the Wilson 95% CI is appropriate (Brown et al. 2001).
4.5 The "missense-only" boundary is well-known
This paper does not claim discovery of AM's missense-only boundary; the boundary is published in Cheng et al. (2023). The contribution is the quantitative size (23.56× share-ratio; 102,557 absolute Pathogenic count) of the no-AM-score subset in a typical ClinVar-derived benchmark cache.
5. Implications
- 57.45% of ClinVar Pathogenic variants in the dbNSFP v4 / MyVariant.info cache receive no AlphaMissense score (Wilson 95% CI [57.22, 57.68]).
- Only 2.44% of Benign variants are unscored by AM (Wilson CI [2.37, 2.51]) — a 23.56× P-vs-B share ratio.
- For AlphaMissense AUC benchmarking on ClinVar: the AM-scoreable subset is 75,952 P + 189,677 B (P/B = 0.40:1) vs the full set 178,509 P + 194,418 B (P/B = 0.92:1). The class-balance shift is 2.3×.
- Pipelines should report per-variant N_AM as a coverage audit number alongside corpus-level AUC.
- The N_AM = 0 subset is dominantly stop-gain Pathogenic — interpretation requires a separate stop-gain-specific predictor, not AM.
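The coverage audit numbers in this list follow directly from the per-class totals and N_AM = 0 counts. A minimal sketch, assuming a hypothetical `coverageAudit` helper and input shape:

```javascript
// Derive per-class AM-scoring coverage and the class-balance shift from
// per-class totals and unscored (N_AM = 0) counts.
function coverageAudit({ P, B }) {
  const scoredP = P.total - P.unscored;
  const scoredB = B.total - B.unscored;
  const fullRatio = P.total / B.total; // P:B in the full corpus
  const scoredRatio = scoredP / scoredB; // P:B in the AM-scored subset
  return {
    scoredP,
    scoredB,
    coverageP: scoredP / P.total,
    coverageB: scoredB / B.total,
    fullRatio,
    scoredRatio,
    shift: fullRatio / scoredRatio,
  };
}

// Counts reported in sections 3.1 and 3.3.
const audit = coverageAudit({
  P: { total: 178509, unscored: 102557 },
  B: { total: 194418, unscored: 4741 },
});

console.log(audit.scoredP, audit.scoredB); // 75952 189677
console.log((100 * audit.coverageP).toFixed(2), (100 * audit.coverageB).toFixed(2)); // 42.55 97.56
console.log(audit.fullRatio.toFixed(2), audit.scoredRatio.toFixed(2), audit.shift.toFixed(1)); // 0.92 0.40 2.3
```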
6. Limitations
- N_AM = 0 includes multiple mechanisms (§4.1) — stop-gain, splice, intron, etc.
- ClinVar curatorial bias (§4.2) — ACMG-PVS1 weighting drives stop-gain Pathogenic submissions.
- dbNSFP version dependency (§4.3).
- The missense-only boundary is well-known (§4.5) — this paper quantifies it, does not discover it.
7. Reproducibility
- Script: `analyze.js` (Node.js, ~50 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info (372,927 records).
- Outputs: `result.json` with per-N_AM counts, per-class shares, Wilson 95% CIs, P/B ratios.
- Verification mode: 5 machine-checkable assertions: (a) Σ per-bin counts per class = total per class; (b) all per-bin shares in [0, 1]; (c) Wilson CIs contain the point estimate; (d) N_AM=0 P/B share ratio > 10; (e) sample sizes match input file contents.
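The five assertions could be implemented along the following lines. The schema of `result.json` is not given in this paper, so every field name here (`totalP`, `bins`, `shareP`, `ciP`, and so on), the `verify` function itself, and the two-row example are hypothetical; the sketch shows the shape of the checks only.

```javascript
// Run checks (a) to (e) against a result object; throws on the first failure.
function verify(result, input) {
  const check = (ok, msg) => {
    if (!ok) throw new Error(`verify failed: ${msg}`);
  };
  const sum = (key) => result.bins.reduce((acc, bin) => acc + bin[key], 0);

  // (a) per-bin counts sum to the class totals
  check(sum("nP") === result.totalP, "(a) Pathogenic bin counts");
  check(sum("nB") === result.totalB, "(a) Benign bin counts");

  for (const bin of result.bins) {
    for (const cls of ["P", "B"]) {
      const share = bin[`share${cls}`];
      const [lo, hi] = bin[`ci${cls}`];
      check(share >= 0 && share <= 1, `(b) share range, bin ${bin.nAm}`);
      check(lo <= share && share <= hi, `(c) CI contains estimate, bin ${bin.nAm}`);
    }
  }

  // (d) headline ratio for the N_AM = 0 bin
  const zero = result.bins.find((bin) => bin.nAm === 0);
  check(zero !== undefined && zero.shareP / zero.shareB > 10, "(d) N_AM=0 ratio");

  // (e) totals match the input file contents
  check(result.totalP === input.nP && result.totalB === input.nB, "(e) sample sizes");
  return true;
}

// Hypothetical result with all N_AM >= 1 bins collapsed into one row for brevity.
const result = {
  totalP: 178509,
  totalB: 194418,
  bins: [
    { nAm: 0, nP: 102557, nB: 4741, shareP: 0.5745, ciP: [0.5722, 0.5768], shareB: 0.0244, ciB: [0.0237, 0.0251] },
    { nAm: ">=1", nP: 75952, nB: 189677, shareP: 0.4255, ciP: [0.4232, 0.4278], shareB: 0.9756, ciB: [0.9749, 0.9763] },
  ],
};

console.log(verify(result, { nP: 178509, nB: 194418 })); // true
```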
node analyze.js
node analyze.js --verify
8. References
- Cheng, J., et al. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492.
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.
- Abou Tayoun, A. N., et al. (2018). Recommendations for interpreting the loss of function PVS1 ACMG/AMP variant criterion. Hum. Mutat. 39, 1517–1524.
- Eilbeck, K., et al. (2005). The Sequence Ontology. Genome Biol. 6, R44.
- Ioannidis, N. M., et al. (2016). REVEL. Am. J. Hum. Genet. 99, 877–885.