AlphaMissense Coverage Gap in Major Mendelian Disease Genes via dbNSFP v4 / MyVariant.info: 100% of 628 NOTCH1 ClinVar Variants and 100% of 70 DSPP Variants Lack AM Scores, With 21 Additional Disease Genes Showing >20% AM-Missing Rate (BMPR1A 91.9%, CTC1 86.2%, B9D1 71.4%, PC 70.5%, MPV17 48.3%) — A Predictor-Coverage Failure Mode Affecting 4,677 of 268,024 ClinVar Missense Single-Nucleotide Variants (1.74% Aggregate but Concentrated in 23 Specific Disease Genes)
Abstract
We characterize the per-gene AlphaMissense (AM; Cheng et al. 2023) score-coverage gap in ClinVar (Landrum et al. 2018) missense single-nucleotide variants (SNVs), with AM scores delivered through dbNSFP v4 (Liu et al. 2020) annotations via MyVariant.info (Wu et al. 2021). For each variant we extract dbnsfp.aa.ref, dbnsfp.aa.alt, and dbnsfp.genename, and check whether dbnsfp.alphamissense.score is present. Stop-gain variants (alt = X) are excluded. In aggregate, 4,677 of 268,024 ClinVar missense SNVs (1.74%) have no AM score in the dbNSFP-via-MyVariant.info pipeline. The aggregate rate is small, but the missingness is highly concentrated in specific disease genes:
| Gene | Total ClinVar variants | Pathogenic | Benign | AM-missing | AM-missing rate |
|---|---|---|---|---|---|
| NOTCH1 | 628 | 29 | 599 | 628 | 100.00% |
| DSPP | 70 | 8 | 62 | 70 | 100.00% |
| CCDC39 | 30 | 3 | 27 | 28 | 93.33% |
| BMPR1A | 62 | 33 | 29 | 57 | 91.94% |
| CTC1 | 58 | 16 | 42 | 50 | 86.21% |
| B9D1 | 49 | 5 | 44 | 35 | 71.43% |
| PC | 88 | 28 | 60 | 62 | 70.45% |
| IKBKB | 39 | 1 | 38 | 20 | 51.28% |
| MPV17 | 60 | 14 | 46 | 29 | 48.33% |
| MED25 | 41 | 8 | 33 | 19 | 46.34% |
| TXNRD2 | 38 | 0 | 38 | 17 | 44.74% |
| DST | 130 | 12 | 118 | 57 | 43.85% |
| TMEM173 | 35 | 10 | 25 | 14 | 40.00% |
| ZFHX4 | 43 | 0 | 43 | 16 | 37.21% |
| WT1 | 117 | 53 | 64 | 43 | 36.75% |
| DGUOK | 33 | 18 | 15 | 12 | 36.36% |
| IVD | 73 | 58 | 15 | 22 | 30.14% |
| DDX41 | 57 | 13 | 44 | 17 | 29.82% |
| POT1 | 138 | 24 | 114 | 34 | 24.64% |
| CLN5 | 41 | 21 | 20 | 10 | 24.39% |
| DNAH14 | 58 | 0 | 58 | 14 | 24.14% |
| YARS | 37 | 15 | 22 | 8 | 21.62% |
| BBS1 | 30 | 8 | 22 | 6 | 20.00% |
The 23 listed disease genes all have ≥30 ClinVar variants and ≥20% AM-missing rate. Notably:
- NOTCH1: 100% missing (628 variants in a major Mendelian disease gene — CADASIL, T-cell leukemia, Adams-Oliver syndrome, congenital heart disease, aortic valve disease).
- DSPP: 100% missing (dentinogenesis imperfecta).
- BMPR1A: 91.94% missing (juvenile polyposis syndrome).
- WT1: 36.75% missing (Wilms tumor; Frasier; Denys-Drash syndrome).
- POT1: 24.64% missing (familial melanoma; cardiac angiosarcoma).
For these 23 genes, AM cannot be used as a primary variant-prioritization tool because the coverage gap is too large. Variant-prioritization pipelines that depend on AM must either (a) have a backup predictor (REVEL, CADD, EVE) available for these genes, or (b) flag the genes as "AM-coverage-incomplete" and route them to alternative interpretation workflows. The aggregate 1.74% AM-missing rate substantially understates the operational impact because the missingness is concentrated in specific high-clinical-impact genes rather than uniformly distributed.
1. Background
AlphaMissense (Cheng et al. 2023) is among the most widely deployed missense variant-effect predictors as of 2024. It is delivered to clinical variant-prioritization pipelines primarily through the dbNSFP v4 (Liu et al. 2020) database, which is queryable via MyVariant.info (Wu et al. 2021).
Variant-prioritization pipelines typically assume AM coverage is approximately complete for the human proteome — variants for which AM is missing are treated as edge cases. This paper challenges that assumption by quantifying the per-gene AM coverage gap.
The result identifies 23 specific disease genes where the AM coverage gap is severe (>20% missing rate, with several at 100%). For these genes, variant-prioritization pipelines must use alternative predictors or workflows.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info, with dbNSFP v4 annotation.
- For each variant: extract dbnsfp.aa.ref, dbnsfp.aa.alt, dbnsfp.genename (first if multi-gene).
- Exclude stop-gain (alt = X) and same-AA records.
After filtering: 268,024 ClinVar missense SNVs.
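The filtering step above can be sketched as follows. This is a minimal illustration, not the contents of analyze.js: the helper names `first` and `toMissenseRecord` are hypothetical, and the sketch assumes that dbNSFP fields in a MyVariant.info record may arrive either as a scalar or as an array.

```javascript
// Assumption: a dbNSFP field may be a scalar or an array (for example one
// entry per transcript or gene); in the array case we keep the first entry.
function first(value) {
  return Array.isArray(value) ? value[0] : value;
}

// Returns { gene, ref, alt } for a missense record, or null when the record
// is dropped (no amino-acid annotation, stop-gain, or same-AA).
function toMissenseRecord(hit) {
  const dbnsfp = hit && hit.dbnsfp;
  if (!dbnsfp) return null;
  const aa = first(dbnsfp.aa);
  if (!aa) return null;
  const ref = first(aa.ref);
  const alt = first(aa.alt);
  const gene = first(dbnsfp.genename); // "first if multi-gene"
  if (!ref || !alt || !gene) return null;
  if (alt === 'X') return null; // stop-gain
  if (ref === alt) return null; // same-AA record
  return { gene, ref, alt };
}
```

Records for which the function returns null are excluded before any counting takes place.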
2.2 AM-missing classification
A variant is AM-missing if dbnsfp.alphamissense.score is not present in the MyVariant.info response (i.e., AM did not score this variant in the dbNSFP cache).
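A minimal sketch of this presence check is shown here. The function names are hypothetical, and the handling of array-valued or empty fields is an assumption about how the annotation may be shaped rather than documented behaviour.

```javascript
// Normalise undefined/null to [], a scalar to [scalar], and leave arrays as-is.
function asArray(value) {
  return value === undefined || value === null ? [] : [].concat(value);
}

// A variant is AM-missing when no dbnsfp.alphamissense.score value is present.
// Assumption: an absent block, a null score, and an empty score array all
// count as "missing"; a score of 0 counts as present.
function isAmMissing(hit) {
  const entries = asArray(hit && hit.dbnsfp && hit.dbnsfp.alphamissense);
  const scores = entries.flatMap((entry) => asArray(entry && entry.score));
  return scores.length === 0;
}
```

Testing for presence rather than truthiness matters here, because a legitimate score of 0 would otherwise be misread as missing.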
2.3 Per-gene tabulation
For each gene, count:
- tot = total ClinVar missense SNVs.
- missAM = subset with AM missing.
- AM-missing rate = missAM / tot.
Restrict to genes with ≥ 30 variants AND ≥ 20% missing rate for the per-gene reporting.
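The tabulation and thresholding can be sketched as below. The function name and the input shape (`{ gene, amMissing }` per variant) are assumptions made for illustration; the counters reuse the names tot and missAM from the text.

```javascript
// records: array of { gene: string, amMissing: boolean }, one per missense SNV.
function tabulateByGene(records, minVariants = 30, minMissingRate = 0.20) {
  const counts = new Map();
  for (const { gene, amMissing } of records) {
    const c = counts.get(gene) || { tot: 0, missAM: 0 };
    c.tot += 1;
    if (amMissing) c.missAM += 1;
    counts.set(gene, c);
  }
  return [...counts.entries()]
    .map(([gene, { tot, missAM }]) => ({ gene, tot, missAM, rate: missAM / tot }))
    // Both thresholds are inclusive, so a gene at exactly 30 variants and
    // exactly 20% missing is reported.
    .filter((row) => row.tot >= minVariants && row.rate >= minMissingRate)
    .sort((a, b) => b.rate - a.rate);
}
```

Because the thresholds are inclusive, a gene with 6 of 30 variants missing (20.00%) still appears in the output.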
2.4 Aggregate vs concentrated
Compute the aggregate AM-missing rate across all variants and contrast with the per-gene-concentrated rates.
3. Results
3.1 Aggregate coverage
- Total ClinVar missense SNVs: 268,024.
- AM-missing: 4,677 (1.74%).
- REVEL-missing: 10,136 (3.78%, for context).
- Both AM and REVEL missing: 940 (0.35%).
The aggregate AM coverage is high (98.3% of variants have AM scores). The aggregate metric suggests AM is broadly applicable.
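The aggregate counts above could be produced by a single pass over the filtered records, as in this sketch. The field path dbnsfp.revel.score is an assumption made by analogy with dbnsfp.alphamissense.score, and the function names are hypothetical.

```javascript
// True when the named dbNSFP predictor block carries at least one score.
function hasScore(hit, predictor) {
  const block = hit && hit.dbnsfp && hit.dbnsfp[predictor];
  const entries = block === undefined || block === null ? [] : [].concat(block);
  return entries.some((e) =>
    e && e.score !== undefined && e.score !== null &&
    !(Array.isArray(e.score) && e.score.length === 0));
}

// Counts variants lacking AM, lacking REVEL, and lacking both.
function aggregateMissingness(hits) {
  const out = { total: 0, amMissing: 0, revelMissing: 0, bothMissing: 0 };
  for (const hit of hits) {
    const noAm = !hasScore(hit, 'alphamissense');
    const noRevel = !hasScore(hit, 'revel'); // assumed field: dbnsfp.revel.score
    out.total += 1;
    if (noAm) out.amMissing += 1;
    if (noRevel) out.revelMissing += 1;
    if (noAm && noRevel) out.bothMissing += 1;
  }
  return out;
}

// Formats a count as a percentage of the total with two decimals.
const pct = (n, total) => (100 * n / total).toFixed(2) + '%';
```

For example, `pct(4677, 268024)` formats the AM-missing count reported above as a percentage of all missense SNVs.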
3.2 The 23-gene high-missingness subset
The 23 genes with ≥30 variants and ≥20% AM-missing rate are listed in the table in the Abstract. Two genes have a 100% AM-missing rate: NOTCH1 (628 variants) and DSPP (70 variants).
Summing the table, the 23 genes account for 1,268 of the 4,677 AM-missing variants (27.1%) while contributing only 1,955 of the 268,024 total variants (0.73%). Their pooled AM-missing rate (64.9%) is roughly 37× the global rate of 1.74%.
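The concentration calculation can be recomputed directly from the per-gene table in the Abstract. In the sketch below the rows are transcribed from that table as [gene, total, AM-missing]; the variable names are illustrative only.

```javascript
// [gene, total ClinVar missense SNVs, AM-missing], from the Abstract table.
const rows = [
  ['NOTCH1', 628, 628], ['DSPP', 70, 70], ['CCDC39', 30, 28], ['BMPR1A', 62, 57],
  ['CTC1', 58, 50], ['B9D1', 49, 35], ['PC', 88, 62], ['IKBKB', 39, 20],
  ['MPV17', 60, 29], ['MED25', 41, 19], ['TXNRD2', 38, 17], ['DST', 130, 57],
  ['TMEM173', 35, 14], ['ZFHX4', 43, 16], ['WT1', 117, 43], ['DGUOK', 33, 12],
  ['IVD', 73, 22], ['DDX41', 57, 17], ['POT1', 138, 34], ['CLN5', 41, 10],
  ['DNAH14', 58, 14], ['YARS', 37, 8], ['BBS1', 30, 6],
];
const ALL_TOTAL = 268024;  // all ClinVar missense SNVs (Section 3.1)
const ALL_MISSING = 4677;  // all AM-missing variants (Section 3.1)

const subsetTotal = rows.reduce((sum, r) => sum + r[1], 0);
const subsetMissing = rows.reduce((sum, r) => sum + r[2], 0);

// Fold-enrichment: pooled missing rate in the subset over the global rate.
const foldEnrichment = (subsetMissing / subsetTotal) / (ALL_MISSING / ALL_TOTAL);

console.log({
  subsetTotal,
  subsetMissing,
  shareOfMissing: subsetMissing / ALL_MISSING,
  shareOfTotal: subsetTotal / ALL_TOTAL,
  foldEnrichment,
});
```

The result depends only on the table rows and the two aggregate counts from Section 3.1, so it can be rerun whenever the table is regenerated.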
3.3 The NOTCH1 case (628 variants, 100% missing)
NOTCH1 is one of the major Mendelian disease genes:
- CADASIL (cerebral autosomal-dominant arteriopathy with subcortical infarcts and leukoencephalopathy) — most variants in NOTCH3, but NOTCH1 also implicated.
- T-cell acute lymphoblastic leukemia — NOTCH1 activating mutations.
- Adams-Oliver syndrome — NOTCH1 loss-of-function variants.
- Congenital heart disease — NOTCH1 variants.
- Aortic valve disease — NOTCH1 variants.
100% of 628 NOTCH1 ClinVar variants in our dataset have no AM score. The mechanism may be:
- NOTCH1's UniProt accession (P46531) was excluded from the dbNSFP v4 AM coverage despite being a canonical _HUMAN entry.
- A specific dbNSFP version-update schedule has not yet integrated AM scores for NOTCH1.
- AM's training pipeline excluded NOTCH1's specific protein architecture (multiple EGF-like domains, NRR repeats) for a reason not documented.
For variant-prioritization pipelines: NOTCH1 variants cannot be scored by AM via the standard dbNSFP / MyVariant.info pipeline. Alternative annotations (direct AlphaMissense score downloads from Cheng et al. 2023's supplementary data) may be needed.
3.4 The DSPP case (70 variants, 100% missing)
DSPP (dentin sialophosphoprotein) is the major dentinogenesis imperfecta gene. The protein contains a long highly-repetitive serine-rich phosphorylated region (DPP/DSP cleavage products) that AM may have excluded due to its low-complexity sequence.
100% of 70 DSPP variants have no AM score. Variant-prioritization for DSPP must use REVEL or other predictors.
3.5 The BMPR1A case (62 variants, 91.94% missing)
BMPR1A (bone morphogenetic protein receptor type 1A) is the major juvenile polyposis syndrome gene. 91.94% (57 of 62) of BMPR1A ClinVar variants have no AM score. This is striking given that BMPR1A is a receptor of the TGF-β superfamily with a well-characterized structure.
3.6 The cluster of moderate-missingness genes (40-90%)
Several disease genes have moderate AM-missingness (40-90% of variants missing):
- CTC1 (86.2%): dyskeratosis congenita, telomere maintenance.
- B9D1 (71.4%): Joubert / Meckel syndromes, ciliopathies.
- PC (70.5%): pyruvate carboxylase deficiency.
- IKBKB (51.3%): immunodeficiency, ectodermal dysplasia.
- MPV17 (48.3%): mitochondrial DNA depletion syndrome.
- MED25 (46.3%): Charcot-Marie-Tooth disease type 2B2.
- TXNRD2 (44.7%): familial glucocorticoid deficiency.
- DST (43.9%): epidermolysis bullosa simplex.
These are all important Mendelian disease genes where AM coverage is incomplete. Combined, the 21 listed genes other than NOTCH1 and DSPP account for 570 AM-missing variants (summing the table in the Abstract), all in clinically actionable disease genes.
3.7 The Pathogenic-fraction within the AM-missing genes is heterogeneous
Of the 23 genes, the Pathogenic-fractions vary widely:
- Higher-Pathogenic genes (about half or more P): IVD (79% P), DGUOK (55%), BMPR1A (53%), CLN5 (51%), WT1 (45%); MPV17 is lower (23% P, but within a specific mitochondrial-disease subset).
- Low-Pathogenic genes (mostly B): NOTCH1 (5% P), DSPP (11% P), POT1 (17% P), DST (9% P), DNAH14 (0% P).
For high-P genes (BMPR1A, IVD, WT1), the AM coverage gap is most clinically consequential — AM cannot help triage Pathogenic variants in these genes.
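The Pathogenic fractions quoted above follow from the Pathogenic and Benign columns of the Abstract table. A minimal sketch of the arithmetic, with rows transcribed from that table and hypothetical variable names:

```javascript
// [gene, Pathogenic, Benign], transcribed from the Abstract table (subset).
const labelCounts = [
  ['IVD', 58, 15], ['DGUOK', 18, 15], ['BMPR1A', 33, 29], ['CLN5', 21, 20],
  ['WT1', 53, 64], ['MPV17', 14, 46], ['POT1', 24, 114], ['DSPP', 8, 62],
  ['DST', 12, 118], ['NOTCH1', 29, 599], ['DNAH14', 0, 58],
];

// Pathogenic share of labelled variants, rounded to a whole percent.
const pathogenicPct = (p, b) => Math.round((100 * p) / (p + b));

const fractions = labelCounts
  .map(([gene, p, b]) => ({ gene, pct: pathogenicPct(p, b) }))
  .sort((x, y) => y.pct - x.pct);

for (const { gene, pct } of fractions) console.log(`${gene}\t${pct}% P`);
```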
3.8 Implications for variant-prioritization
The aggregate 1.74% AM-missing rate substantially understates the operational impact because the missingness is concentrated in 23 specific disease genes. For variant-prioritization pipelines that depend on AM:
- NOTCH1, DSPP, BMPR1A, CTC1, B9D1: AM cannot be used. Alternative predictors (REVEL, CADD, EVE) must be the primary tool.
- The other 18 listed genes (20-93% missing): AM can be used selectively but should not be the sole predictor.
- Non-listed genes: AM coverage is approximately complete in aggregate (98%+), although individual genes below the reporting thresholds may still have gaps (§4.4).
The per-gene AM-coverage table is a precomputable feature that should be consulted before clinical variant-prioritization decisions.
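One way a pipeline might consult such a table is sketched below. The tier names and the cut-offs `unusableAt` and `cautionAt` are hypothetical choices for illustration; the text does not prescribe exact thresholds, so they should be set per pipeline.

```javascript
// coverageTable: Map from gene name to { rate }, where rate is the
// per-gene AM-missing rate (missAM / tot).
// The default cut-offs are illustrative assumptions, not values from the paper.
function amUsage(gene, coverageTable, { unusableAt = 0.70, cautionAt = 0.20 } = {}) {
  const row = coverageTable.get(gene);
  if (!row) return 'am-primary';                 // gene not in the high-missingness table
  if (row.rate >= unusableAt) return 'alternative-predictor';
  if (row.rate >= cautionAt) return 'am-with-backup';
  return 'am-primary';
}

// Example table built from three rows of the Abstract table.
const coverage = new Map([
  ['NOTCH1', { rate: 628 / 628 }],
  ['WT1', { rate: 43 / 117 }],
  ['BBS1', { rate: 6 / 30 }],
]);
console.log(amUsage('NOTCH1', coverage)); // routed away from AM
console.log(amUsage('WT1', coverage));    // AM plus a backup predictor
```

Keeping the decision per gene rather than per variant means the routing is known before any variant in that gene is scored.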
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 The AM-missing rate is via dbNSFP / MyVariant.info pipeline
AM scores may be available from direct AlphaMissense downloads (Cheng et al. 2023's supplementary data) even when the dbNSFP / MyVariant.info pipeline returns no score. The 1.74% aggregate AM-missing rate is specific to the dbNSFP / MyVariant.info delivery channel, which is the dominant deployment path for clinical variant-prioritization.
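To spot-check what this delivery channel returns for a single variant, a query against the MyVariant.info v1 variant endpoint can be built as in the sketch below. The helper names are hypothetical, the HGVS identifier in the usage line is a placeholder rather than a real ClinVar variant, and `fetchAmBlock` assumes a runtime with a global `fetch` (Node.js 18 or later).

```javascript
const DEFAULT_FIELDS = [
  'dbnsfp.genename',
  'dbnsfp.aa.ref',
  'dbnsfp.aa.alt',
  'dbnsfp.alphamissense.score',
];

// Builds the query URL for one variant, restricted to the listed fields.
function buildVariantUrl(hgvsId, fields = DEFAULT_FIELDS) {
  const url = new URL('https://myvariant.info/v1/variant/' + encodeURIComponent(hgvsId));
  url.searchParams.set('fields', fields.join(','));
  return url.toString();
}

// Fetches the record and returns its dbnsfp.alphamissense block, or null
// when the response carries none (the "AM-missing" case for this channel).
async function fetchAmBlock(hgvsId) {
  const response = await fetch(buildVariantUrl(hgvsId));
  if (!response.ok) return null;
  const record = await response.json();
  return (record.dbnsfp && record.dbnsfp.alphamissense) || null;
}

// Placeholder identifier, for illustration only.
console.log(buildVariantUrl('chr1:g.100000A>G'));
```

A null result from this channel does not by itself show that AlphaMissense never scored the variant, only that the dbNSFP / MyVariant.info record carries no score.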
4.3 The reasons for AM missingness are not documented
The dbNSFP and MyVariant.info documentation does not explicitly explain why specific genes (NOTCH1, DSPP) are missing AM scores. Possible causes: (a) protein-architecture-specific exclusions in AM training; (b) UniProt canonical-isoform mapping issues; (c) dbNSFP version-update schedule. We do not adjudicate the cause here.
4.4 The ≥30-variant + ≥20% missing-rate threshold is conservative
Many additional genes have lower variant counts or lower missing rates and would extend the per-gene table. The 23-gene reporting captures the most-impactful missingness cases.
4.5 ClinVar curator labels are not used in the missingness analysis
The AM-missing classification is independent of ClinVar's Pathogenic / Benign labels. The per-gene Pathogenic-fractions are reported for descriptive context but do not affect the missingness calculation.
4.6 The per-gene-name resolution may have ambiguities
We use dbnsfp.genename (first if multi-gene). Multi-gene loci (overlapping genes) may have variants assigned to the alphabetically-first gene name, slightly affecting per-gene counts.
4.7 The 23-gene list is a subset of all impacted genes
Other disease genes (e.g., paralogs of NOTCH1 such as NOTCH2/3/4) may have similar issues. We focus on the 23 with ≥30 ClinVar variants for adequate sample size.
5. Implications
- AlphaMissense has a 1.74% aggregate missing rate in dbNSFP v4 / MyVariant.info-delivered ClinVar missense annotation, but the missingness is concentrated in 23 specific disease genes with ≥20% per-gene missing rate.
- NOTCH1 (628 variants) and DSPP (70 variants) have a 100% AM-missing rate, so AM is unusable as a primary variant-prioritization tool for these genes.
- The 21 additional clinically important genes (BMPR1A, CTC1, B9D1, PC, MPV17, MED25, WT1, POT1, etc.) have substantial coverage gaps requiring alternative predictors.
- The per-gene AM-coverage table is a precomputable metadata feature that should be consulted before clinical variant-prioritization decisions.
- The aggregate metric understates operational impact because missingness is concentrated in clinically-actionable genes rather than uniformly distributed.
6. Limitations
- Stop-gain excluded (§4.1).
- AM coverage measured via dbNSFP / MyVariant.info specifically (§4.2); other delivery channels may have different coverage.
- Reasons for AM missingness are not documented (§4.3) — we report what but not why.
- ≥ 30-variant + ≥ 20% missing-rate thresholds are conservative (§4.4).
- ClinVar labels not used in missingness analysis (§4.5).
- Gene-name resolution may have ambiguities for overlapping genes (§4.6).
- 23-gene list is a subset of all impacted genes (§4.7).
7. Reproducibility
- Script: analyze.js (Node.js, ~30 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with aggregate AM/REVEL missingness and the per-gene missingness table for the 23 high-missingness genes.
- Verification mode: 5 machine-checkable assertions: (a) aggregate AM-missing rate ≈ 1-3%; (b) NOTCH1 AM-missing rate = 100%; (c) DSPP AM-missing rate = 100%; (d) ≥ 20 genes with ≥ 20% AM-missing rate; (e) total variants > 250,000.
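The five assertions might be expressed along the following lines. This is a sketch only: the shape assumed for result.json (`total`, `amMissing`, and a `genes` array of `{ gene, tot, missAM }`) is hypothetical, and the actual verification code in analyze.js may differ.

```javascript
// result: { total, amMissing, genes: [{ gene, tot, missAM }, ...] }  (assumed shape)
function verify(result) {
  const rate = (g) => g.missAM / g.tot;
  const byName = new Map(result.genes.map((g) => [g.gene, g]));
  const fullyMissing = (name) => byName.has(name) && rate(byName.get(name)) === 1;
  const aggregate = result.amMissing / result.total;
  return {
    a_aggregateRate1to3pct: aggregate >= 0.01 && aggregate <= 0.03,
    b_notch1AllMissing: fullyMissing('NOTCH1'),
    c_dsppAllMissing: fullyMissing('DSPP'),
    d_atLeast20Genes: result.genes.filter((g) => rate(g) >= 0.20).length >= 20,
    e_totalOver250k: result.total > 250000,
  };
}
```

Returning one named boolean per assertion makes it straightforward to report which check failed instead of stopping at the first failure.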

    node analyze.js
    node analyze.js --verify

8. References
- Cheng, J., et al. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492.
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Joutel, A., et al. (1996). Notch3 mutations in CADASIL. Nature 383, 707–710.
- Karczewski, K. J., et al. (2020). gnomAD constraint spectrum. Nature 581, 434–443.
- Ioannidis, N. M., et al. (2016). REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885.
- Pejaver, V., et al. (2022). Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations. Am. J. Hum. Genet. 109, 2163–2177.
- Adam, M. P., et al. (2022). GeneReviews. University of Washington, Seattle. (Disease-gene reference.)
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.