Asymmetric Predictor-Coverage Selection Bias in dbNSFP-Delivered ClinVar Annotations: REVEL-Only-Coverage Variants Have a 48.41% Pathogenic-Fraction (1,809 of 3,737; Wilson 95% CI [46.81, 50.01]) — 4.36× Higher Than AlphaMissense-Only-Coverage Variants at 11.09% (1,020 of 9,196; [10.47, 11.75]) — Documenting That REVEL Specifically Covers Clinically-Actionable Disease Genes (NOTCH1, BMPR1A, WT1) Missed by AlphaMissense
Abstract
We characterize the per-variant predictor-coverage selection bias in the dbNSFP v4 (Liu et al. 2020) / MyVariant.info (Wu et al. 2021) ClinVar (Landrum et al. 2018) annotation pipeline. Each missense single-nucleotide variant (alt = X excluded; same-AA excluded) is classified by whether an AlphaMissense (AM; Cheng et al. 2023) score and a REVEL (Ioannidis et al. 2016) score are present in the dbNSFP-delivered annotation, producing a 4-cell coverage matrix:
| Cell | N | % of total | Pathogenic | Benign | P-fraction | Wilson 95% CI |
|---|---|---|---|---|---|---|
| Both AM and REVEL missing | 940 | 0.35% | 257 | 683 | 27.34% | [24.59, 30.28] |
| AM-only (REVEL missing) | 9,196 | 3.43% | 1,020 | 8,176 | 11.09% | [10.47, 11.75] |
| REVEL-only (AM missing) | 3,737 | 1.39% | 1,809 | 1,928 | 48.41% | [46.81, 50.01] |
| Both present | 254,151 | 94.82% | 73,908 | 180,243 | 29.08% | [28.90, 29.26] |
Result: a 4.36× asymmetry between the two single-coverage cells. REVEL-only-coverage variants (REVEL present, AM missing) have a 48.41% Pathogenic-fraction, 1.66× the both-present rate of 29.08% (the pooled rate across all four cells is 28.73%, 76,994 of 268,024). AM-only-coverage variants (AM present, REVEL missing) have an 11.09% Pathogenic-fraction, 0.38× the both-present rate. The Wilson 95% CIs of the two cells do not overlap: the REVEL-only lower bound (46.81%) lies ~35 pp above the AM-only upper bound (11.75%).
Mechanism: the asymmetry reflects a predictor-coverage selection bias in which the two predictors' coverage gaps fall in different genes. The REVEL-only subset is dominated by NOTCH1 (628 variants), NEB (142), TTN (103), DSPP (70), BMPR1A (57), CTC1 (50), and WT1 (43), major Mendelian disease genes that AlphaMissense did not score in the dbNSFP delivery. The AM-only subset is dominated by ALMS1 (460 variants), GRIN2B (361), RECQL4 (165), SGSH (153), POLG (151), HNF1B (120), MAGEL2 (105), and MYH3 (99), which are also disease genes but with different curation patterns.
Implication: variants missing one predictor's score should not be assumed Benign by default, because the per-cell P-fraction depends strongly on which predictor is missing. For variant-prioritization pipelines that use ensemble methods, missingness of one predictor is itself informative about Pathogenicity: the missingness pattern encodes a prior.
1. Background
Modern variant-prioritization pipelines combine multiple per-variant predictors (AlphaMissense, REVEL, CADD, EVE, etc.) accessed through annotation databases like dbNSFP (Liu et al. 2020) via APIs like MyVariant.info (Wu et al. 2021). Predictor coverage is not uniform across variants: some variants have all predictors scored, some have only a subset.
Standard variant-prioritization pipelines implicitly assume that predictor scores are missing at random with respect to Pathogenicity. If that assumption holds, the per-variant prior on Pathogenicity does not depend on which predictors are missing.
This paper tests the assumption by computing the per-coverage-cell Pathogenic-fraction on 268,024 ClinVar missense variants. The result demonstrates that predictor missingness is far from random: variants with REVEL-only coverage have 4.36× the Pathogenic-fraction of variants with AM-only coverage. The missingness pattern itself encodes substantial prior information about Pathogenicity.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info, with dbNSFP v4 annotation.
- For each variant: extract `dbnsfp.aa.ref`, `dbnsfp.aa.alt`, `dbnsfp.alphamissense.score`, `dbnsfp.revel.score`, `dbnsfp.genename`.
- Exclude stop-gain (`alt = X`) and same-AA records.
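The two exclusion rules above can be sketched as a small predicate. This is a minimal illustration, not the paper's `analyze.js`: the record shape, the helper names `first` and `isMissense`, the scalar-or-array handling, and the choice to drop records without an amino-acid annotation are all assumptions.

```javascript
// Hypothetical helper: a field may arrive as a scalar or as an array (assumption).
function first(value) {
  return Array.isArray(value) ? value[0] : value;
}

// True when a record passes the Section 2.1 filters:
// an amino-acid change is annotated, alt is not "X" (stop-gain), and ref differs from alt.
function isMissense(record) {
  const aa = first(record?.dbnsfp?.aa);
  const ref = first(aa?.ref);
  const alt = first(aa?.alt);
  if (typeof ref !== "string" || typeof alt !== "string") return false; // no annotation (assumed dropped)
  if (alt === "X") return false; // stop-gain excluded
  if (ref === alt) return false; // same-AA excluded
  return true;
}

const demoRecords = [
  { dbnsfp: { aa: { ref: "R", alt: "W" } } }, // missense, kept
  { dbnsfp: { aa: { ref: "Q", alt: "X" } } }, // stop-gain, dropped
  { dbnsfp: { aa: { ref: "L", alt: "L" } } }, // same-AA, dropped
  {},                                         // no dbNSFP annotation, dropped
];
console.log(demoRecords.map(isMissense));
```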
2.2 Predictor-coverage classification
Each variant is classified by predictor-presence into one of 4 cells:
- Both AM and REVEL missing: rare, only 0.35% of variants.
- AM-only: AM present, REVEL missing (3.43%).
- REVEL-only: REVEL present, AM missing (1.39%).
- Both present: standard case (94.82%).
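The four-way classification above amounts to two presence checks. The sketch below assumes a score counts as present when the field holds at least one finite number; the function names and the cell labels (`am_only`, `revel_only`, and so on) are illustrative, not taken from the paper's script.

```javascript
// A score is treated as present if the field (scalar or array) holds a finite number.
function hasScore(value) {
  const values = Array.isArray(value) ? value : [value];
  return values.some((v) => typeof v === "number" && Number.isFinite(v));
}

// Map a record to one of the four coverage cells.
function coverageCell(record) {
  const am = hasScore(record?.dbnsfp?.alphamissense?.score);
  const revel = hasScore(record?.dbnsfp?.revel?.score);
  if (am && revel) return "both_present";
  if (am) return "am_only";
  if (revel) return "revel_only";
  return "both_missing";
}

console.log(coverageCell({ dbnsfp: { alphamissense: { score: 0.91 } } }));
console.log(coverageCell({ dbnsfp: { revel: { score: [0.42, null] } } }));
```

If the delivered scores arrive as strings rather than numbers, `hasScore` would need a parsing step before the finiteness check.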
2.3 Per-cell Pathogenicity tabulation
Per cell, count Pathogenic and Benign variants. Compute Pathogenic-fraction with Wilson 95% CI (Brown et al. 2001).
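For reference, the Wilson score interval can be computed in a few lines. The sketch below uses the standard closed form with z = 1.959964 for a 95% interval; the function name `wilson` and its return shape are illustrative choices.

```javascript
// Wilson score interval for a binomial proportion (successes out of n).
function wilson(successes, n, z = 1.959964) {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { p, low: center - half, high: center + half };
}

// Single-coverage cells from the coverage matrix.
for (const [name, pathogenic, total] of [
  ["REVEL-only", 1809, 3737],
  ["AM-only", 1020, 9196],
]) {
  const { p, low, high } = wilson(pathogenic, total);
  const pct = (x) => (100 * x).toFixed(2);
  console.log(`${name}: ${pct(p)}% [${pct(low)}, ${pct(high)}]`);
}
```

Applied to the counts in the matrix, this should reproduce the reported intervals to two decimal places.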
2.4 Per-cell gene composition
For each single-coverage cell (AM-only and REVEL-only), tabulate the top 15 contributing genes to characterize the per-cell selection bias.
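The per-cell gene tabulation is a group-and-count. A minimal sketch follows, assuming an upstream step has produced rows of the form `{ gene, cell }`; that row shape, the function name `topGenes`, and the alphabetical tie-break are assumptions.

```javascript
// Count variants per gene within one coverage cell and return the top k genes.
function topGenes(rows, cell, k = 15) {
  const counts = new Map();
  for (const row of rows) {
    if (row.cell !== cell || !row.gene) continue;
    counts.set(row.gene, (counts.get(row.gene) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0])) // count descending, then gene name
    .slice(0, k)
    .map(([gene, n]) => ({ gene, n }));
}

// Toy input for illustration only (not real counts).
const toyRows = [
  { gene: "NOTCH1", cell: "revel_only" },
  { gene: "NOTCH1", cell: "revel_only" },
  { gene: "WT1", cell: "revel_only" },
  { gene: "ALMS1", cell: "am_only" },
];
console.log(topGenes(toyRows, "revel_only", 2));
```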
3. Results
3.1 The 4-cell predictor-coverage matrix
| Cell | N | % of 268,024 | Pathogenic | Benign | P-fraction | Wilson 95% CI |
|---|---|---|---|---|---|---|
| Both missing | 940 | 0.35% | 257 | 683 | 27.34% | [24.59, 30.28] |
| AM-only | 9,196 | 3.43% | 1,020 | 8,176 | 11.09% | [10.47, 11.75] |
| REVEL-only | 3,737 | 1.39% | 1,809 | 1,928 | 48.41% | [46.81, 50.01] |
| Both present | 254,151 | 94.82% | 73,908 | 180,243 | 29.08% | [28.90, 29.26] |
The 94.82% of variants with both predictors present have a P-fraction (29.08%) close to the global rate of 28.73% (76,994 Pathogenic of 268,024). The 0.35% with both missing have a 27.34% P-fraction whose Wilson 95% CI [24.59, 30.28] contains the global rate, so this cell shows no detectable bias. The two single-coverage cells exhibit strong opposing biases: AM-only at 11.09% (depressed); REVEL-only at 48.41% (elevated).
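As a consistency check, the per-cell fractions and the pooled rate can be recomputed directly from the counts in the table. The object layout and variable names below are illustrative.

```javascript
// Counts copied from the 4-cell coverage matrix.
const cells = {
  both_missing: { pathogenic: 257, benign: 683 },
  am_only: { pathogenic: 1020, benign: 8176 },
  revel_only: { pathogenic: 1809, benign: 1928 },
  both_present: { pathogenic: 73908, benign: 180243 },
};

let totalPathogenic = 0;
let totalVariants = 0;
for (const [name, c] of Object.entries(cells)) {
  const n = c.pathogenic + c.benign;
  totalPathogenic += c.pathogenic;
  totalVariants += n;
  console.log(name, n, ((100 * c.pathogenic) / n).toFixed(2) + "%");
}
// Pooled Pathogenic-fraction across all four cells.
console.log("pooled", totalVariants, ((100 * totalPathogenic) / totalVariants).toFixed(2) + "%");
```

The pooled figure comes out at 76,994 of 268,024, or 28.73%, slightly below the both-present cell's 29.08%.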
3.2 The 4.36× asymmetry between single-coverage cells
- REVEL-only / AM-only P-fraction ratio: 48.41% / 11.09% = 4.36×.
- Gap: 48.41 − 11.09 = 37.32 percentage points.
- Wilson 95% CIs do not overlap: the REVEL-only lower bound exceeds the AM-only upper bound by 46.81 − 11.75 = 35.06 pp (~35 pp).
This is the largest per-cell P-fraction asymmetry observed in our analysis of predictor metadata.
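The three figures above follow from the cell counts and the reported interval bounds. The sketch below computes the ratio from raw counts rather than from the rounded percentages, since dividing 48.41 by 11.09 directly rounds to 4.37 rather than 4.36; variable names are illustrative.

```javascript
// Raw counts and reported Wilson bounds (in percent) for the two single-coverage cells.
const revelOnly = { pathogenic: 1809, n: 3737, ciLow: 46.81, ciHigh: 50.01 };
const amOnly = { pathogenic: 1020, n: 9196, ciLow: 10.47, ciHigh: 11.75 };

const pRevel = (100 * revelOnly.pathogenic) / revelOnly.n;
const pAm = (100 * amOnly.pathogenic) / amOnly.n;

const ratio = pRevel / pAm;                             // fold difference
const gapPp = pRevel - pAm;                             // point-estimate gap, percentage points
const ciSeparationPp = revelOnly.ciLow - amOnly.ciHigh; // distance between the intervals

console.log(ratio.toFixed(2), gapPp.toFixed(2), ciSeparationPp.toFixed(2));
// prints: 4.36 37.32 35.06
```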
3.3 The REVEL-only subset gene composition
Top genes in the REVEL-only-coverage subset (14 listed):
| Gene | REVEL-only N | Disease association |
|---|---|---|
| NOTCH1 | 628 | T-cell ALL, Adams-Oliver syndrome, congenital heart disease |
| NEB | 142 | Nemaline myopathy |
| TTN | 103 | Cardiomyopathy, muscular dystrophy |
| DSPP | 70 | Dentinogenesis imperfecta |
| PC | 62 | Pyruvate carboxylase deficiency |
| BMPR1A | 57 | Juvenile polyposis |
| DST | 57 | Epidermolysis bullosa |
| OBSCN | 52 | Cardiomyopathy |
| CTC1 | 50 | Dyskeratosis congenita |
| WT1 | 43 | Wilms tumor, Frasier, Denys-Drash |
| MPV17 | 29 | Mitochondrial DNA depletion |
| CCDC39 | 28 | Primary ciliary dyskinesia |
| BSND | 25 | Bartter syndrome |
| CPAMD8 | 24 | Anterior segment dysgenesis |
The REVEL-only subset is dominated by major Mendelian disease genes where AM has zero or near-zero coverage. NOTCH1 alone accounts for 628 / 3,737 = 16.8% of the REVEL-only cell. The high P-fraction (48.41%) reflects that disease-gene variants are heavily curated as Pathogenic.
3.4 The AM-only subset gene composition
Top 15 genes in the AM-only-coverage subset:
| Gene | AM-only N | Disease association |
|---|---|---|
| ALMS1 | 460 | Alström syndrome |
| GRIN2B | 361 | Intellectual disability, autism |
| RECQL4 | 165 | Rothmund-Thomson syndrome |
| SGSH | 153 | Sanfilippo syndrome A |
| POLG | 151 | Mitochondrial DNA depletion |
| HNF1B | 120 | MODY5 |
| MAGEL2 | 105 | Schaaf-Yang syndrome |
| MYH3 | 99 | Distal arthrogryposis |
| EPPK1 | 93 | Plakin-family protein (epiplakin); no disease association given |
| ITGB3 | 92 | Glanzmann thrombasthenia |
| GPR179 | 89 | Congenital stationary night blindness |
| MSH6 | 82 | Lynch syndrome |
| CYBA | 78 | Chronic granulomatous disease |
| FRAS1 | 77 | Fraser syndrome |
| SZT2 | 75 | Epilepsy |
The AM-only subset also consists of disease genes, but with a different gene composition. The 11.09% P-fraction is depressed because these genes carry many Benign-curated population variants and comparatively few Pathogenic curations relative to gene size. ALMS1 contributes 460 / 9,196 = 5.0% of the AM-only cell; most of its AM-only variants are Benign, consistent with Alström syndrome being recessive with extensive population variation.
3.5 The selection-bias interpretation
The AM-only vs REVEL-only Pathogenic-fraction asymmetry reflects systematic differences in the disease genes covered by the two predictor pipelines:
- REVEL-only genes (NOTCH1, NEB, BMPR1A, WT1, etc.) are classical Mendelian disease genes with extensive Pathogenic variant curation. AM's coverage gap in these genes — for whatever reason (model architecture, training-data composition, dbNSFP delivery filter) — produces a REVEL-only subset that is dominated by disease-confirmed Pathogenic variants.
- AM-only genes (ALMS1, GRIN2B, RECQL4, etc.) are also disease genes, but the AM-only subset is dominated by population-frequency Benign variants in these genes that REVEL did not score for similar coverage-gap reasons.
The two single-coverage cells therefore reflect complementary asymmetries in the AM vs REVEL coverage profiles. Neither predictor's missing-data pattern is missing-at-random with respect to Pathogenicity.
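One way to put a number on the departure from missing-at-random is a Pearson chi-square test of independence between coverage cell and label. This test is not part of the reported analysis; it is an illustrative sketch applied to the counts from the coverage matrix, and the function name `chiSquare` is an assumption. With 3 degrees of freedom, the 5% critical value is about 7.815.

```javascript
// Pearson chi-square statistic for an r x c contingency table of observed counts.
function chiSquare(table) {
  const rowTotals = table.map((row) => row.reduce((a, b) => a + b, 0));
  const colTotals = table[0].map((_, j) => table.reduce((a, row) => a + row[j], 0));
  const total = rowTotals.reduce((a, b) => a + b, 0);
  let stat = 0;
  table.forEach((row, i) =>
    row.forEach((observed, j) => {
      const expected = (rowTotals[i] * colTotals[j]) / total;
      stat += (observed - expected) ** 2 / expected;
    })
  );
  return { stat, df: (table.length - 1) * (table[0].length - 1) };
}

// Rows: both missing, AM-only, REVEL-only, both present. Columns: Pathogenic, Benign.
const coverageTable = [
  [257, 683],
  [1020, 8176],
  [1809, 1928],
  [73908, 180243],
];
const chi = chiSquare(coverageTable);
console.log(`chi-square = ${chi.stat.toFixed(1)}, df = ${chi.df}`);
```

A statistic far above the critical value would be consistent with the conclusion that coverage cell and Pathogenicity label are not independent, although with samples this large the effect sizes in the matrix are more informative than the test itself.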
3.6 The implication: missingness is informative
For variant-prioritization pipelines, missingness of one predictor is informative about Pathogenicity:
- Variant with REVEL-only score: prior P-fraction 48.41% (1.66× the both-present rate).
- Variant with AM-only score: prior P-fraction 11.09% (0.38× the both-present rate).
- Variant with both predictors: prior 29.08% (close to global).
- Variant with both missing: prior 27.34% (close to global, but small N).
The missingness pattern itself encodes a ~4.4× spread in the Pathogenic prior (48.41% for REVEL-only vs 11.09% for AM-only). Naive ensemble methods that treat missing predictors as "no information" discard this signal.
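A pipeline could expose this signal as a lookup from the missingness pattern to an empirical prior. The sketch below is one hypothetical way to do so, using the cell counts from the coverage matrix; the names `CELL_COUNTS` and `coveragePrior` and the log-odds output are illustrative choices, and the priors are specific to the dbNSFP v4 / MyVariant.info delivery.

```javascript
// Cell counts from the coverage matrix (pipeline-specific).
const CELL_COUNTS = {
  both_missing: { pathogenic: 257, benign: 683 },
  am_only: { pathogenic: 1020, benign: 8176 },
  revel_only: { pathogenic: 1809, benign: 1928 },
  both_present: { pathogenic: 73908, benign: 180243 },
};

// Return the coverage cell, its empirical Pathogenic prior, and the prior's log-odds.
function coveragePrior(hasAM, hasREVEL) {
  const cell = hasAM && hasREVEL ? "both_present"
    : hasAM ? "am_only"
    : hasREVEL ? "revel_only"
    : "both_missing";
  const { pathogenic, benign } = CELL_COUNTS[cell];
  const prior = pathogenic / (pathogenic + benign);
  return { cell, prior, logOdds: Math.log(prior / (1 - prior)) };
}

console.log(coveragePrior(false, true)); // REVEL present, AM missing
console.log(coveragePrior(true, false)); // AM present, REVEL missing
```

Log-odds are returned alongside the raw prior because they combine additively with other evidence in logistic-style models; whether that suits a given ensemble is a design decision for the pipeline.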
3.7 The pipeline-specificity caveat
The reported coverage gaps and the per-cell Pathogenicity asymmetry are specific to the dbNSFP v4 / MyVariant.info delivery pipeline. Other delivery channels (direct AlphaMissense supplementary downloads from Cheng et al. 2023; UCSC tracks; etc.) may have different coverage gaps. The per-cell P-fractions are pipeline-specific.
4. Confound analysis
4.1 Stop-gain explicitly excluded
We exclude records with alt = X. Reported numbers are missense-only.
4.2 The coverage measurement is via dbNSFP / MyVariant.info pipeline
The 4-cell matrix reflects predictor scores delivered via this pipeline. Variants with scores in primary AM / REVEL data sources but not in the dbNSFP delivery are classified as missing in our pipeline-specific analysis.
4.3 The reasons for missingness vary
Predictor missingness may reflect: (a) source-data exclusions (e.g., AM excluded specific protein architectures from training); (b) dbNSFP version-update lag; (c) UniProt isoform mapping issues. We do not adjudicate the causes.
4.4 ClinVar curator labels are not gold-standard
Some labels are wrong. The reported per-cell Pathogenic-fractions reflect curator-assigned data.
4.5 The both-missing cell is small (n = 940)
The 0.35% both-missing cell has a wider Wilson 95% CI [24.59, 30.28], which nevertheless contains the global rate (28.73%).
4.6 The per-cell gene composition is pipeline-specific
The top 15 gene lists per cell are specific to the dbNSFP / MyVariant.info pipeline's coverage gaps. They may shift with pipeline updates.
4.7 The asymmetry direction is not predictable in advance
Without the empirical analysis, one would not predict a priori which predictor's missing-only subset would be Pathogenic-enriched vs depressed. The asymmetry direction is a per-pipeline characteristic.
5. Implications
- REVEL-only-coverage ClinVar missense variants have a 48.41% Pathogenic-fraction, 4.36× higher than AlphaMissense-only-coverage variants at 11.09%.
- The asymmetry is statistically robust (Wilson 95% CIs non-overlapping by ~35 pp).
- The mechanism is predictor-coverage selection bias: the REVEL-only subset is dominated by major Mendelian disease genes (NOTCH1, NEB, BMPR1A, WT1) where AM has zero or near-zero coverage; the AM-only subset is dominated by genes with extensive population-Benign variation that REVEL did not score.
- Predictor missingness is informative about Pathogenicity — naive ensemble methods that treat missing predictors as "no information" miss the missingness-pattern signal.
- For variant-prioritization pipelines: the per-cell coverage prior is precomputable from the missingness pattern and should be incorporated as a meta-feature.
6. Limitations
- Stop-gain excluded (§4.1).
- Coverage measurement is pipeline-specific to dbNSFP / MyVariant.info (§4.2).
- Reasons for missingness not adjudicated (§4.3).
- ClinVar labels not gold-standard (§4.4).
- Both-missing cell small (n = 940) (§4.5).
- Per-cell gene composition is pipeline-specific (§4.6).
- Asymmetry direction not predictable a priori (§4.7).
7. Reproducibility
- Script: `analyze.js` (Node.js, ~40 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: `result.json` with the 4-cell counts, P-fractions, Wilson 95% CIs, and per-cell top-30 gene composition.
- Verification mode: 5 machine-checkable assertions: (a) REVEL-only P-fraction > 40%; (b) AM-only P-fraction < 15%; (c) REVEL-only / AM-only ratio > 3.5×; (d) Wilson 95% CIs non-overlapping; (e) total variants > 250,000.
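The five verification assertions listed above could be expressed roughly as follows. This is a sketch, not the contents of `analyze.js`: the layout of the input object (per-cell `pathogenic`, `benign`, `ciLow`, `ciHigh`) and the function name `verify` are assumptions, since the actual `result.json` schema is not shown here.

```javascript
// Evaluate assertions (a) through (e) against per-cell counts and Wilson bounds.
function verify(cells) {
  const frac = (c) => c.pathogenic / (c.pathogenic + c.benign);
  const total = Object.values(cells).reduce((sum, c) => sum + c.pathogenic + c.benign, 0);
  const r = cells.revel_only;
  const a = cells.am_only;
  return {
    a_revel_only_above_40pct: frac(r) > 0.40,
    b_am_only_below_15pct: frac(a) < 0.15,
    c_ratio_above_3_5: frac(r) / frac(a) > 3.5,
    d_cis_non_overlapping: r.ciLow > a.ciHigh || a.ciLow > r.ciHigh,
    e_total_above_250k: total > 250000,
  };
}

// Reported values (CI bounds as fractions).
const reported = {
  both_missing: { pathogenic: 257, benign: 683, ciLow: 0.2459, ciHigh: 0.3028 },
  am_only: { pathogenic: 1020, benign: 8176, ciLow: 0.1047, ciHigh: 0.1175 },
  revel_only: { pathogenic: 1809, benign: 1928, ciLow: 0.4681, ciHigh: 0.5001 },
  both_present: { pathogenic: 73908, benign: 180243, ciLow: 0.2890, ciHigh: 0.2926 },
};
const checks = verify(reported);
console.log(checks);
console.log(Object.values(checks).every(Boolean) ? "all checks pass" : "some checks fail");
```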
Run the analysis with `node analyze.js` and the verification mode with `node analyze.js --verify`.
8. References
- Cheng, J., et al. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492.
- Ioannidis, N. M., et al. (2016). REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885.
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Pejaver, V., et al. (2022). Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations. Am. J. Hum. Genet. 109, 2163–2177.
- Karczewski, K. J., et al. (2020). gnomAD constraint spectrum. Nature 581, 434–443.
- Adam, M. P., et al. (2022). GeneReviews. University of Washington, Seattle.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.