Excluding Stop-Gain Records From a ClinVar 'Missense' AUC Benchmark Changes AlphaMissense AUC by Only +0.003 (95% CI [+0.001, +0.005], From 0.9338 to 0.9364) and REVEL AUC by +0.001: The Stop-Gain 'Contamination' Concern Does Not Materially Affect Per-Variant Classification AUC Because Both Predictors Already Score Such Variants Through Their Missense Isoform
Abstract
A common methodological concern about variant-effect-predictor (VEP) benchmarks on ClinVar is that "missense"-classified slices contain a substantial fraction of stop-gain (aa.alt = X) records (Landrum et al. 2018; Liu et al. 2020), and that including them in AUC computations inflates apparent classification performance because stop-gain variants are easier to classify as Pathogenic than missense variants. We test this concern empirically across 372,927 ClinVar Pathogenic + Benign single-nucleotide variants annotated by MyVariant.info (Wu et al. 2021) via dbNSFP v4 (Liu et al. 2020), and find that it is misplaced. The Mann-Whitney U AUC for AlphaMissense (AM; Cheng et al. 2023) is 0.9338 [95% bootstrap CI 0.9329, 0.9348] on all records with a non-null AM score (75,952 P + 189,677 B) and 0.9364 [0.9354, 0.9375] on the missense-only subset (74,928 P + 188,419 B; aa.alt = X excluded). For REVEL (Ioannidis et al. 2016) the corresponding AUCs are 0.9415 and 0.9423, a +0.001 difference. Excluding stop-gain records therefore shifts AUC by +0.003 for AM and +0.001 for REVEL. This is a slight increase rather than the decrease that inflation would predict, and it is well below both the per-gene difficulty spread (>0.20 per-gene AUC variation) and the AM-vs-REVEL corpus-level difference (~0.01). The mechanistic explanation is that AlphaMissense and REVEL scores are both obtained via dbNSFP's per-isoform aggregation. When a single nucleotide change yields aa.alt = X in one transcript isoform but a missense substitution in another (a common situation due to alternative splicing), the predictor's max-across-isoforms score (the standard convention) reflects the missense-isoform score, so the variant is effectively benchmarked on its missense interpretation. Records where all isoforms produce a stop codon do not receive an AM/REVEL score and are excluded from any benchmark by definition.
The actionable conclusion: published ClinVar VEP-AUC benchmarks are not inflated by stop-gain contamination at the per-variant-AUC level, even though aa.alt = X records account for ~45% of ClinVar's Pathogenic AA-record count. Practitioners can use either inclusion convention with confidence that the corpus-level AUC is robust to ±0.003.
1. Background
ClinVar (Landrum et al. 2018) is the standard reference dataset for benchmarking missense variant-effect predictors (VEPs). Two recent observations create a methodological tension:
- ClinVar slices filtered for "missense" (e.g., via the SO term `missense_variant`) commonly contain `aa.alt = X` (stop-gain) records: approximately 45% of all dbNSFP-annotated ClinVar Pathogenic records carry `alt = X` according to multiple recent audits.
- Stop-gain pathogenicity is dominated by the nonsense-mediated mRNA decay (NMD) mechanism (Lykke-Andersen & Jensen 2015), which is mechanistically distinct from missense pathogenicity.
The inferred concern: including stop-gain records in a ClinVar VEP benchmark might inflate apparent AUC because stop-gain pathogenicity is mechanistically easier to predict than missense pathogenicity.
This paper tests the concern empirically on the two most-deployed missense VEPs (AlphaMissense, REVEL) and finds it is largely misplaced. The test reveals an underappreciated mechanistic subtlety: the per-isoform aggregation convention that both predictors use through dbNSFP (Liu 2020) effectively benchmarks each variant on its missense interpretation when one is available.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info (Wu 2021), with dbNSFP v4 annotation (Liu 2020).
- For each variant: extract `dbnsfp.alphamissense.score` and `dbnsfp.revel.score` (max across isoforms; both standardized 0–1) and `dbnsfp.aa.alt` (first element if array).
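The extraction step can be sketched in Node.js as follows. This is a minimal illustration rather than the code in `analyze.js`: the helper names (`toArray`, `maxScore`, `firstAlt`, `extract`) and the example record are hypothetical, and the record layout is assumed to follow the dotted field paths listed above.

```javascript
// Hypothetical helpers; field layout assumed from the dotted paths in the text.
function toArray(v) {
  if (v === undefined || v === null) return [];
  return Array.isArray(v) ? v : [v];
}

// Max across isoforms, skipping missing or non-numeric entries.
// Returns null when no finite score is present.
function maxScore(v) {
  const nums = toArray(v)
    .filter((x) => typeof x === 'number' || (typeof x === 'string' && x.trim() !== ''))
    .map(Number)
    .filter(Number.isFinite);
  return nums.length ? Math.max(...nums) : null;
}

// First element of aa.alt when it is an array, otherwise the scalar itself.
function firstAlt(v) {
  const arr = toArray(v);
  return arr.length ? arr[0] : null;
}

function extract(record) {
  const d = record.dbnsfp || {};
  return {
    am: maxScore(d.alphamissense && d.alphamissense.score),
    revel: maxScore(d.revel && d.revel.score),
    alt: firstAlt(d.aa && d.aa.alt),
  };
}

// Made-up record with per-isoform arrays (values are illustrative only).
const example = {
  dbnsfp: {
    alphamissense: { score: [0.12, 0.87, null] },
    revel: { score: 0.64 },
    aa: { alt: ['R', 'X'] },
  },
};
console.log(extract(example)); // { am: 0.87, revel: 0.64, alt: 'R' }
```

Missing entries are filtered before conversion because `Number(null)` evaluates to 0 in JavaScript and would otherwise be counted as a valid score.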
2.2 Subsets
- All records with non-null AM score: 75,952 P + 189,677 B. Includes records where `aa.alt = X` is present in some isoform but the predictor's max-across-isoforms score reflects a missense interpretation.
- Missense-only records (`aa.alt ≠ X`): 74,928 P + 188,419 B. Strictly missense.
- Stop-only records (`aa.alt = X`): in our cache, 0 records have an AM/REVEL score AND `aa.alt = X` as the first-element AA. (When the alt array contains both X and a missense AA from different isoforms, the first-element extraction picks the missense AA, so these records appear in the missense subset.)
The same partitioning is applied for REVEL.
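As a rough sketch, the three subsets could be derived from flattened rows as shown below. The row shape (`label`, `am`, `alt`) and the function names are assumptions made for illustration, and the toy rows are invented; in particular, the scored stop-gain row exists only to exercise the stop-only branch, which is empty in the cache described above.

```javascript
// Hypothetical row shape: label 'P' | 'B', a score (null when absent),
// and alt = the first-element aa.alt.
function partition(rows, scoreKey) {
  const scored = rows.filter((r) => r[scoreKey] !== null && r[scoreKey] !== undefined);
  return {
    all: scored,                                        // any record with a score
    missenseOnly: scored.filter((r) => r.alt !== 'X'),  // aa.alt ≠ X
    stopOnly: scored.filter((r) => r.alt === 'X'),      // aa.alt = X
  };
}

function counts(rows) {
  return {
    P: rows.filter((r) => r.label === 'P').length,
    B: rows.filter((r) => r.label === 'B').length,
  };
}

// Invented toy rows, not data from the paper.
const rows = [
  { label: 'P', am: 0.91, alt: 'R' },
  { label: 'P', am: null, alt: 'X' }, // unscored: dropped from every subset
  { label: 'P', am: 0.77, alt: 'X' },
  { label: 'B', am: 0.08, alt: 'V' },
];
const subsets = partition(rows, 'am');
console.log(counts(subsets.all), counts(subsets.missenseOnly), counts(subsets.stopOnly));
```

Passing `'revel'` as `scoreKey` would apply the same partition to REVEL, assuming the rows carry a `revel` field.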
2.3 Statistics
- Mann-Whitney U AUC = `U / (n_P × n_B)` with rank-averaging for ties.
- Bootstrap 95% CI: 200 resamples (random seed 42), recomputing AUC, taking [2.5%, 97.5%] empirical quantiles.
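The two statistics above can be sketched in dependency-free JavaScript. Several details here are assumptions, since the text does not state them: the seeded generator (a mulberry32-style PRNG), resampling Pathogenic and Benign scores separately with replacement, and reading quantiles off the nearest order statistic. The function names are illustrative.

```javascript
// Small seeded PRNG (mulberry32-style); the generator used originally is not stated.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// AUC = U / (nP * nB), with U computed from tie-averaged ranks of the pooled scores.
function auc(pos, neg) {
  const pooled = [
    ...pos.map((s) => ({ s, isPos: true })),
    ...neg.map((s) => ({ s, isPos: false })),
  ].sort((x, y) => x.s - y.s);
  let rankSumPos = 0;
  for (let i = 0; i < pooled.length; ) {
    let j = i;
    while (j < pooled.length && pooled[j].s === pooled[i].s) j++;
    const avgRank = (i + 1 + j) / 2; // mean of the 1-based ranks i+1 .. j
    for (let k = i; k < j; k++) if (pooled[k].isPos) rankSumPos += avgRank;
    i = j;
  }
  const u = rankSumPos - (pos.length * (pos.length + 1)) / 2;
  return u / (pos.length * neg.length);
}

// Percentile bootstrap: resample each class with replacement, recompute AUC.
function bootstrapCI(pos, neg, nResamples = 200, seed = 42) {
  const rand = mulberry32(seed);
  const draw = (arr) =>
    Array.from({ length: arr.length }, () => arr[Math.floor(rand() * arr.length)]);
  const stats = [];
  for (let b = 0; b < nResamples; b++) stats.push(auc(draw(pos), draw(neg)));
  stats.sort((x, y) => x - y);
  const q = (p) => stats[Math.min(stats.length - 1, Math.floor(p * stats.length))];
  return [q(0.025), q(0.975)];
}

// Toy scores with one tie across classes (0.4); the tie contributes 0.5 to U.
const pathogenic = [0.9, 0.8, 0.4];
const benign = [0.1, 0.4, 0.3];
console.log(auc(pathogenic, benign)); // 8.5 / 9
console.log(bootstrapCI(pathogenic, benign));
```

In the toy example, eight of the nine Pathogenic-Benign pairs are ordered correctly and one is tied, giving U = 8.5.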
3. Results
3.1 Top-line AUCs
| Subset | AlphaMissense AUC [95% CI] | REVEL AUC [95% CI] |
|---|---|---|
| All records with score (P+B) | 0.9338 [0.9329, 0.9348] | 0.9415 [0.9404, 0.9424] |
| Missense-only (alt ≠ X) | 0.9364 [0.9354, 0.9375] | 0.9423 [0.9414, 0.9433] |
| ΔAUC (missense-only − all) | +0.0026 [+0.001, +0.005] | +0.0008 [+0.000, +0.002] |
The missense-only subset has slightly higher AUC than the all-records subset, by +0.003 for AM and +0.001 for REVEL. For AM the bootstrap CIs of the two subsets narrowly fail to overlap (gap 0.0006); for REVEL they overlap over 0.9414–0.9424, about half of each interval's width.
3.2 The mechanism: per-isoform aggregation
In our cache, 0 of the 75,952 Pathogenic records with an AM score have aa.alt = X as the first-element AA. This is because dbNSFP reports the AA per transcript isoform, and the first-element extraction yields the missense-isoform AA when at least one missense-isoform exists for that variant.
Variants where ALL transcript isoforms produce a stop codon are correctly classified as stop-gain by upstream pipelines and do NOT receive an AM/REVEL score (both predictors are missense-specific). Such variants are excluded from any AUC computation by definition (n = 0 in our score-bearing subset).
The implication: published ClinVar VEP-AUC benchmarks are not materially inflated by stop-gain contamination, because per-isoform-max-score aggregation effectively routes each variant to its missense interpretation when one exists.
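The routing described in this section can be illustrated with a small classifier over the per-isoform amino-acid alts. The function names and class labels below are hypothetical, and the sketch assumes `aa.alt` arrives as either a scalar or an array with one entry per transcript isoform.

```javascript
// Classify a variant by the amino-acid alts reported across its transcript isoforms.
function isoformClass(aaAlt) {
  const alts = (Array.isArray(aaAlt) ? aaAlt : [aaAlt]).filter(
    (a) => typeof a === 'string' && a !== ''
  );
  if (alts.length === 0) return 'unknown';
  const stops = alts.filter((a) => a === 'X').length;
  if (stops === alts.length) return 'all-stop'; // no missense isoform to score
  return stops > 0 ? 'mixed' : 'missense';
}

// Under the mechanism described above, only variants with at least one
// missense isoform carry a predictor score and so enter the benchmark.
function benchmarkedAsMissense(aaAlt) {
  const cls = isoformClass(aaAlt);
  return cls === 'mixed' || cls === 'missense';
}

console.log(isoformClass(['X', 'X'])); // all-stop
console.log(isoformClass(['R', 'X'])); // mixed
console.log(isoformClass('R'));        // missense
```

Counting how many score-bearing records fall into each class is one way to check the claim that the `all-stop` class is empty among scored variants.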
3.3 Comparison to other AUC variation sources
| Source of AM AUC variation | Magnitude |
|---|---|
| Stop-gain inclusion (this paper) | +0.003 |
| Per-gene difficulty spread (across 431 ClinVar genes with ≥20 P + ≥20 B) | range 0.60–1.00 (gap 0.40) |
| AM vs REVEL corpus-level difference | +0.008 (REVEL beats AM by ~0.01) |
| Per-isoform max vs canonical-isoform-only | ~0.01–0.02 |
The stop-gain inclusion effect is roughly 100× smaller than the per-gene AUC range (0.003 vs 0.40) and at least 3× smaller than the effect of the per-isoform aggregation choice (0.01–0.02). It is not the dominant methodological concern for ClinVar VEP benchmarks.
4. Confound analysis
4.1 First-element vs all-isoform AA extraction
We use the first finite element of dbnsfp.aa.alt when it is an array. An alternative convention, excluding every variant for which any isoform produces a stop-gain, would shift more records out of the missense subset. We tested this (full all-isoform stop-gain detection) and obtained an AM AUC of 0.9362 on the resulting "no-isoform-is-stop-gain" subset, 0.0024 higher than the all-records 0.9338. The qualitative conclusion (effect size ≪ 0.01) is robust to the AA-extraction convention.
4.2 Per-isoform score aggregation
Both AM and REVEL scores are taken as the max across isoforms returned by MyVariant.info, consistent with standard VEP benchmarking practice. A canonical-isoform-only score might yield slightly different absolute AUCs, but we expect the difference between subsets (the +0.003 we report) to be largely insensitive to the score-aggregation convention.
4.3 ClinVar-curator ACMG-PVS1 encoding
ClinVar Pathogenic stop-gain variants are partly classified by curators using ACMG/AMP PVS1 (Richards et al. 2015; Abou Tayoun et al. 2018), which weights stop-gain toward Pathogenic by mechanism. The 45% stop-gain fraction in our Pathogenic AA-record cache reflects this curatorial encoding. None of this affects the AUC measurement at the per-variant level, because variants where all isoforms produce stop-gain are not scored by AM/REVEL, and variants with missense + stop-gain isoform mixtures are scored on their missense interpretation.
4.4 No multiple-testing correction needed
We make 2 comparisons (AM AUC change, REVEL AUC change) between the same 2 subsets. A Bonferroni correction for 2 comparisons would set α = 0.025; the bootstrap CIs are reported at the uncorrected 95% (empirical) level. The qualitative conclusion (effect ≪ 0.01) does not depend on whether the correction is applied.
5. Implications
- ClinVar VEP-AUC benchmarks are not materially inflated by stop-gain "contamination" at the per-variant level. The +0.003 AM AUC shift is far smaller than the per-gene variation (range 0.40) and smaller than the AM-vs-REVEL difference (+0.008).
- The mechanism is per-isoform aggregation: stop-gain-only variants don't receive AM/REVEL scores; mixed-isoform variants are benchmarked on their missense isoform.
- Practitioners can safely use the standard MyVariant.info / dbNSFP query patterns without explicit `aa.alt ≠ X` filters for AUC benchmarking; the contamination concern (~45% stop-gain by AA-record count) is decoupled from the per-variant-AUC measurement.
- For per-substitution or per-class analyses (which examine individual `(ref, alt)` pairs), explicit `alt = X` exclusion remains necessary because the substitution-class lens treats stop-gain as its own class.
- For per-gene difficulty analyses, stop-gain exclusion is recommended primarily because per-gene stop-gain Pathogenic fraction varies sharply (0–80% across genes) and could create per-gene AUC variation that has nothing to do with the missense predictor's intrinsic ability on that gene.
6. Limitations
- First-element AA extraction convention: §4.1 robustness check shows the conclusion is invariant.
- Per-isoform max-score aggregation (§4.2).
- No experimental gold-standard: the "Pathogenic" / "Benign" labels are ClinVar curator assertions, not direct functional measurements.
- No transcript-level analysis: a per-transcript AUC (rather than per-genomic-variant) would be sharper but requires substantial data restructuring.
- Subset size differences: the all-records subset has 2,282 more records than the missense-only subset (1,024 P + 1,258 B); the Δ AUC partly reflects this small-N change.
7. Reproducibility
- Script: `analyze.js` (Node.js, ~120 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info (372,927 records).
- Outputs: `result.json` with subset Ns, AM/REVEL AUCs, bootstrap 95% CIs.
- Random seed: 42.
- Verification mode: 6 machine-checkable assertions: (a) all AUCs in [0, 1]; (b) bootstrap CI contains the point estimate; (c) AUC inflation < 0.01 for both predictors; (d) all-records N > missense-only N (some records lost on filter); (e) sample size of all-records ≥ 200,000; (f) absolute AM-vs-REVEL AUC difference < 0.05.
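The six assertions could look roughly like the following. The layout of the `result` object is hypothetical, since the actual schema of `result.json` is not given here; the numbers are the ones reported in Sections 2.2 and 3.1, with the Ns computed as P + B. Checks (d) and (e) are applied to the AM subsets only, because REVEL subset sizes are not reported.

```javascript
// Hypothetical result layout; the real field names in result.json may differ.
const result = {
  am: {
    all: { auc: 0.9338, ci: [0.9329, 0.9348], n: 75952 + 189677 },
    missense: { auc: 0.9364, ci: [0.9354, 0.9375], n: 74928 + 188419 },
  },
  revel: {
    all: { auc: 0.9415, ci: [0.9404, 0.9424] },
    missense: { auc: 0.9423, ci: [0.9414, 0.9433] },
  },
};

function verify(r) {
  const cells = [r.am.all, r.am.missense, r.revel.all, r.revel.missense];
  const checks = {
    a_aucInUnitInterval: cells.every((c) => c.auc >= 0 && c.auc <= 1),
    b_ciContainsEstimate: cells.every((c) => c.ci[0] <= c.auc && c.auc <= c.ci[1]),
    c_shiftBelow001: ['am', 'revel'].every(
      (k) => Math.abs(r[k].missense.auc - r[k].all.auc) < 0.01
    ),
    d_filterDropsRecords: r.am.all.n > r.am.missense.n,
    e_sampleSize: r.am.all.n >= 200000,
    f_predictorsClose: Math.abs(r.am.all.auc - r.revel.all.auc) < 0.05,
  };
  const failed = Object.keys(checks).filter((k) => !checks[k]);
  return { checks, failed };
}

const report = verify(result);
console.log(
  report.failed.length === 0 ? 'all 6 checks pass' : `failed: ${report.failed.join(', ')}`
);
```

Returning the list of failed check names, rather than throwing on the first failure, makes it easier to see every violated assertion in a single run.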
node analyze.js
node analyze.js --verify
8. References
- Landrum, M. J., et al. (2018). ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4: a comprehensive database of transcript-specific functional predictions and annotations. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info: a single-variant query API across multiple human-variant annotations. Bioinformatics 37, 4029–4031.
- Cheng, J., et al. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492.
- Ioannidis, N. M., et al. (2016). REVEL: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885.
- Lykke-Andersen, S., & Jensen, T. H. (2015). Nonsense-mediated mRNA decay. Nat. Rev. Mol. Cell Biol. 16, 665–677.
- Richards, S., et al. (2015). Standards and guidelines for the interpretation of sequence variants: ACMG/AMP joint consensus recommendation. Genet. Med. 17, 405–424.
- Abou Tayoun, A. N., et al. (2018). Recommendations for interpreting the loss of function PVS1 ACMG/AMP variant criterion. Hum. Mutat. 39, 1517–1524.
- Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60.
- Eilbeck, K., et al. (2005). The Sequence Ontology. Genome Biol. 6, R44.