Per-Variant Distinct UniProt-Accession Count Predicts ClinVar Pathogenicity Non-Monotonically: Variants Mapping to ≥11 Distinct UniProt Accessions Have a 45.66% Pathogenic-Fraction (Wilson 95% CI [43.62, 47.71]) — 1.74× the Single-Accession Baseline of 26.32% — Documenting an Isoform-Rich-Gene Pathogenicity-Enrichment Pattern Across 268,024 Missense Variants
Abstract
We compute the per-variant Pathogenic-fraction stratified by the distinct UniProt-accession count: the number of unique base UniProt accessions a variant maps to in the dbNSFP v4 (Liu et al. 2020) dbnsfp.uniprot field, retrieved via MyVariant.info (Wu et al. 2021). Each retained variant maps to ≥1 UniProt accession; high counts indicate isoform-rich genes or genes with many alternative protein products. Isoform-suffix duplication is excluded because we count base accessions such as P12345, not suffixed forms such as P12345-2. Stop-gain records (alt = X) are excluded. Across 268,024 ClinVar missense single-nucleotide variants:
| Distinct accessions | Pathogenic | Benign | N | P-fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| 1 (single canonical) | 36,362 | 101,808 | 138,170 | 26.32% | [26.09, 26.55] |
| 2 | 21,049 | 47,630 | 68,679 | 30.65% | [30.30, 30.99] |
| 3 | 7,439 | 22,087 | 29,526 | 25.19% | [24.70, 25.69] |
| 4-5 | 6,950 | 11,764 | 18,714 | 37.14% | [36.45, 37.83] |
| 6-10 | 4,158 | 6,508 | 10,666 | 38.98% | [38.06, 39.91] |
| ≥ 11 | 1,036 | 1,233 | 2,269 | 45.66% | [43.62, 47.71] |
Result: variants in genes with many distinct UniProt accessions (≥4) show a substantially elevated Pathogenic-fraction (37.14-45.66%) compared with single-accession variants (26.32%). The ≥11-accession bin reaches 45.66% Pathogenic, 1.74× the single-accession baseline. The 1-3 accession bins are non-monotonic (26.32% → 30.65% → 25.19%), likely reflecting heterogeneous mixing of genes; from the 4-5 accession bin onward the gradient is monotonic.
The 2,269 variants with ≥11 accessions are concentrated in major Mendelian disease genes, including CACNA1A (491 variants; familial hemiplegic migraine, episodic ataxia, SCA6), DDX3X (128; X-linked intellectual disability), TSC1 (118; tuberous sclerosis), PAX6 (101; aniridia), CHD4 (74; Sifrim-Hitz-Weiss syndrome), FOXP1 (69), MYT1L (56), GABRG2 (46), TCF4 (43; Pitt-Hopkins), MEF2C (41), KCNMA1 (39), CSNK2A1 (16; Okur-Chung syndrome), and KIF1A (15).
Mechanism: variants mapping to many UniProt accessions typically lie in isoform-rich, alternatively-spliced genes whose clinical phenotypes are dosage-sensitive. The high-isoform-count gene class is enriched for: (a) developmental / neurological disease genes (PAX6, TCF4, MEF2C, MYT1L, CHD4, FOXP1, DDX3X, KIF1A), where multiple isoforms regulate cell-type-specific expression and disease occurs via dosage disruption; (b) channels with extensive splicing (CACNA1A, GABRG2, KCNMA1), where splice-variant-specific functions are critical; (c) tumor suppressors / chromatin regulators (TSC1, BRCA1, KAT6B, CHD4, SMARCE1) with multi-isoform regulation.
For variant-prioritization: the per-variant UniProt-accession count is a precomputable metadata feature directly available from the dbnsfp.uniprot field, with a 1.74× P-fraction ratio between the ≥11-accession and single-accession bins. Variants in ≥11-accession isoform-rich genes carry an elevated Pathogenicity prior, reflecting their over-representation in dosage-sensitive Mendelian disease genes.
1. Background
The dbNSFP dbnsfp.uniprot field reports all UniProt accessions a variant maps to. Distinct base accessions (e.g., P12345 vs Q67890) reflect variants in proteins with multiple UniProt entries — typically alternatively-spliced isoforms (each isoform may have its own UniProt accession in the canonical isoform-set), paralog ambiguity (variants in genes with closely-related paralogs may map to multiple accessions), or multi-protein-coding loci.
The per-variant distinct accession count is a gene-architecture metadata feature: high counts indicate isoform-rich genes; low counts indicate single-isoform-dominant genes.
This paper measures the per-accession-count Pathogenic-fraction and identifies the gene-class enrichment at high counts.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info, with dbNSFP v4 annotation.
- For each variant: extract dbnsfp.aa.ref, dbnsfp.aa.alt, and dbnsfp.uniprot (a list of UniProt accession entries).
- Exclude stop-gain (alt = X) and same-AA records.
After filtering: 268,024 missense SNVs.
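The filter described above can be sketched as follows. This is a minimal illustration, not the code of analyze.js: the record shape ({ dbnsfp: { aa: { ref, alt } } }), the helper names firstValue and isMissense, and the choice to take the first element when a field is returned as an array are all assumptions made for the example.

```javascript
// Hypothetical helper: API fields may arrive as a scalar or as an array;
// this sketch simply takes the first element (an assumption, not the paper's rule).
function firstValue(x) {
  return Array.isArray(x) ? x[0] : x;
}

// Returns true when the record looks like a missense change:
// both amino acids are present, alt is not "X" (stop-gain), and ref !== alt.
function isMissense(record) {
  const aa = firstValue(record?.dbnsfp?.aa);
  if (!aa) return false;
  const ref = firstValue(aa.ref);
  const alt = firstValue(aa.alt);
  if (typeof ref !== "string" || typeof alt !== "string") return false;
  if (alt === "X") return false; // stop-gain excluded
  if (ref === alt) return false; // same-AA record excluded
  return true;
}
```

Records that lack amino-acid annotation are dropped by this sketch as well, which seems consistent with the missense-only scope but is not stated explicitly in the text.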
2.2 Distinct accession count
For each variant: extract all UniProt accession entries from dbnsfp.uniprot. For each entry, take the base accession (drop the isoform suffix; e.g., P12345-2 → P12345). Count the distinct base accessions. Variants with 0 accessions are excluded.
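A minimal sketch of this counting step is given below. It assumes each entry of dbnsfp.uniprot is either a plain accession string or an object with an acc property, and that a single entry may be returned without an enclosing array; these shapes, and the function names, are assumptions for illustration rather than a description of analyze.js.

```javascript
// Strip an isoform suffix: "P12345-2" -> "P12345".
function baseAccession(acc) {
  return acc.split("-")[0];
}

// Count distinct base accessions in a dbnsfp.uniprot-like field.
// Returns 0 when nothing usable is present (such variants are excluded upstream).
function distinctAccessionCount(uniprotField) {
  if (uniprotField == null) return 0;
  const entries = Array.isArray(uniprotField) ? uniprotField : [uniprotField];
  const bases = new Set();
  for (const entry of entries) {
    // Assumed entry shapes: "P12345-2" or { acc: "P12345-2", ... }.
    const raw = typeof entry === "string" ? entry : entry && entry.acc;
    if (typeof raw !== "string") continue;
    const acc = raw.trim();
    if (acc.length > 0) bases.add(baseAccession(acc));
  }
  return bases.size;
}
```

For example, a field holding P12345, P12345-2, and Q67890 yields a count of 2, because the two P12345 forms collapse to one base accession.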
2.3 Bin classification
6 bins: 1, 2, 3, 4-5, 6-10, ≥11 distinct accessions.
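The bin assignment can be written as a small pure function. The label strings and the function name accessionBin are illustrative choices, not taken from the original script.

```javascript
// Map a distinct-accession count to one of the six bins used in the paper.
// Returns null for counts below 1 (variants with 0 accessions are excluded).
function accessionBin(count) {
  if (!Number.isInteger(count) || count < 1) return null;
  if (count <= 3) return String(count); // "1", "2", "3"
  if (count <= 5) return "4-5";
  if (count <= 10) return "6-10";
  return ">=11";
}
```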
2.4 Per-bin tabulation
For each bin, count Pathogenic and Benign. Compute Pathogenic-fraction with Wilson 95% CI (Brown et al. 2001).
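For reference, the Wilson score interval can be computed as in the sketch below. It uses z = 1.96 for the 95% level and the standard closed-form expression; the function name and return shape are assumptions for the example.

```javascript
// Wilson score interval for a binomial proportion.
// successes: Pathogenic count; n: Pathogenic + Benign; z: normal quantile (1.96 for 95%).
function wilsonInterval(successes, n, z = 1.96) {
  if (n <= 0) throw new RangeError("n must be positive");
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return { fraction: p, lower: center - half, upper: center + half };
}

// Usage with the >=11-accession bin counts from the table (1,036 Pathogenic of 2,269).
const highBin = wilsonInterval(1036, 2269);
console.log(
  (100 * highBin.fraction).toFixed(2),
  (100 * highBin.lower).toFixed(2),
  (100 * highBin.upper).toFixed(2)
);
```

Applied to the ≥11-accession counts, this should agree with the tabulated interval of [43.62, 47.71] to within rounding.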
3. Results
3.1 The 6-bin gradient
(Full table in the Abstract.)
The per-bin P-fraction distribution:
- 1 accession (138,170 variants): 26.32%.
- 2 accessions (68,679): 30.65%.
- 3 accessions (29,526): 25.19% (slight dip).
- 4-5 accessions (18,714): 37.14%.
- 6-10 accessions (10,666): 38.98%.
- ≥ 11 accessions (2,269): 45.66%.
The gradient is monotonic from the 4-5 accession bin onward; the 3-accession bin dips below both the 1- and 2-accession bins (heterogeneous mixing). The 4-5-vs-1 ratio is 1.41×; the ≥11-vs-1 ratio is 1.74×.
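The fractions and ratios can be rechecked directly from the tabulated counts. The sketch below hard-codes the Pathogenic and Benign counts from the table; the variable and function names are illustrative. Computed from these counts, the ≥11-vs-1 ratio comes out between 1.73 and 1.74, and the 4-5-vs-1 ratio is about 1.41.

```javascript
// Pathogenic / Benign counts per bin, copied from the table in the Abstract.
const bins = [
  { label: "1", pathogenic: 36362, benign: 101808 },
  { label: "2", pathogenic: 21049, benign: 47630 },
  { label: "3", pathogenic: 7439, benign: 22087 },
  { label: "4-5", pathogenic: 6950, benign: 11764 },
  { label: "6-10", pathogenic: 4158, benign: 6508 },
  { label: ">=11", pathogenic: 1036, benign: 1233 },
];

// Pathogenic-fraction of one bin.
function pathogenicFraction(bin) {
  return bin.pathogenic / (bin.pathogenic + bin.benign);
}

// Ratio of a bin's Pathogenic-fraction to the single-accession baseline.
function ratioToBaseline(label) {
  const baseline = pathogenicFraction(bins[0]);
  const bin = bins.find((b) => b.label === label);
  return pathogenicFraction(bin) / baseline;
}

const totalVariants = bins.reduce((s, b) => s + b.pathogenic + b.benign, 0);
for (const b of bins) {
  console.log(b.label, (100 * pathogenicFraction(b)).toFixed(2) + "%");
}
console.log("total", totalVariants);
console.log("4-5 vs 1:", ratioToBaseline("4-5").toFixed(3));
console.log(">=11 vs 1:", ratioToBaseline(">=11").toFixed(3));
```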
3.2 The ≥11-accession high-isoform-rich subset
The 2,269 variants with ≥11 distinct accessions are spread across 100+ genes, with the top contributors being major Mendelian disease genes:
Channels (extensive splicing):
- CACNA1A (491): episodic ataxia, familial hemiplegic migraine, SCA6.
- GABRG2 (46): epilepsy.
- KCNMA1 (39): epilepsy, paroxysmal dyskinesia.
Transcription factors / chromatin regulators (multi-isoform regulation) and other developmental / neurological disease genes:
- DDX3X (128): X-linked intellectual disability.
- PAX6 (101): aniridia.
- CHD4 (74): Sifrim-Hitz-Weiss syndrome.
- TCF4 (43): Pitt-Hopkins syndrome.
- MEF2C (41): MEF2C haploinsufficiency.
- MYT1L (56): MYT1L syndrome.
- FOXP1 (69): intellectual disability.
- KAT6B (75): genitopatellar syndrome / SBBYSS (Say-Barber-Biesecker-Young-Simpson syndrome).
- ZEB2 (30): Mowat-Wilson syndrome.
- SMARCE1 (25): Coffin-Siris.
- CASK (37): MICPCH (microcephaly with pontine and cerebellar hypoplasia).
- WDR45 (49): BPAN (β-propeller protein-associated neurodegeneration).
- ALDH7A1 (25): pyridoxine-dependent epilepsy.
- KIF1A (15): KIF1A-related disorders.
- CSNK2A1 (16): Okur-Chung syndrome.
Tumor suppressors / large multi-domain proteins:
- TSC1 (118): tuberous sclerosis.
- BRCA1 (60).
- DMD (62).
- PCDH15 (60): Usher syndrome.
These genes undergo alternative splicing that produces multiple isoforms, which here map to separate UniProt accessions. A variant mapping to ≥11 accessions therefore overlaps positions in 11 or more distinct UniProt entries.
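A per-gene tally of the kind used to rank these contributors could look like the sketch below. The gene property on each variant record is a hypothetical field name (the text does not say which annotation field supplied the gene symbol), and topGenes is an illustrative helper rather than the original script's code.

```javascript
// Count variants per gene and return the k most frequent genes.
// Ties are broken alphabetically so the output order is deterministic.
function topGenes(variants, k = 30) {
  const counts = new Map();
  for (const v of variants) {
    if (!v || typeof v.gene !== "string" || v.gene.length === 0) continue;
    counts.set(v.gene, (counts.get(v.gene) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, k)
    .map(([gene, n]) => ({ gene, n }));
}
```

The input would be restricted to variants already assigned to the ≥11-accession bin before calling the function.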
3.3 The mechanism: dosage-sensitive isoform-rich disease genes
The high-isoform-count gene class is enriched for dosage-sensitive Mendelian disease genes where multiple isoforms regulate cell-type-specific or developmental-stage-specific expression. Disruption of one isoform may compromise specific cellular contexts without abolishing overall protein function — leading to phenotypes that depend on isoform-balance.
The 45.66% Pathogenic-fraction in the ≥11-accession bin reflects this enrichment: the gene class includes mostly clinically-actionable disease genes with extensive ClinVar curation.
3.4 The 3-accession dip
The 3-accession bin has a lower P-fraction (25.19%) than the 1- and 2-accession bins (26.32% and 30.65%). This non-monotonicity likely reflects:
- The 3-accession set may include genes with a mix of 2 paralogs + 1 alternative isoform (a heterogeneous gene class).
- Some 3-accession genes are large but have simple splicing (many population-frequency variants).
The dip is small (about 5.5 pp below the 2-accession bin and about 1.1 pp below the 1-accession bin) and does not invalidate the monotonic gradient from the 4-5 accession bin onward.
3.5 Implications for variant-prioritization
The per-variant distinct UniProt accession count is a precomputable feature directly from the dbnsfp.uniprot field. The per-bin P-fraction provides a gene-architecture-encoded prior:
- Variants in ≥11-accession isoform-rich genes: prior ~46% — strongly Pathogenic-leaning.
- Variants in single-canonical-accession genes: prior ~26% — close to global baseline.
The 1.74× range provides actionable per-variant prior signal complementary to per-variant predictor scores.
3.6 The accession count is not equivalent to alternative splicing count
The accession count reflects how many UniProt entries the variant overlaps with, not the gene's actual isoform count. Some genes with extensive alternative splicing may have a single UniProt accession (canonical) with isoform suffixes (-2, -3, ...); these would count as 1 base accession in our metric. Other genes may have multiple separate UniProt entries (e.g., for paralogs).
The 4+ accession threshold likely captures genes where: (a) the gene has multiple distinct UniProt entries for its isoforms (rare), OR (b) the variant lies in a region overlapping with multiple paralog-or-related genes' UniProt entries.
3.7 The non-circular feature
The distinct-accession count is derived from the dbNSFP / UniProt mapping. It is independent of ClinVar curator labels and of predictor training, so using it as a prior is non-circular with respect to both.
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 The accession count reflects UniProt mapping
The count depends on dbNSFP's UniProt-mapping pipeline. Pipeline updates may shift counts.
4.3 ClinVar curator labels are not gold-standard
Some labels are wrong. The reported Wilson 95% CIs reflect sampling variability only; they do not account for label error.
4.4 The gene-class enrichment is post-hoc
The disease-gene-class identification in §3.2 is post-hoc by gene-name lookup. The 100+ genes with ≥11-accession variants are heterogeneous.
4.5 The 3-accession dip is unexplained
The non-monotonicity at 3 accessions is observed but not mechanistically explained.
4.6 Different from per-isoform AlphaMissense range
The accession count is distinct from per-variant AM-score-range-across-isoforms (which measures per-variant predictor variability). The two features capture different aspects.
4.7 The accession-count feature is gene-architecture-derived
The feature reflects the gene's isoform/paralog architecture, not the per-variant biology. The Pathogenicity correlation derives from disease-gene-architecture enrichment in the high-count bins.
5. Implications
- Per-variant distinct UniProt accession count predicts ClinVar Pathogenicity: 26.32% (1 accession) → 45.66% (≥11 accessions), 1.74× ratio.
- The ≥11-accession bin is concentrated in major Mendelian disease genes (CACNA1A, DDX3X, PAX6, TSC1, CHD4, TCF4, MEF2C, MYT1L, etc.) — typically dosage-sensitive isoform-rich genes.
- The 1-3 accession bins are non-monotonic (25-31% range); the gradient from the 4-5 accession bin onward is monotonic.
- The mechanism is gene-architecture-encoded: high-isoform-count genes are over-represented among dosage-sensitive Mendelian disease genes.
- For variant-prioritization: per-variant accession count is precomputable from dbnsfp.uniprot and provides a non-circular prior with a 1.74× per-bin range.
6. Limitations
- Stop-gain excluded (§4.1).
- Accession count reflects dbNSFP/UniProt mapping pipeline (§4.2).
- ClinVar labels not gold-standard (§4.3).
- Gene-class enrichment is post-hoc (§4.4).
- 3-accession dip is unexplained (§4.5).
- Different from per-isoform AM range (§4.6).
- Feature is gene-architecture-derived, not per-variant biology (§4.7).
7. Reproducibility
- Script: analyze.js (Node.js, ~30 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with per-bin counts, P-fractions, Wilson 95% CIs, and the top-30 contributing genes for the ≥11-accession bin.
- Verification mode: 5 machine-checkable assertions: (a) ≥11 bin P-fraction > 43%; (b) 1 bin P-fraction in [25, 28]; (c) gradient ratio (≥11 / 1) > 1.6×; (d) total variants > 250,000; (e) ≥11 bin has > 1,000 variants.
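The five assertions can be expressed as a small check over the parsed result object, as sketched below. The layout of result.json is not specified in the text, so the property names used here (bins, pFraction, n, totalVariants) and the bin keys "1" and ">=11" are hypothetical; only the thresholds come from the list above.

```javascript
// Evaluate the five verification assertions against a result-like object.
// Assumed (hypothetical) shape:
//   { totalVariants: number, bins: { "1": { n, pFraction }, ">=11": { n, pFraction } } }
// with pFraction expressed in percent.
function verifyResult(result) {
  const one = result.bins["1"];
  const high = result.bins[">=11"];
  const checks = {
    a: high.pFraction > 43,                           // >=11 bin P-fraction > 43%
    b: one.pFraction >= 25 && one.pFraction <= 28,    // 1 bin P-fraction in [25, 28]
    c: high.pFraction / one.pFraction > 1.6,          // gradient ratio > 1.6x
    d: result.totalVariants > 250000,                 // total variants > 250,000
    e: high.n > 1000,                                 // >=11 bin has > 1,000 variants
  };
  const passed = Object.values(checks).every(Boolean);
  return { passed, checks };
}
```

A wrapper could exit with a non-zero status when passed is false, which is one common way to make such checks machine-checkable in a pipeline.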
node analyze.js
node analyze.js --verify
8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- UniProt Consortium (2023). UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531.
- Pan, Q., Shai, O., Lee, L. J., Frey, B. J., & Blencowe, B. J. (2008). Deep surveying of alternative splicing complexity. Nat. Genet. 40, 1413–1415.
- Wang, E. T., et al. (2008). Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476.
- Karczewski, K. J., et al. (2020). gnomAD constraint spectrum. Nature 581, 434–443.
- Adam, M. P., et al. (2022). GeneReviews. University of Washington, Seattle.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.