Per-Gene Spatial Clustering of ClinVar Pathogenic Missense Variants Is 7.82× More Common Than Per-Gene Spatial Clustering of Benign Variants: 6.63% of 709 Pathogenic Genes Have Inter-Quartile-Range / Protein-Length < 0.10 (Highly Clustered "Hotspot" Pattern) Vs Only 0.85% of 1,416 Benign Genes — Mean Per-Gene Pathogenic IQR/L = 0.361 vs Benign 0.455 Across Genes With ≥20 Variants Each
Abstract
We measure the per-gene spatial clustering of variant residue positions within each protein for both ClinVar (Landrum et al. 2018) Pathogenic and Benign missense single-nucleotide variants in dbNSFP v4 (Liu et al. 2020) via MyVariant.info (Wu et al. 2021); stop-gain (alt = X) excluded. For each gene with ≥ 20 variants of a given label and an AlphaFold (Varadi et al. 2022) protein-length annotation ≥ 100 residues, we compute the inter-quartile range (IQR) of variant positions normalized by protein length as the per-gene clustering metric. IQR/L < 0.10 indicates a highly-clustered "hotspot" pattern (the central 50% of variant positions span less than 10% of the protein length); IQR/L → 0.5 indicates uniformly distributed variants across the full protein. Result:
| Statistic | Pathogenic | Benign |
|---|---|---|
| Eligible genes (≥ 20 variants of label) | 709 | 1,416 |
| Mean per-gene IQR/L | 0.361 | 0.455 |
| Median per-gene IQR/L | 0.367 | 0.456 |
| Fraction with IQR/L < 0.05 (extreme cluster) | 1.55% (11) | 0.14% (2) |
| Fraction with IQR/L < 0.10 (highly clustered) | 6.63% (47) | 0.85% (12) |
| Fraction with IQR/L ≥ 0.40 (broad spread) | 42.32% (300) | 68.64% (972) |
Pathogenic genes are 7.82× more likely than Benign genes to show a highly clustered pattern with IQR/L < 0.10 (6.63% vs 0.85%; Wilson 95% CIs [5.02, 8.70] vs [0.49, 1.48], non-overlapping). The mean per-gene IQR/L is 21% lower for Pathogenic (0.361) than for Benign (0.455): Pathogenic variants concentrate in narrower protein subregions on average. The most-clustered Pathogenic genes (the 30 with the lowest IQR/L) include known mutational-hotspot genes: SETBP1 (IQR/L = 0.004; Schinzel-Giedion syndrome, SKI/SnoN-binding cluster), PPP2R1A (0.008; PP2A scaffold mutations cluster in HEAT repeats 5/7), ZEB2 (0.018; Mowat-Wilson, clusters in Smad-interacting domain), FUS (0.019; ALS, C-terminal NLS cluster), APP (0.030; Alzheimer's, β/γ-secretase cleavage-site cluster), CBL (0.041; Noonan-like, TKB/RING-linker cluster), and many transcription factors with DNA-binding-domain clusters (TFAP2A, YY1, MEF2C, FOXF1, GATA2, CTCF, TCF4, SOX10, ZBTB18). The 7.82× ratio between Pathogenic and Benign clustering is the per-gene-level signature of focal functional importance: in genes where Pathogenic variants cluster narrowly, the cluster region typically encodes a functionally critical domain or motif. Benign variants distribute more broadly, consistent with population-genome sequencing identifying variants across the whole gene without functional bias.
1. Background
The phenomenon of mutational hotspots in disease genes is well-documented for individual genes: TP53 codons 175/245/248/273 in cancer (Vogelstein et al. 2013); HRAS/KRAS/NRAS codons 12/13/61 in cancer (Prior et al. 2012); BRAF V600 in cancer (Davies et al. 2002); FGFR3 codons in achondroplasia. The hotspot pattern reflects functional concentration: variants at a few specific residue positions disrupt a critical functional element (catalytic site, regulatory motif, conformational switch).
What has been less quantified is the genome-wide rate of per-gene Pathogenic variant clustering vs the analogous rate for Benign variants. If clustering is a generic property of disease-curation, both Pathogenic and Benign should show similar clustering distributions. If clustering is specifically driven by functional concentration, Pathogenic should cluster more than Benign.
This paper measures the per-gene clustering distribution and demonstrates the 7.82× Pathogenic-vs-Benign clustering ratio.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info, with dbNSFP v4 annotation.
- 20,228 human canonical UniProt accessions with AFDB protein-length annotations (Varadi et al. 2022).
- For each variant: extract `dbnsfp.aa.ref`, `dbnsfp.aa.alt`, `dbnsfp.aa.pos`, `dbnsfp.uniprot`, `dbnsfp.genename`.
- Exclude stop-gain (`alt = X`) and same-AA records.
- Map each variant to the canonical _HUMAN UniProt accession with cached AFDB structure.
- Filter: protein length ≥ 100 AND variant `aa.pos` ≤ protein length.
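The filters listed above can be sketched as a single predicate. This is a minimal illustration and not the actual `analyze.js` code: the record shape, the function name `keepVariant`, the `proteinLength` map, and the toy accessions are all assumptions, and the `pos >= 1` check is an extra sanity guard that the text does not mention.

```javascript
// Hypothetical record shape: { gene, uniprot, aa: { ref, alt, pos } },
// with dbNSFP fields already flattened to scalars (an assumption).
function keepVariant(rec, proteinLength) {
  const { ref, alt, pos } = rec.aa;
  if (alt === "X") return false; // stop-gain excluded
  if (ref === alt) return false; // same-AA record excluded
  const len = proteinLength.get(rec.uniprot);
  if (len === undefined || len < 100) return false; // need an AFDB length >= 100
  return Number.isInteger(pos) && pos >= 1 && pos <= len; // aa.pos <= protein length
}

// Toy inputs with made-up accessions and lengths.
const proteinLength = new Map([["ACC_LONG", 500], ["ACC_SHORT", 80]]);
const records = [
  { gene: "GENE_A", uniprot: "ACC_LONG", aa: { ref: "R", alt: "H", pos: 175 } },
  { gene: "GENE_A", uniprot: "ACC_LONG", aa: { ref: "R", alt: "X", pos: 200 } },
  { gene: "GENE_A", uniprot: "ACC_LONG", aa: { ref: "G", alt: "G", pos: 10 } },
  { gene: "GENE_A", uniprot: "ACC_LONG", aa: { ref: "A", alt: "T", pos: 501 } },
  { gene: "GENE_B", uniprot: "ACC_SHORT", aa: { ref: "A", alt: "T", pos: 5 } },
];
const kept = records.filter((r) => keepVariant(r, proteinLength));
console.log(kept.length); // 1 (only the first record passes every filter)
```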
2.2 Per-gene aggregation
For each (gene, label) pair, collect the set of variant residue positions. Restrict to genes with ≥ 20 variants of that label class.
After filtering: 709 Pathogenic genes and 1,416 Benign genes (genes can appear in both classes).
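The aggregation step can be sketched as follows. The function name `positionsByGeneLabel` and the `{ gene, label, pos }` record shape are hypothetical, and the sketch assumes that every variant contributes its position (repeated positions are kept), which the text does not state explicitly.

```javascript
// Group residue positions by (gene, label) and keep groups with >= minCount variants.
function positionsByGeneLabel(variants, minCount = 20) {
  const groups = new Map();
  for (const v of variants) {
    const key = `${v.gene}|${v.label}`;
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(v.pos);
  }
  const eligible = new Map();
  for (const [key, positions] of groups) {
    if (positions.length >= minCount) eligible.set(key, positions);
  }
  return eligible;
}

// Toy data: 20 Pathogenic and 19 Benign variants in one made-up gene.
const toyVariants = [];
for (let i = 0; i < 20; i++) toyVariants.push({ gene: "GENE_A", label: "P", pos: 100 + i });
for (let i = 0; i < 19; i++) toyVariants.push({ gene: "GENE_A", label: "B", pos: 10 * i + 1 });

const eligibleGroups = positionsByGeneLabel(toyVariants);
console.log([...eligibleGroups.keys()]); // [ 'GENE_A|P' ]
```

Because eligibility is decided per label, the same gene can be eligible as Pathogenic but not as Benign, as in the toy data above.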
2.3 Clustering metric
For each (gene, label) pair, compute the inter-quartile range (IQR) of variant positions divided by the protein length:

$$\mathrm{IQR}/L = \frac{Q_3 - Q_1}{L}$$

where Q1 = 25th percentile and Q3 = 75th percentile of the variant-position distribution, and L is the protein length.
Interpretation:
- IQR/L → 0: all variants concentrated in a narrow region (highly clustered "hotspot" pattern).
- IQR/L → 0.5: variants uniformly distributed across the protein.
- IQR/L > 0.5: variants over-distributed at the protein extremes (rare for functional genes).
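A minimal sketch of the metric and its two limiting cases follows. The percentile convention is not specified in the text; linear interpolation between order statistics is assumed here, and the function names and toy positions are hypothetical.

```javascript
// Quantile with linear interpolation between order statistics (assumed convention).
function quantile(sorted, q) {
  const idx = (sorted.length - 1) * q;
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
}

// IQR of the variant positions divided by the protein length.
function iqrOverLength(positions, proteinLength) {
  const sorted = [...positions].sort((a, b) => a - b);
  return (quantile(sorted, 0.75) - quantile(sorted, 0.25)) / proteinLength;
}

// Uniform spread: one variant at every residue of a 1,000-residue protein.
const uniform = Array.from({ length: 1000 }, (_, i) => i + 1);
// Hotspot: 20 variants confined to residues 498-502 of the same protein.
const hotspot = Array.from({ length: 20 }, (_, i) => 498 + (i % 5));

console.log(iqrOverLength(uniform, 1000)); // close to 0.5
console.log(iqrOverLength(hotspot, 1000)); // close to 0.002
```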
2.4 Per-class distribution
Compute mean, median, and binned distribution of per-gene IQR/L for Pathogenic and Benign separately.
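The per-class summary could be computed along these lines. The function name `summarize` and the output field names are hypothetical; the bin thresholds (0.05, 0.10, 0.40) are the ones used in the Results tables.

```javascript
// Summarize a list of per-gene IQR/L values for one label class.
function summarize(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((sum, v) => sum + v, 0) / n;
  const mid = Math.floor(n / 2);
  const median = n % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  const fraction = (pred) => sorted.filter(pred).length / n;
  return {
    n,
    mean,
    median,
    extreme: fraction((v) => v < 0.05),   // extreme cluster
    clustered: fraction((v) => v < 0.10), // highly clustered
    broad: fraction((v) => v >= 0.40),    // broad spread
  };
}

// Toy per-gene values, not real data.
const toySummary = summarize([0.02, 0.08, 0.30, 0.45, 0.50]);
console.log(toySummary);
```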
3. Results
3.1 Per-gene IQR/L distribution
| Metric | Pathogenic | Benign |
|---|---|---|
| n eligible genes | 709 | 1,416 |
| Mean IQR/L | 0.361 | 0.455 |
| Median IQR/L | 0.367 | 0.456 |
Pathogenic genes have ~21% lower mean IQR/L than Benign genes (0.361 vs 0.455). The per-gene clustering is consistently tighter for Pathogenic.
3.2 The fraction of "highly clustered" genes (IQR/L < 0.10)
| Statistic | Pathogenic | Benign |
|---|---|---|
| Genes with IQR/L < 0.05 (extreme cluster) | 11 (1.55%) | 2 (0.14%) |
| Genes with IQR/L < 0.10 (highly clustered) | 47 (6.63%) | 12 (0.85%) |
| Wilson 95% CI for clustered fraction | [5.02, 8.70] | [0.49, 1.48] |
Pathogenic genes are 7.82× more likely than Benign genes to be highly clustered (6.63% / 0.85%). The extreme-cluster ratio is even higher: 11.07× (1.55% / 0.14%). Wilson 95% CIs for the IQR/L < 0.10 fraction are non-overlapping by ~3.5 pp.
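The intervals and the ratio can be re-derived from the counts in the table with the standard Wilson score formula. The sketch below is not taken from `analyze.js`; the function name `wilsonInterval` and the choice z = 1.959964 are assumptions.

```javascript
// Wilson score interval for a binomial proportion k/n.
function wilsonInterval(k, n, z = 1.959964) {
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [center - half, center + half];
}

const pathCI = wilsonInterval(47, 709);    // Pathogenic: 47 of 709 genes
const benignCI = wilsonInterval(12, 1416); // Benign: 12 of 1,416 genes
const ratio = (47 / 709) / (12 / 1416);

const pct = (x) => (100 * x).toFixed(2);
console.log(pathCI.map(pct));   // rounds to the reported [5.02, 8.70]
console.log(benignCI.map(pct)); // rounds to the reported [0.49, 1.48]
console.log(ratio.toFixed(2));  // 7.82
```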
3.3 The fraction of "broadly spread" genes (IQR/L ≥ 0.40)
| Statistic | Pathogenic | Benign |
|---|---|---|
| Genes with IQR/L ≥ 0.40 | 300 (42.32%) | 972 (68.64%) |
| Genes with IQR/L ≥ 0.60 | 46 (6.49%) | 179 (12.64%) |
Benign genes are 1.62× more likely than Pathogenic genes to be broadly spread (68.64% / 42.32%). The asymmetry mirrors the clustering finding: Benign variants distribute close to uniformly, whereas Pathogenic variants concentrate in functional regions.
3.4 The most-clustered Pathogenic genes
The 30 most-clustered Pathogenic genes (all with IQR/L ≤ 0.075):
| Gene | n | Protein length | IQR/L | Known hotspot biology |
|---|---|---|---|---|
| SETBP1 | 25 | 1,596 | 0.004 | Schinzel-Giedion: codons 868-871 (SKI/SnoN-binding) |
| PPP2R1A | 21 | 589 | 0.008 | PP2A scaffold: HEAT repeats 5/7 |
| ZEB2 | 26 | 1,102 | 0.018 | Mowat-Wilson: SBD/SID domain |
| FUS | 25 | 526 | 0.019 | ALS: C-terminal NLS cluster |
| APP | 28 | 770 | 0.030 | Alzheimer's: β/γ-secretase cleavage sites |
| COL10A1 | 26 | 680 | 0.040 | Schmid metaphyseal chondrodysplasia: NC1 domain |
| CBL | 34 | 906 | 0.041 | Noonan-like: TKB/RING linker |
| ZBTB20 | 59 | 741 | 0.045 | Primrose syndrome: zinc-finger cluster |
| KCNQ4 | 31 | 695 | 0.045 | DFNA2 deafness: pore region |
| MYT1L | 29 | 1,146 | 0.046 | MYT1L syndrome: zinc-finger cluster |
| MERTK | 20 | 999 | 0.047 | Retinitis pigmentosa: kinase domain |
| TFAP2A | 28 | 328 | 0.055 | Branchiooculofacial: DBD |
| SETD1B | 20 | 1,966 | 0.056 | DEE: SET domain |
| YY1 | 25 | 414 | 0.060 | Gabriele-de Vries: zinc-finger cluster |
| COL6A1 | 67 | 1,028 | 0.061 | Bethlem/Ullrich: triple-helix cluster |
| MEF2C | 31 | 473 | 0.061 | MEF2C haploinsufficiency: MADS/MEF2 box |
| CSF1R | 44 | 972 | 0.062 | Leukoencephalopathy: kinase domain |
| DEAF1 | 50 | 565 | 0.065 | DEAF1 syndrome: SAND domain |
| FOXF1 | 24 | 379 | 0.069 | ACDMPV: forkhead DBD |
| GATA2 | 56 | 480 | 0.069 | MonoMAC: zinc-finger cluster |
| PNPLA1 | 29 | 533 | 0.069 | Ichthyosis: patatin domain |
| FLT4 | 30 | 1,363 | 0.070 | Lymphedema: kinase domain |
| TCF4 | 51 | 667 | 0.070 | Pitt-Hopkins: bHLH DBD |
| SOX10 | 53 | 466 | 0.071 | Waardenburg: HMG-box DBD |
| POLE | 22 | 2,286 | 0.071 | Cancer: exonuclease domain |
| GARS | 38 | 739 | 0.072 | CMT: catalytic domain |
| FGFR2 | 69 | 821 | 0.072 | Apert: Ig-like / kinase domain |
| PRKCG | 38 | 697 | 0.073 | SCA14: C1B / kinase domain |
| CTCF | 38 | 727 | 0.074 | CTCF syndrome: zinc-finger cluster |
| ZBTB18 | 30 | 522 | 0.075 | Intellectual disability: zinc-finger cluster |
The list is dominated by:
- Transcription factors with DBD clusters: TFAP2A, YY1, MEF2C, FOXF1, GATA2, TCF4, SOX10, CTCF, ZBTB18, ZBTB20, MYT1L, DEAF1.
- Receptor / signaling kinases: MERTK, CSF1R, FLT4, FGFR2, PRKCG.
- Specific functional motifs: SETBP1 (degron, codons 868-871), PPP2R1A (HEAT repeats), CBL (TKB/RING), GARS (catalytic domain), POLE (exonuclease).
- Structural protein domains: COL10A1 (NC1), COL6A1 (triple-helix).
These genes have well-documented Pathogenic-variant hotspots in the OMIM and disease-gene literature. The genome-wide IQR/L analysis quantifies the hotspot pattern systematically.
3.5 The Benign comparison: 12 highly-clustered Benign genes
Only 12 of 1,416 Benign genes have IQR/L < 0.10 (0.85%). These rare Benign-clustering cases likely reflect regions rich in common population variants or targeted genotyping of specific regions rather than functional hotspots.
The 7.82× Pathogenic-vs-Benign clustering ratio is consistent with the functional-concentration mechanism: Pathogenic variants cluster because disease-causing positions concentrate in critical functional regions, whereas Benign variants spread out because population variants accumulate more uniformly across the gene.
3.6 Implications for variant-prioritization
The per-gene clustering metric provides a gene-level prior on variant Pathogenicity that complements per-variant predictors:
- For variants in highly-clustered Pathogenic genes (IQR/L < 0.10): variants near the cluster center carry elevated Pathogenicity prior; variants outside the cluster have lower prior.
- For variants in broadly-spread Pathogenic genes (IQR/L ≥ 0.40): the position-within-protein carries less information about Pathogenicity prior.
The per-gene IQR/L is precomputable once per gene and provides a free meta-feature for variant interpretation.
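One way such a meta-feature might be exposed to a downstream predictor is sketched below. Treating the Q1 to Q3 window as a proxy for "near the cluster center" is an assumption of this sketch, not a rule from the analysis, and the function name `clusterFeature` and its output fields are hypothetical.

```javascript
// Quantile with linear interpolation (assumed percentile convention).
function quantile(sorted, q) {
  const idx = (sorted.length - 1) * q;
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
}

// Gene-level feature for a query position, given the gene's Pathogenic positions.
function clusterFeature(pos, pathogenicPositions, proteinLength) {
  const sorted = [...pathogenicPositions].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqrOverL = (q3 - q1) / proteinLength;
  return {
    iqrOverL,
    hotspotGene: iqrOverL < 0.10,           // highly clustered gene
    insideWindow: pos >= q1 && pos <= q3,   // query falls in the central 50% window
  };
}

// Toy gene: 20 Pathogenic variants at residues 498-502 of a 1,000-residue protein.
const toyPositions = Array.from({ length: 20 }, (_, i) => 498 + (i % 5));
console.log(clusterFeature(500, toyPositions, 1000)); // hotspotGene: true, insideWindow: true
console.log(clusterFeature(50, toyPositions, 1000));  // hotspotGene: true, insideWindow: false
```

For genes with IQR/L ≥ 0.40 the window covers a large share of the protein, so `insideWindow` carries little information, in line with the second bullet above.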
3.7 The IQR-based metric is robust
The IQR is a robust statistic that does not require positions to be normally distributed. The IQR/L ratio is dimensionless (a fraction in [0, 1]) and comparable across proteins of different lengths. Alternative clustering metrics (variance, Gini coefficient of position-density) are expected to give qualitatively similar rankings, although individual genes may rank differently (§4.6).
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 The protein-length filter restricts to ≥ 100 residues
Proteins shorter than 100 aa are excluded for stable IQR/L computation. Most disease genes are ≥ 100 aa.
4.3 The ≥ 20-variant per-gene threshold
Genes with < 20 variants of a given label are excluded for stable IQR/L computation. The 709 Pathogenic and 1,416 Benign eligible genes represent the well-curated subset.
4.4 Variant-position-exceeds-protein-length cases excluded
We filter variants with aa.pos > protein length (rare cases of dbNSFP-AFDB isoform mismatch).
4.5 ClinVar curator labels are not gold-standard
Some labels are wrong. The reported clustering rates reflect curator-assigned data.
4.6 The IQR/L metric is one of several clustering measures
Alternative metrics (variance / L², autocorrelation, density-mode count) might give different rankings. We use IQR/L for robustness.
4.7 The per-gene IQR/L does not distinguish multi-modal clustering
A gene with two narrow Pathogenic-variant clusters at opposite ends of the protein would have a large IQR/L (broadly spread) despite being multi-modally clustered. Multi-modal cluster detection requires more sophisticated metrics not used here.
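A toy example makes the limitation concrete. The positions are invented, the percentile convention (linear interpolation) is assumed, and the largest-gap statistic is only an illustrative extra that is not part of the analysis.

```javascript
// Quantile with linear interpolation (assumed percentile convention).
function quantile(sorted, q) {
  const idx = (sorted.length - 1) * q;
  const lo = Math.floor(idx);
  const hi = Math.ceil(idx);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (idx - lo);
}

function iqrOverLength(sorted, proteinLength) {
  return (quantile(sorted, 0.75) - quantile(sorted, 0.25)) / proteinLength;
}

// Illustrative only: largest gap between consecutive variant positions, over L.
function largestGapOverLength(sorted, proteinLength) {
  let gap = 0;
  for (let i = 1; i < sorted.length; i++) gap = Math.max(gap, sorted[i] - sorted[i - 1]);
  return gap / proteinLength;
}

// Two tight clusters at opposite ends of a 1,000-residue protein.
const bimodal = [
  ...Array.from({ length: 10 }, (_, i) => 50 + i),  // residues 50-59
  ...Array.from({ length: 10 }, (_, i) => 940 + i), // residues 940-949
];

console.log(iqrOverLength(bimodal, 1000));        // about 0.89: looks "broadly spread"
console.log(largestGapOverLength(bimodal, 1000)); // about 0.88: one large empty stretch
```

Each cluster spans only 10 residues, yet IQR/L is far above 0.40 because Q1 falls in one cluster and Q3 in the other.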
5. Implications
- Pathogenic missense variants in ClinVar cluster spatially within protein sequences 7.82× more often than Benign variants do (6.63% of Pathogenic genes vs 0.85% of Benign genes have IQR/L < 0.10).
- Mean per-gene Pathogenic IQR/L (0.361) is 21% lower than Benign (0.455) — Pathogenic variants concentrate in narrower protein subregions.
- The most-clustered Pathogenic genes are dominated by transcription factors with DBD clusters, signaling kinases, and well-known hotspot disease genes (SETBP1, PPP2R1A, ZEB2, FUS, APP, CBL, etc.).
- The mechanism is functional concentration: critical functional regions concentrate disease-causing positions while population-genome sequencing identifies Benign variants uniformly.
- For variant-prioritization: per-gene IQR/L is a precomputable meta-feature that quantifies the functional-concentration profile per gene.
6. Limitations
- Stop-gain excluded (§4.1).
- Protein-length ≥ 100 filter (§4.2).
- ≥ 20-variant per-gene threshold restricts to well-curated genes (§4.3).
- Variant-position-exceeds-length cases excluded (§4.4).
- ClinVar labels not gold-standard (§4.5).
- IQR/L is one of several clustering metrics (§4.6).
- Multi-modal clustering not distinguished (§4.7).
7. Reproducibility
- Script: `analyze.js` (Node.js, ~70 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info; AFDB per-residue cache for protein lengths.
- Outputs: `result.json` with Pathogenic / Benign eligible-gene counts, per-class mean / median IQR/L, clustering-fraction with Wilson 95% CIs, and the top-30 most-clustered Pathogenic genes.
- Verification mode: 5 machine-checkable assertions: (a) Pathogenic mean IQR/L < Benign mean; (b) Pathogenic clustered-fraction (< 0.10) > 5%; (c) Benign clustered-fraction < 2%; (d) ratio > 5×; (e) ≥ 30 Pathogenic genes with IQR/L < 0.10.
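The five verification assertions could be checked along these lines. This is a sketch, not the actual `--verify` implementation: the function name `verify` and the field names of the result object are hypothetical, and the example values are the figures reported in Section 3.

```javascript
// Evaluate assertions (a)-(e) against a summary object with hypothetical field names.
function verify(result) {
  const { pathogenic, benign } = result;
  return {
    a: pathogenic.meanIqrL < benign.meanIqrL,                      // Pathogenic mean below Benign mean
    b: pathogenic.clusteredFraction > 0.05,                        // Pathogenic clustered fraction > 5%
    c: benign.clusteredFraction < 0.02,                            // Benign clustered fraction < 2%
    d: pathogenic.clusteredFraction / benign.clusteredFraction > 5, // ratio > 5x
    e: pathogenic.clusteredCount >= 30,                            // >= 30 Pathogenic genes with IQR/L < 0.10
  };
}

// Values taken from the reported results.
const reported = {
  pathogenic: { meanIqrL: 0.361, clusteredCount: 47, clusteredFraction: 47 / 709 },
  benign: { meanIqrL: 0.455, clusteredCount: 12, clusteredFraction: 12 / 1416 },
};

const checks = verify(reported);
const allPass = Object.values(checks).every(Boolean);
console.log(checks, allPass); // every check is true for the reported values
```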
`node analyze.js`
`node analyze.js --verify`
8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Varadi, M., et al. (2022). AlphaFold Protein Structure Database. Nucleic Acids Res. 50, D439–D444.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Vogelstein, B., et al. (2013). Cancer genome landscapes. Science 339, 1546–1558.
- Prior, I. A., Lewis, P. D., & Mattos, C. (2012). A comprehensive survey of Ras mutations in cancer. Cancer Res. 72, 2457–2467.
- Davies, H., et al. (2002). Mutations of the BRAF gene in human cancer. Nature 417, 949–954.
- Karczewski, K. J., et al. (2020). gnomAD constraint spectrum. Nature 581, 434–443.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.