Per-Gene Pathogenic-vs-Benign Variant-Position Distribution Overlap Coefficient Spans 0.00 to 0.91 Across 915 ClinVar-Eligible Genes With ≥10 of Each Label: 11.69% (107 Genes) Are Highly-Segregated (Overlap < 0.20) Including TFs CTCF, MAF, RUNX2, KLF1, ZBTB18, SOX11, PAX6, While Recessive Mendelian Genes Are Highly-Mixed (HOGA1 0.91, COL4A1 0.88, ABCA4 0.80, SCN2A 0.80) — Quantifying Per-Gene Disease-Mechanism Position-Distribution Architecture
Abstract
We compute the per-gene 1D position-distribution overlap coefficient between Pathogenic and Benign variant positions in ClinVar (Landrum et al. 2018) missense variants in dbNSFP v4 (Liu et al. 2020) via MyVariant.info (Wu et al. 2021); stop-gain alt = X excluded; variants mapped to canonical AFDB (Varadi et al. 2022) protein structure with length normalization. For each gene with ≥ 10 Pathogenic AND ≥ 10 Benign variants, we bin variant positions into 10 equal-length protein deciles and compute the overlap coefficient = sum across deciles of min(P-fraction, B-fraction). The overlap coefficient ranges from 0 (P and B in disjoint protein regions) to 1 (P and B identically distributed).
| Statistic | Value |
|---|---|
| Eligible genes (≥ 10 P AND ≥ 10 B) | 915 |
| Mean per-gene P-vs-B overlap coefficient | 0.436 |
| Median | 0.445 |
| Overlap range | Gene count | % |
|---|---|---|
| < 0.20 (highly segregated) | 107 | 11.69% |
| 0.20-0.40 | 271 | 29.62% |
| 0.40-0.60 | 361 | 39.45% |
| 0.60-0.80 | 166 | 18.14% |
| ≥ 0.80 (highly mixed) | 10 | 1.09% |
Result: 11.69% of ClinVar-eligible genes (107 of 915) show highly-segregated P-vs-B distributions (overlap < 0.20): Pathogenic and Benign variants occupy largely different protein regions. The most-segregated genes (overlap ≤ 0.10) include transcription factors (CTCF, MAF, RUNX2, KLF1, ZBTB18, SOX11, PAX6, TFE3, DPF2), signaling proteins (PIK3R1, MAP2K1, GNAS, SH3BP2), and ion channels / cytoskeletal proteins (KCNB1, KCNQ4, KRT10, KRT17, KIF1A, INF2). The most-mixed genes (overlap ≥ 0.80, 10 genes) include recessive Mendelian disease genes (HOGA1 0.91, PC 0.84, IFT140 0.83, GALC 0.83, ABCA4 0.80), type IV collagens (COL4A1 0.88, COL4A5 0.81), a channelopathy gene (SCN2A 0.80), and other dominant genes (PRKCG 0.91, STXBP1 0.83); ENG (0.798) and COL4A4 (0.784) fall just below the threshold. Mechanism: the disease-mechanism architecture differs by gene class:
- Highly-segregated genes are predominantly dominant TFs / signaling proteins where Pathogenic variants concentrate in functional domains (DBD, kinase activation loop) and Benign variants accumulate in disordered linkers / C-terminal extensions — the two classes are spatially exclusive.
- Highly-mixed genes are predominantly autosomal-recessive Mendelian genes where Pathogenic and Benign variants distribute throughout the gene because any LoF variant suffices for biallelic disease (Pathogenic) and population-genome variants accumulate everywhere (Benign).
For variant-prioritization: the per-gene P-vs-B overlap coefficient is a disease-mechanism-classifier signature. Low-overlap genes warrant position-based prioritization (variants in the Pathogenic-cluster region carry elevated prior); high-overlap genes require non-positional features (predictor scores, conservation) since position alone does not discriminate. The metric is precomputable from ClinVar metadata and complements per-gene clustering analysis by introducing the cross-class comparison dimension.
1. Background
The standard per-gene variant analyses focus on single-class clustering (where do Pathogenic variants cluster?) or paired comparisons (do Pathogenic variants lie at higher pLDDT than Benign?). The 2-distribution-overlap analysis is a complementary metric: it quantifies how spatially separated the two label classes are within a single protein.
The overlap coefficient (Inman & Bradley 1989) is a standard distributional-overlap metric:
$$\mathrm{OVL} = \sum_{i=1}^{10} \min(p_i, b_i)$$
where p_i and b_i are the per-decile fractions of Pathogenic and Benign variants in decile i. Range [0, 1]: 0 = no overlap (disjoint regions); 1 = identical distributions.
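As a hypothetical illustration (the fractions below are invented and do not come from any gene in this analysis), suppose all Pathogenic variants fall in deciles 2 and 3 while Benign variants fall mostly in deciles 8 to 10. Only decile 3 contains both labels, so it alone contributes to the sum:

$$
\begin{aligned}
p &= (0,\ 0.5,\ 0.5,\ 0,\ 0,\ 0,\ 0,\ 0,\ 0,\ 0) \\
b &= (0,\ 0,\ 0.1,\ 0,\ 0,\ 0,\ 0,\ 0.3,\ 0.3,\ 0.3) \\
\sum_{i=1}^{10} \min(p_i, b_i) &= \min(0.5,\ 0.1) = 0.1
\end{aligned}
$$

An overlap of 0.1 would place such a gene in the highly-segregated group (overlap < 0.20).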
This paper computes the per-gene overlap coefficient and identifies the gene-class enrichments at the segregated and mixed extremes.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info, with dbNSFP v4 annotation.
- 20,228 human canonical UniProt accessions with AFDB structures (length ≥ 100).
- For each variant: extract `dbnsfp.aa.ref`, `dbnsfp.aa.alt`, `dbnsfp.aa.pos`, `dbnsfp.uniprot`, `dbnsfp.genename`.
- Exclude stop-gain (`alt = X`) and same-AA records.
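The two exclusion rules above can be sketched as a small filter. This is a minimal illustration, not the code in `analyze.js`: the flat record shape (`gene`, `ref`, `alt`, `pos`) and the function name `isMissense` are assumptions made for the example, and real MyVariant.info responses nest these values under `dbnsfp` and may return arrays.

```javascript
// Hypothetical flattened record: { gene, ref, alt, pos }.
// Returns true only for records that survive the two exclusion rules.
function isMissense(rec) {
  const { ref, alt, pos } = rec;
  // Drop records with incomplete amino-acid annotation.
  if (!ref || !alt || !Number.isInteger(pos)) return false;
  // Rule 1: stop-gain variants are annotated with alt = X.
  if (alt === "X") return false;
  // Rule 2: same-AA records (reference and alternate residue identical).
  if (ref === alt) return false;
  return true;
}

// Invented example records, for illustration only.
const exampleRecords = [
  { gene: "GENE_A", ref: "R", alt: "W", pos: 128 }, // kept
  { gene: "GENE_A", ref: "R", alt: "X", pos: 240 }, // stop-gain, dropped
  { gene: "GENE_A", ref: "G", alt: "G", pos: 18 },  // same AA, dropped
];
const keptRecords = exampleRecords.filter(isMissense);
console.log(keptRecords.length);
```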
2.2 Per-gene aggregation
For each (gene, label) pair, collect variant positions. Restrict to genes with ≥ 10 Pathogenic AND ≥ 10 Benign variants and a canonical AFDB-cached structure.
After filtering: 915 genes retained.
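A minimal sketch of the aggregation and eligibility step follows. The input shape (`gene`, `label`, `pos`), the `"P"`/`"B"` label codes, and the function names are assumptions for illustration; the presence of a protein length stands in for "has a canonical AFDB structure".

```javascript
// Group variant positions by gene and label ("P" or "B").
function groupByGene(variants) {
  const genes = new Map();
  for (const { gene, label, pos } of variants) {
    if (label !== "P" && label !== "B") continue; // ignore other labels
    if (!genes.has(gene)) genes.set(gene, { P: [], B: [] });
    genes.get(gene)[label].push(pos);
  }
  return genes;
}

// Keep genes with at least minPerLabel variants of each label
// and a known protein length (hypothetical proxy for an AFDB entry).
function eligibleGenes(genes, proteinLengths, minPerLabel = 10) {
  const kept = [];
  for (const [gene, { P, B }] of genes) {
    const enough = P.length >= minPerLabel && B.length >= minPerLabel;
    if (enough && proteinLengths.has(gene)) kept.push(gene);
  }
  return kept;
}

// Toy data with a lowered threshold of 2 so the example stays short.
const toyVariants = [
  { gene: "G1", label: "P", pos: 10 }, { gene: "G1", label: "P", pos: 12 },
  { gene: "G1", label: "B", pos: 90 }, { gene: "G1", label: "B", pos: 95 },
  { gene: "G2", label: "P", pos: 5 },  { gene: "G2", label: "B", pos: 7 },
];
const toyLengths = new Map([["G1", 100], ["G2", 100]]);
const toyEligible = eligibleGenes(groupByGene(toyVariants), toyLengths, 2);
console.log(toyEligible);
```

In the toy data only `G1` has two variants of each label, so it is the only gene retained.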
2.3 Per-decile binning and overlap coefficient
For each gene with protein length L:
- Bin each variant position into 1 of 10 equal-length protein deciles: decile = floor((pos-1) / L × 10), capped at 9.
- Compute per-decile fraction for each label: p_i = #Pathogenic in decile i / nP; b_i = #Benign in decile i / nB.
- Overlap coefficient = sum over 10 deciles of min(p_i, b_i).
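The three steps above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the contents of `analyze.js`: positions are assumed to be 1-based integers, and the function names are hypothetical. The decile index is computed as `floor((pos - 1) * 10 / L)`, which is algebraically the same as the formula in the text but multiplies before dividing to avoid floating-point rounding at bin boundaries.

```javascript
// Decile index 0..9 for a 1-based residue position in a protein of given length.
function decileOf(pos, length) {
  return Math.min(9, Math.floor(((pos - 1) * 10) / length));
}

// Fraction of variants falling in each of the 10 deciles.
function decileFractions(positions, length) {
  const counts = new Array(10).fill(0);
  for (const pos of positions) counts[decileOf(pos, length)] += 1;
  return counts.map((c) => c / positions.length);
}

// Overlap coefficient: sum over deciles of min(p_i, b_i).
function overlapCoefficient(pPositions, bPositions, length) {
  const p = decileFractions(pPositions, length);
  const b = decileFractions(bPositions, length);
  let sum = 0;
  for (let i = 0; i < 10; i++) sum += Math.min(p[i], b[i]);
  return sum;
}

// Toy cases on a 100-residue protein (invented positions).
const disjoint = overlapCoefficient([1, 2, 3, 4], [91, 95, 99, 100], 100);
const identical = overlapCoefficient([1, 2, 3, 4], [1, 2, 3, 4], 100);
const half = overlapCoefficient([5, 95], [5, 6], 100);
console.log(disjoint, identical, half);
```

The three toy cases cover the two ends of the range and one intermediate value: Pathogenic and Benign in different deciles, in the same decile, and sharing half of the Pathogenic mass.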
2.4 Distribution analysis
Tabulate the per-gene overlap distribution. Identify the extremes (most-segregated overlap < 0.20; most-mixed overlap ≥ 0.80).
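A sketch of the tabulation, assuming the per-gene overlaps are available as a plain array of numbers. The range labels follow the tables in this paper; the function and constant names are hypothetical.

```javascript
// Upper edges of the first four ranges; values >= 0.80 fall in the last range.
const RANGE_EDGES = [0.2, 0.4, 0.6, 0.8];
const RANGE_LABELS = ["< 0.20", "0.20-0.40", "0.40-0.60", "0.60-0.80", ">= 0.80"];

// Count genes per overlap range and report each count as a percentage.
function tabulateOverlaps(overlaps) {
  const counts = new Array(RANGE_LABELS.length).fill(0);
  for (const value of overlaps) {
    let k = RANGE_EDGES.findIndex((edge) => value < edge);
    if (k === -1) k = RANGE_LABELS.length - 1; // at or above the last edge
    counts[k] += 1;
  }
  return RANGE_LABELS.map((label, i) => ({
    label,
    count: counts[i],
    percent: (100 * counts[i]) / overlaps.length,
  }));
}

// Invented overlap values, chosen to exercise the range boundaries.
const toyTable = tabulateOverlaps([0.0, 0.19, 0.2, 0.45, 0.8, 0.91]);
console.log(toyTable);
```

Each range is closed on the left and open on the right, so a gene with overlap exactly 0.20 is counted in 0.20-0.40 and a gene with exactly 0.80 is counted as highly mixed.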
3. Results
3.1 The overlap coefficient distribution
| Overlap range | Count | % |
|---|---|---|
| < 0.20 (highly segregated) | 107 | 11.69% |
| 0.20-0.40 | 271 | 29.62% |
| 0.40-0.60 | 361 | 39.45% |
| 0.60-0.80 | 166 | 18.14% |
| ≥ 0.80 (highly mixed) | 10 | 1.09% |
Mean overlap: 0.436. Median: 0.445. The distribution is roughly unimodal and centered near 0.45, with a heavier lower tail (107 genes below 0.20) than upper tail (10 genes at or above 0.80).
3.2 The 30 most-segregated genes (overlap ≤ 0.10)
| Gene | Overlap | nP | nB | Disease class |
|---|---|---|---|---|
| RNASEH2B | 0.000 | 11 | 10 | Aicardi-Goutieres |
| AMPD2 | 0.000 | 16 | 14 | PCH9 |
| GNAS | 0.028 | 72 | 24 | McCune-Albright, GNAS-related disorders |
| MAF | 0.040 | 25 | 13 | Cataracts (TF) |
| RUNX2 | 0.050 | 40 | 10 | Cleidocranial dysplasia (TF) |
| LOX | 0.050 | 10 | 20 | Aortopathy |
| SPECC1L | 0.051 | 12 | 39 | Facial clefting |
| CTCF | 0.053 | 38 | 10 | CTCF intellectual disability (TF) |
| GJA3 | 0.061 | 33 | 11 | Cataracts |
| KLF1 | 0.063 | 11 | 16 | β-thalassemia (TF) |
| SASH1 | 0.063 | 13 | 16 | Lentiginosis |
| DNAJB6 | 0.067 | 15 | 17 | Limb-girdle myopathy |
| MAK | 0.071 | 14 | 16 | Retinitis pigmentosa |
| PIK3R1 | 0.071 | 12 | 14 | SHORT syndrome |
| ZBTB18 | 0.072 | 30 | 26 | Intellectual disability (TF) |
| SOX11 | 0.072 | 56 | 37 | Coffin-Siris (TF) |
| KCNB1 | 0.076 | 87 | 145 | Epileptic encephalopathy |
| KRT10 | 0.077 | 23 | 26 | Epidermolytic hyperkeratosis |
| TFE3 | 0.077 | 16 | 13 | TFE3-related disorder (TF) |
| DPF2 | 0.077 | 12 | 26 | Coffin-Siris (TF) |
| KIF1A | 0.083 | 146 | 273 | KIF1A-related disorders |
| KRT17 | 0.083 | 12 | 21 | Pachyonychia |
| INF2 | 0.086 | 50 | 210 | FSGS |
| PAX6 | 0.088 | 92 | 13 | Aniridia (TF) |
| CDKL5 | 0.089 | 139 | 104 | CDKL5 epileptic encephalopathy |
| SMARCE1 | 0.091 | 12 | 110 | Coffin-Siris |
| KCNQ4 | 0.097 | 31 | 16 | DFNA2 deafness |
| MAP2K1 | 0.100 | 45 | 10 | Cardiofaciocutaneous (RAS pathway) |
| SH3BP2 | 0.100 | 11 | 30 | Cherubism |
| PUF60 | 0.100 | 21 | 10 | Verheij syndrome |
The list is dominated by transcription factors and chromatin regulators (CTCF, MAF, RUNX2, KLF1, ZBTB18, SOX11, PAX6, TFE3, DPF2, SMARCE1): 10 of the top 30. Other classes: signaling (PIK3R1, MAP2K1, GNAS, SH3BP2), ion channels (KCNB1, KCNQ4), cytoskeletal (KIF1A, KRT10, KRT17, INF2), a kinase (CDKL5), and an RNase H2 subunit (RNASEH2B).
3.3 The 20 most-mixed genes (overlap ≥ 0.77)
| Gene | Overlap | nP | nB | Disease class |
|---|---|---|---|---|
| HOGA1 | 0.913 | 62 | 10 | Primary hyperoxaluria (recessive) |
| PRKCG | 0.911 | 45 | 11 | SCA14 (dominant kinase) |
| COL4A1 | 0.875 | 251 | 75 | Brain SVD |
| PC | 0.836 | 28 | 60 | Pyruvate carboxylase deficiency (recessive) |
| IFT140 | 0.831 | 36 | 161 | Ciliopathy (recessive) |
| STXBP1 | 0.830 | 120 | 74 | DEE (dominant) |
| GALC | 0.828 | 123 | 25 | Krabbe disease (recessive) |
| COL4A5 | 0.809 | 145 | 35 | X-linked Alport |
| ABCA4 | 0.801 | 776 | 63 | Stargardt disease (recessive) |
| SCN2A | 0.801 | 430 | 80 | Channelopathy |
| ENG | 0.798 | 112 | 151 | HHT |
| PDE6C | 0.798 | 28 | 21 | Achromatopsia (recessive) |
| PROM1 | 0.797 | 18 | 34 | Retinal dystrophy |
| RPGRIP1 | 0.790 | 16 | 33 | LCA |
| COL4A4 | 0.784 | 319 | 105 | Alport (recessive) |
| TBC1D24 | 0.782 | 58 | 24 | DEE / DOORS |
| SH3TC2 | 0.778 | 27 | 117 | CMT4C |
| PRF1 | 0.778 | 70 | 18 | HLH |
| SLC2A1 | 0.777 | 139 | 27 | GLUT1 deficiency |
| FANCI | 0.774 | 10 | 38 | Fanconi anemia |
The list is dominated by recessive Mendelian genes (HOGA1, PC, IFT140, GALC, ABCA4, PDE6C, PROM1, RPGRIP1, COL4A4, SH3TC2, FANCI), together with the type IV collagens COL4A1 and COL4A5 and dominant genes such as SCN2A, STXBP1, and SLC2A1, where the disease mechanism allows pathogenic variants throughout the protein.
3.4 The mechanism: disease-mechanism architecture
The per-gene overlap coefficient is a disease-mechanism position-distribution signature:
Low-overlap (segregated): typically dominant TF / signaling-pathway genes. Pathogenic variants concentrate in specific functional domains (DBD for TFs, kinase domain for signaling) where focused-research curation produces a clustered Pathogenic distribution. Benign variants accumulate in disordered linkers / activation domains that are population-variable. The two classes are spatially exclusive.
High-overlap (mixed): typically recessive Mendelian disease genes (collagens, ABCA4, GALC) and dominant channelopathies (SCN2A). For recessive genes, any LoF variant suffices for biallelic disease, so Pathogenic variants distribute across the gene; for dominant channelopathies, variants in functionally-different protein regions all produce phenotype because the channel is sensitive across many residues.
3.5 The 0.436 mean overlap reflects partial mixing
The mean overlap of 0.436 indicates that, on average, ~44% of the per-decile P and B distribution overlaps. The remaining ~56% reflects per-gene-specific architecture — variant-region segregation is the typical pattern but with substantial gene-by-gene variation.
3.6 The 1.09% (10 genes) at overlap ≥ 0.80 are the canonical "mixed-mechanism" Mendelian genes
Only 10 genes reach an overlap of 0.80 or more. These are the cleanest cases, in which Pathogenic and Benign variants follow nearly the same positional distribution. They are typically large disease genes (COL4A1, ABCA4); for the recessive ones, biallelic loss of function anywhere in the protein produces disease.
3.7 Implications for variant-prioritization
For variant-prioritization pipelines:
- In low-overlap genes (CTCF, MAF, RUNX2, etc.): variants in the Pathogenic-cluster region (typically the DBD or active site) carry an elevated Pathogenic prior beyond the per-gene baseline. Position-based prioritization is highly effective.
- In high-overlap genes (COL4A1, ABCA4, SCN2A): position alone provides little discrimination. Per-variant predictor scores (AM, REVEL) and chemistry-class features carry the actionable signal.
The per-gene overlap coefficient is precomputable from a single ClinVar snapshot and provides a per-gene predictor-effectiveness profile.
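One way a pipeline might consume the precomputed coefficient is a simple per-gene lookup, sketched below. The 0.20 and 0.80 cut-offs are the ones used in this paper; the strategy names, the function name, and the treatment of the intermediate range are assumptions for illustration, since the text makes recommendations only for the two extremes.

```javascript
// Map a per-gene overlap coefficient to a coarse prioritization strategy.
function prioritizationStrategy(overlap, low = 0.2, high = 0.8) {
  if (overlap < low) return "positional-prior";         // segregated: position is informative
  if (overlap >= high) return "non-positional-features"; // mixed: rely on predictor scores
  return "combined";                                     // assumption: use both in between
}

// Overlap values taken from the tables in Sections 3.2 and 3.3.
const exampleStrategies = {
  CTCF: prioritizationStrategy(0.053),
  ABCA4: prioritizationStrategy(0.801),
  ENG: prioritizationStrategy(0.798),
};
console.log(exampleStrategies);
```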
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 The decile binning is sequence-position-based
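Position binning into 10 equal-length protein deciles uses sequence position and protein length only. The binning step itself depends on neither the 3D structure nor the curator labels, so it introduces no circularity; the labels enter only when the two per-decile distributions are compared.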
4.3 The ≥10-of-each threshold
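Restricts the analysis to 915 well-curated genes. Genes with fewer than 10 Pathogenic or fewer than 10 Benign missense variants are excluded. Near the threshold, about 10 variants spread over 10 deciles give coarse per-decile fractions, so overlaps for low-count genes (for example RNASEH2B, nP = 11, nB = 10) are noisy.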
4.4 ClinVar curator labels are not gold-standard
Some labels are wrong. The reported overlaps reflect curator-assigned data.
4.5 The disease-class enrichment is post-hoc
The TF / recessive-Mendelian gene-class enrichments at the extremes are post-hoc identifications by gene-name lookup, not pre-specified hypotheses.
4.6 The overlap metric is one of several distributional-comparison statistics
Alternatives include KS-test statistic, Bhattacharyya coefficient, Wasserstein distance. The simple overlap coefficient was chosen for interpretability.
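For comparison, two of the alternatives can be computed from the same per-decile fractions. The sketch below uses the standard definitions (Bhattacharyya coefficient as the sum of sqrt(p_i · b_i); KS statistic as the largest absolute gap between the two cumulative distributions, here evaluated on the binned data rather than on raw positions). The function names and the example vectors are illustrative assumptions and were not part of the reported analysis.

```javascript
// Overlap coefficient on precomputed per-bin fractions.
function overlapFromFractions(p, b) {
  return p.reduce((sum, pi, i) => sum + Math.min(pi, b[i]), 0);
}

// Bhattacharyya coefficient: 1 for identical distributions, 0 for disjoint ones.
function bhattacharyya(p, b) {
  return p.reduce((sum, pi, i) => sum + Math.sqrt(pi * b[i]), 0);
}

// Binned KS statistic: maximum absolute difference between cumulative fractions.
function ksStatistic(p, b) {
  let cdfP = 0;
  let cdfB = 0;
  let maxGap = 0;
  for (let i = 0; i < p.length; i++) {
    cdfP += p[i];
    cdfB += b[i];
    maxGap = Math.max(maxGap, Math.abs(cdfP - cdfB));
  }
  return maxGap;
}

// Invented fractions: the two labels share one of their two occupied deciles.
const pFrac = [0.5, 0.5, 0, 0, 0, 0, 0, 0, 0, 0];
const bFrac = [0, 0.5, 0.5, 0, 0, 0, 0, 0, 0, 0];
const metrics = {
  overlap: overlapFromFractions(pFrac, bFrac),
  bhattacharyya: bhattacharyya(pFrac, bFrac),
  ks: ksStatistic(pFrac, bFrac),
};
console.log(metrics);
```

Unlike the overlap and Bhattacharyya coefficients, the KS statistic is a distance, so larger values mean more segregation.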
4.7 The 2-class comparison ignores VUS
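ClinVar VUS variants are excluded. Because VUS form a separate label class, including them would not change the Pathogenic-vs-Benign overlap reported here; it would add a third distribution to compare. Future reclassification of VUS as Pathogenic or Benign would, however, change the per-gene distributions.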
5. Implications
- Per-gene Pathogenic-vs-Benign variant-position distribution overlap coefficient spans 0.00 to 0.91 across 915 ClinVar-eligible genes (mean 0.436, median 0.445).
- 11.69% of genes (107) show highly-segregated distributions (overlap < 0.20) — predominantly dominant TFs (CTCF, MAF, RUNX2, KLF1, ZBTB18, SOX11, PAX6) and signaling adapters.
- 1.09% of genes (10) show highly-mixed distributions (overlap ≥ 0.80), including recessive Mendelian genes (HOGA1, ABCA4, GALC), type IV collagens (COL4A1, COL4A5), and dominant neurological disease genes (SCN2A, STXBP1).
- The mechanism is disease-mechanism architecture: dominant TF/signaling genes have spatially-exclusive Pathogenic cluster + Benign linker accumulation; recessive Mendelian genes have biallelic LoF distributed across the gene.
- For variant-prioritization: the per-gene overlap coefficient classifies position-based-prioritization-effectiveness; low-overlap genes benefit from positional priors, high-overlap genes need predictor-based features.
6. Limitations
- Stop-gain excluded (§4.1).
- Decile binning is sequence-position-based, non-circular (§4.2).
- ≥10-of-each threshold restricts to 915 well-curated genes (§4.3).
- ClinVar labels not gold-standard (§4.4).
- Disease-class enrichment is post-hoc (§4.5).
- Overlap coefficient is one of several distributional statistics (§4.6).
- VUS excluded from analysis (§4.7).
7. Reproducibility
- Script: `analyze.js` (Node.js, ~50 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info; AFDB structures for protein lengths.
- Outputs: `result.json` with per-gene overlap, distribution histogram, top-30 segregated and top-20 mixed gene lists.
- Verification mode: 5 machine-checkable assertions: (a) ≥800 eligible genes; (b) mean overlap in [0.40, 0.50]; (c) ≥100 segregated (<0.20); (d) ≤20 mixed (≥0.80); (e) at least 5 of top-30 segregated are TFs.
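The five assertions could be expressed roughly as below. This is a sketch, not the actual `--verify` implementation: the schema of `result.json` is not given in the text, so the field names (`nEligible`, `meanOverlap`, `nSegregated`, `nMixed`, `topSegregated`) are hypothetical. The values are the ones reported in this paper, and the TF set is the nine genes marked "(TF)" in Section 3.2.

```javascript
// Hypothetical summary object populated with the values reported in the paper.
const reportedSummary = {
  nEligible: 915,
  meanOverlap: 0.436,
  nSegregated: 107, // overlap < 0.20
  nMixed: 10,       // overlap >= 0.80
  topSegregated: [
    "RNASEH2B", "AMPD2", "GNAS", "MAF", "RUNX2", "LOX", "SPECC1L", "CTCF",
    "GJA3", "KLF1", "SASH1", "DNAJB6", "MAK", "PIK3R1", "ZBTB18", "SOX11",
    "KCNB1", "KRT10", "TFE3", "DPF2", "KIF1A", "KRT17", "INF2", "PAX6",
    "CDKL5", "SMARCE1", "KCNQ4", "MAP2K1", "SH3BP2", "PUF60",
  ],
};

// Genes labeled "(TF)" in the Section 3.2 table.
const TF_GENES = new Set([
  "MAF", "RUNX2", "CTCF", "KLF1", "ZBTB18", "SOX11", "TFE3", "DPF2", "PAX6",
]);

// Evaluate assertions (a) to (e); each entry is true when the check passes.
function verifySummary(summary, tfGenes) {
  const tfCount = summary.topSegregated.filter((g) => tfGenes.has(g)).length;
  return {
    a_eligibleGenes: summary.nEligible >= 800,
    b_meanOverlap: summary.meanOverlap >= 0.4 && summary.meanOverlap <= 0.5,
    c_segregated: summary.nSegregated >= 100,
    d_mixed: summary.nMixed <= 20,
    e_tfInTop30: tfCount >= 5,
  };
}

const checkResults = verifySummary(reportedSummary, TF_GENES);
const allChecksPass = Object.values(checkResults).every(Boolean);
console.log(checkResults, allChecksPass);
```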
node analyze.js
node analyze.js --verify
8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Varadi, M., et al. (2022). AlphaFold Protein Structure Database. Nucleic Acids Res. 50, D439–D444.
- Inman, H. F., & Bradley Jr., E. L. (1989). The overlapping coefficient as a measure of agreement between probability distributions and point estimation of the overlap of two normal densities. Commun. Stat. - Theory and Methods 18, 3851–3874.
- Karczewski, K. J., et al. (2020). gnomAD constraint spectrum. Nature 581, 434–443.
- Lambert, S. A., et al. (2018). The human transcription factors. Cell 172, 650–665.
- Vogelstein, B., et al. (2013). Cancer genome landscapes. Science 339, 1546–1558.
- Adam, M. P., et al. (2022). GeneReviews. University of Washington, Seattle.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.