Per-Gene Stop-Gain-vs-Missense Pathogenicity Gap Identifies a 793-Gene Loss-of-Function-Dominant Disease Subset (63.80% of 1,243 Eligible Genes): Mean Per-Gene Stop-Gain Pathogenic-Fraction Is 98.67% Vs Missense 38.67% (60-pp Gap), With 249 Genes (20.03%) Showing ≥90 pp Gap Including CHAMP1, AMER1, MYO18B, CEP250, NRXN1, LAMC3, PCLO All at 0% Missense Vs 100% Stop-Gain Pathogenic
Abstract
We compute the per-gene Pathogenic-fraction gap between stop-gain and missense variants for ClinVar (Landrum et al. 2018) variants annotated in dbNSFP v4 (Liu et al. 2020) and retrieved via MyVariant.info (Wu et al. 2021). Stop-gain variants are identified as aa.alt = X and aa.ref ≠ X; missense variants as aa.alt ≠ X, aa.ref ≠ X, and aa.ref ≠ aa.alt. The analysis is restricted to 1,243 genes with ≥ 20 missense AND ≥ 10 stop-gain variants. Result:
- Mean per-gene missense Pathogenic-fraction: 38.67%.
- Mean per-gene stop-gain Pathogenic-fraction: 98.67%.
- Mean per-gene gap (stop-gain − missense): 60.00 percentage points.
- Genes with stop-gain − missense gap ≥ 50 pp (LoF-dominant): 793 (63.80%).
- Genes with stop-gain − missense gap ≥ 90 pp (extreme-LoF-dominant): 249 (20.03%).
Twenty-four extreme-LoF-dominant genes have 0.0% missense Pathogenic AND 100.0% stop-gain Pathogenic: CHAMP1, AMER1, COL9A1, ANO6, WAC, MYO18B, ARCN1, SCAF4, CEP250, CENPF, C6, DSG1, NRXN1, LAMC3, C8A, CEP104, PHF21A, TANC2, ADCY10, PCLO, TECPR2, DNAAF2, CNTN2, KIDINS220. PCNT follows at 0.5% / 100.0% (gap 99.5 pp).
Mechanism: these are loss-of-function-dominant Mendelian disease genes in which disease arises from gene knockout / haploinsufficiency. Truncating variants abolish protein function and cause disease (curator-Pathogenic at near-100%), while typical missense substitutions retain partial function and are tolerated (curator-Benign in our sample). The biological interpretation is that, for these genes, the curated pathogenic variants are loss-of-function variants, and the partial-function missense variants in the current sample do not produce phenotype.
For variant-prioritization: in the 793 LoF-dominant genes, a missense variant of unknown significance has a low Pathogenic prior (~5-30%), while a stop-gain variant has a strong Pathogenic prior (~95-100%). The per-gene stop-gain-vs-missense gap is a precomputable LoF-dominance classifier that distinguishes disease-mechanism types and informs variant-class-specific priors.
1. Background
Mendelian disease genes can be classified by their disease mechanism:
- Loss-of-function (LoF) dominant: gene knockout / haploinsufficiency causes disease. Truncating variants (nonsense, frameshift, splice-disrupting) are highly Pathogenic; partial-function missense variants are typically tolerated. Examples: many tumor suppressor genes, transcription factors with dosage sensitivity, hemizygous-male X-linked genes.
- Missense-dominant: specific gain-of-function or dominant-negative missense substitutions cause disease. Truncating variants may be tolerated when a single functional copy of the gene suffices. Examples: oncogene activating mutations (RAS family), dominant-negative missense in some receptors.
- Mixed: both classes contribute to disease.
The per-gene stop-gain-vs-missense Pathogenic-fraction gap is a quantitative classifier of LoF-dominance: large positive gaps indicate LoF-dominant; near-zero gaps indicate similar Pathogenicity for both variant classes.
This paper measures the per-gene gap distribution across 1,243 ClinVar-eligible genes and identifies the LoF-dominant subset.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info, with dbNSFP v4 annotation.
- For each variant: extract dbnsfp.aa.ref, dbnsfp.aa.alt, dbnsfp.genename.
2.2 Variant-class classification
- Stop-gain (nonsense): aa.alt = X AND aa.ref ≠ X.
- Missense: aa.alt ≠ X AND aa.ref ≠ X AND aa.ref ≠ aa.alt.
- Same-AA records excluded.
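The three rules above can be sketched as a single function. This is an illustrative sketch, not the contents of analyze.js; the function name `classifyVariant` and the return labels are hypothetical, and it assumes `aaRef` and `aaAlt` are single-letter amino-acid codes with X denoting a stop.

```javascript
// Classify one amino-acid change using the rules in section 2.2.
// Assumption: aaRef / aaAlt are single-letter codes and "X" means stop.
// Returns 'stop_gain', 'missense', or null (record excluded).
function classifyVariant(aaRef, aaAlt) {
  if (typeof aaRef !== 'string' || typeof aaAlt !== 'string') return null;
  if (aaRef === 'X') return null;          // reference is a stop: neither class applies
  if (aaAlt === 'X') return 'stop_gain';   // aa.alt = X AND aa.ref != X
  if (aaAlt === aaRef) return null;        // same-AA record: excluded
  return 'missense';                       // both non-X and different
}
```

For example, `classifyVariant('R', 'X')` yields `'stop_gain'`, `classifyVariant('R', 'H')` yields `'missense'`, and `classifyVariant('R', 'R')` yields `null`.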
2.3 Per-gene per-class tabulation
Per gene: count msP, msB (missense Pathogenic, Benign) and sgP, sgB (stop-gain Pathogenic, Benign).
2.4 Per-gene gap
Per-gene gap = (stop-gain Pathogenic-fraction) − (missense Pathogenic-fraction).
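Sections 2.3 and 2.4 could be implemented roughly as follows. The record shape (`gene`, `cls`, `label`) and the function names are assumptions made for this sketch; only the counter names msP, msB, sgP, sgB come from the text.

```javascript
// Hypothetical record shape:
//   { gene: 'TOY1', cls: 'missense' | 'stop_gain', label: 'P' | 'B' }
function tabulate(records) {
  const byGene = new Map();
  for (const { gene, cls, label } of records) {
    const prefix = cls === 'missense' ? 'ms' : cls === 'stop_gain' ? 'sg' : null;
    if (prefix === null || (label !== 'P' && label !== 'B')) continue; // skip unclassified
    if (!byGene.has(gene)) byGene.set(gene, { msP: 0, msB: 0, sgP: 0, sgB: 0 });
    byGene.get(gene)[prefix + label] += 1;
  }
  return byGene;
}

// Gap in percentage points: stop-gain P-fraction minus missense P-fraction.
// Both denominators must be non-zero; the eligibility threshold guarantees that.
function perGeneGap({ msP, msB, sgP, sgB }) {
  const msFrac = msP / (msP + msB);
  const sgFrac = sgP / (sgP + sgB);
  return { msFrac, sgFrac, gapPp: 100 * (sgFrac - msFrac) };
}
```

With toy counts of 5 Pathogenic and 15 Benign missense variants and 9 Pathogenic and 1 Benign stop-gain variants, the fractions are 25% and 90%, giving a gap of 65 pp.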
2.5 Eligibility threshold
Restrict to genes with ≥ 20 missense AND ≥ 10 stop-gain variants for stable per-gene fraction estimation.
After filtering: 1,243 genes retained.
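A minimal sketch of the eligibility filter, assuming per-gene counts are held in an object with the msP, msB, sgP, sgB fields from §2.3. The constant and function names are hypothetical.

```javascript
// Thresholds from section 2.5.
const MIN_MISSENSE = 20;
const MIN_STOP_GAIN = 10;

// A gene is eligible when it has enough variants in BOTH classes.
function isEligible({ msP, msB, sgP, sgB }) {
  return msP + msB >= MIN_MISSENSE && sgP + sgB >= MIN_STOP_GAIN;
}

// Keep only eligible genes from a Map of gene -> counts.
function filterEligible(byGene) {
  return new Map([...byGene].filter(([, counts]) => isEligible(counts)));
}
```

Both conditions must hold, so a gene with 200 missense variants but only 9 stop-gain variants is dropped.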
3. Results
3.1 Per-gene aggregate statistics
| Statistic | Value |
|---|---|
| Eligible genes | 1,243 |
| Mean missense P-fraction | 38.67% |
| Mean stop-gain P-fraction | 98.67% |
| Mean gap (sg − ms) | 60.00 pp |
The ~60-pp average gap reflects the dominant pattern: stop-gain variants are nearly always Pathogenic (~99% on average), while missense variants are Pathogenic at ~39% on average (somewhat above the global ~28% rate).
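The aggregate statistics are read here as unweighted means over genes, so each eligible gene counts once regardless of how many variants it has. Under that reading, the mean gap equals the difference of the two class means by linearity, which matches the table (98.67 − 38.67 = 60.00). A sketch, with the input shape and function names assumed for illustration:

```javascript
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// genes: array of { msFrac, sgFrac } with fractions in [0, 1] (hypothetical shape).
function aggregate(genes) {
  return {
    nGenes: genes.length,
    meanMsPct: 100 * mean(genes.map((g) => g.msFrac)),
    meanSgPct: 100 * mean(genes.map((g) => g.sgFrac)),
    meanGapPp: 100 * mean(genes.map((g) => g.sgFrac - g.msFrac)),
  };
}
```

A pooled estimate (total Pathogenic over total variants across all genes) would weight variant-rich genes more heavily and would generally give different numbers.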
3.2 The LoF-dominant gene subsets
| Threshold | Genes meeting | % of 1,243 |
|---|---|---|
| Gap ≥ 50 pp | 793 | 63.80% |
| Gap ≥ 70 pp | 519 | 41.75% |
| Gap ≥ 90 pp | 249 | 20.03% |
63.80% of analyzed genes have stop-gain − missense gap ≥ 50 pp — clear LoF-dominance signature. 20.03% have ≥ 90 pp gap — extreme LoF-dominance.
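The threshold table can be reproduced from the list of per-gene gaps with a short counting routine. The function name and output shape are assumptions for this sketch.

```javascript
// gapsPp: per-gene gaps in percentage points, one entry per eligible gene.
function countAtThresholds(gapsPp, thresholds = [50, 70, 90]) {
  return thresholds.map((t) => {
    const genes = gapsPp.filter((g) => g >= t).length;
    return { threshold: t, genes, percent: (100 * genes) / gapsPp.length };
  });
}

// Arithmetic check of the reported percentages against 1,243 eligible genes.
const reported = [793, 519, 249].map((n) => ((100 * n) / 1243).toFixed(2));
// reported is ['63.80', '41.75', '20.03']
```

The thresholds are nested, so every gene counted at ≥ 90 pp is also counted at ≥ 70 pp and ≥ 50 pp.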
3.3 The extreme-LoF-dominant subset (top 25)
| Gene | ms P-fraction | sg P-fraction | Gap | Disease association |
|---|---|---|---|---|
| CHAMP1 | 0.0% (50) | 100.0% (20) | 100.0 | CHAMP1-related neurodevelopmental disorder |
| AMER1 | 0.0% (118) | 100.0% (14) | 100.0 | Osteopathia striata |
| COL9A1 | 0.0% (26) | 100.0% (25) | 100.0 | Multiple epiphyseal dysplasia |
| ANO6 | 0.0% (21) | 100.0% (14) | 100.0 | Scott syndrome |
| WAC | 0.0% (31) | 100.0% (21) | 100.0 | DESSH syndrome |
| MYO18B | 0.0% (92) | 100.0% (58) | 100.0 | Klippel-Feil syndrome |
| ARCN1 | 0.0% (25) | 100.0% (11) | 100.0 | Short-rib syndrome |
| SCAF4 | 0.0% (23) | 100.0% (10) | 100.0 | SCAF4-related disorder |
| CEP250 | 0.0% (48) | 100.0% (40) | 100.0 | Atypical Usher syndrome |
| CENPF | 0.0% (87) | 100.0% (21) | 100.0 | Stromme syndrome |
| C6 | 0.0% (30) | 100.0% (18) | 100.0 | Complement C6 deficiency |
| DSG1 | 0.0% (41) | 100.0% (17) | 100.0 | Striate palmoplantar keratoderma |
| NRXN1 | 0.0% (49) | 100.0% (21) | 100.0 | Pitt-Hopkins-like syndrome |
| LAMC3 | 0.0% (86) | 100.0% (32) | 100.0 | Cobblestone lissencephaly |
| C8A | 0.0% (21) | 100.0% (16) | 100.0 | Complement C8 deficiency |
| CEP104 | 0.0% (29) | 100.0% (15) | 100.0 | Joubert syndrome |
| PHF21A | 0.0% (36) | 100.0% (11) | 100.0 | Potocki-Shaffer syndrome |
| TANC2 | 0.0% (46) | 100.0% (13) | 100.0 | TANC2-related disorder |
| ADCY10 | 0.0% (40) | 100.0% (13) | 100.0 | Hypercalciuria absorptive 1 |
| PCLO | 0.0% (116) | 100.0% (33) | 100.0 | Pontocerebellar hypoplasia 3 |
| TECPR2 | 0.0% (34) | 100.0% (25) | 100.0 | Spastic paraplegia 49 |
| DNAAF2 | 0.0% (34) | 100.0% (21) | 100.0 | Primary ciliary dyskinesia 10 |
| CNTN2 | 0.0% (32) | 100.0% (14) | 100.0 | Epilepsy myoclonic |
| KIDINS220 | 0.0% (53) | 100.0% (13) | 100.0 | KIDINS220-related disorder |
| PCNT | 0.5% (201) | 100.0% (118) | 99.5 | Microcephalic osteodysplastic primordial dwarfism II |
The top 25 extreme-LoF-dominant genes are dominated by:
- Tumor suppressors (AMER1).
- Centrosome / cilia genes (CEP250, CEP104, CENPF, DNAAF2 — primary ciliopathies).
- Multi-domain scaffold proteins (NRXN1 neurexin, LAMC3 laminin, PCLO piccolo, PCNT pericentrin).
- Ion channels / membrane (ANO6, CNTN2).
- Complement components (C6, C8A — complement deficiencies are recessive LoF).
- Cytoskeletal / structural (COL9A1 collagen, MYO18B myosin, DSG1 desmoglein).
These genes share the haploinsufficiency / homozygous-LoF disease mechanism where partial-function missense variants do not produce phenotype.
3.4 The smallest-gap genes
The smallest-gap genes (gap < 5 pp) include:
- PCSK9: ms 30.7% > sg 5.9% (gap −24.8 pp). PCSK9 is a gain-of-function gene in which missense variants activate the protein and cause hypercholesterolemia; truncating variants are LoF and protective against cardiovascular disease (curated as Benign or even protective).
- GABRD: ms 2.5% > sg 0.0%. Receptor with mostly tolerated variants.
- ALPL: ms 95.0% ≈ sg 94.9%. Hypophosphatasia — missense and stop-gain similarly Pathogenic.
- SMN1: ms 100% = sg 100%. Spinal muscular atrophy — both classes Pathogenic.
- PAH, GCK, GBA, RPE65: missense and stop-gain similarly Pathogenic (disorders in which many missense substitutions are themselves loss-of-function; mostly recessive, although GCK-MODY is dominant).
These genes are non-LoF-dominant in different ways: PCSK9 is gain-of-function-dominant; in PAH/GCK/GBA even missense variants are LoF.
3.5 The mechanism: haploinsufficiency vs gain-of-function classification
The per-gene gap is a quantitative classifier of disease mechanism:
- Large gap (LoF-dominant haploinsufficiency): gene knockout produces phenotype; partial-function missense are tolerated.
- Near-zero gap (recessive-LoF or constitutive disease): any LoF variant (missense or stop-gain) produces phenotype.
- Negative gap (gain-of-function-dominant): missense activate the protein; truncating variants are protective.
The 793-gene LoF-dominant subset (63.80%) and 249-gene extreme-LoF subset (20.03%) provide concrete classifications.
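The three-way reading of the gap can be written as a small lookup. The 50-pp and 90-pp cut-offs come from §3.2. The ±5-pp near-zero band is borrowed from the "gap < 5 pp" criterion in §3.4 and applied symmetrically, and the "intermediate" bucket for gaps between 5 and 50 pp is an assumption of this sketch, not a category defined in the paper.

```javascript
// Map a per-gene gap (percentage points) to a mechanism label.
// nearZeroBand is an assumed symmetric band around zero.
function mechanismClass(gapPp, nearZeroBand = 5) {
  if (gapPp >= 90) return 'extreme-LoF-dominant';
  if (gapPp >= 50) return 'LoF-dominant';
  if (gapPp <= -nearZeroBand) return 'candidate gain-of-function-dominant';
  if (gapPp < nearZeroBand) return 'near-zero gap';
  return 'intermediate';
}
```

With these settings PCNT (99.5 pp) is labeled extreme-LoF-dominant, PCSK9 (−24.8 pp) a gain-of-function candidate, and ALPL (about −0.1 pp) near-zero. A negative gap flags a candidate only; confirming gain of function needs gene-level evidence.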
3.6 Implications for variant-prioritization
For variant-prioritization in LoF-dominant genes:
- A novel stop-gain variant: prior P-fraction ~95-100%. Pathogenic with high confidence.
- A novel missense variant of unknown significance: prior P-fraction ~5-30% (LoF-dominant gene baseline). Benign-leaning, requires manual review.
The per-gene gap is a precomputable disease-mechanism classifier from ClinVar metadata, providing variant-class-specific priors.
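One way to turn per-gene counts into variant-class-specific priors is to smooth the observed Pathogenic fraction so that 0% and 100% rates do not become hard zeros and ones. The add-one (Laplace) smoothing below is an assumption of this sketch and is not described in the paper; the function name is hypothetical.

```javascript
// Smoothed Pathogenic prior for one gene and one variant class.
// pseudo = 1 corresponds to add-one smoothing (an illustrative choice).
function smoothedPrior(pathogenic, benign, pseudo = 1) {
  return (pathogenic + pseudo) / (pathogenic + benign + 2 * pseudo);
}
```

For a gene with 0 Pathogenic out of 50 missense variants this gives 1/52, about 0.019, and for 20 Pathogenic out of 20 stop-gain variants it gives 21/22, about 0.955. Raw fractions would instead give exactly 0 and 1.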
4. Confound analysis
4.1 Variant-class identification by aa.alt = X
Stop-gain variants are aa.alt = X; missense are non-X. Standard dbNSFP convention.
4.2 The ≥ 20 missense + ≥ 10 stop-gain threshold
Genes with sparse data are excluded. The 1,243 retained genes represent the well-curated subset.
4.3 ClinVar curator labels are not gold-standard
Some labels are wrong. The reported per-gene rates reflect curator-assigned data.
4.4 The 0% missense Pathogenic in many LoF-dominant genes may be a sample-size effect
Some of the 0% missense Pathogenic rates may reflect that no Pathogenic missense have been submitted to ClinVar yet, not that none exist. Future curation may identify some.
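The size of this effect can be bounded. If 0 of n sampled missense variants are Pathogenic and the variants are treated as independent draws (an idealization, since ClinVar submissions are not a random sample), the one-sided upper confidence bound on the true Pathogenic fraction p solves (1 − p)^n = α, giving p = 1 − α^(1/n). The sketch below is illustrative and not part of the original analysis.

```javascript
// Exact one-sided upper bound on a proportion when 0 of n events are observed.
// alpha = 0.05 gives a 95% bound.
function zeroCountUpperBound(n, alpha = 0.05) {
  if (!Number.isInteger(n) || n <= 0) throw new RangeError('n must be a positive integer');
  return 1 - Math.pow(alpha, 1 / n);
}
```

For n = 50 the bound is about 5.8%, and for n = 21 it is about 13%. So a 0% observed rate at the eligibility minimum is still compatible with a non-trivial true Pathogenic fraction.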
4.5 The 100% stop-gain Pathogenic in many genes assumes NMD-triggering
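C-terminal NMD-escaping stop-gains may be tolerated in some cases. A per-gene 100% stop-gain Pathogenic rate therefore describes the curated sample and does not rule out tolerated C-terminal truncations; where such cases are curated as Benign, they appear as small-N noise.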
4.6 The disease-association annotations are post-hoc
Gene-disease links cited in §3.3 are from OMIM / GeneReviews lookup, post-hoc to the analysis.
4.7 The PCSK9 gain-of-function pattern is well-known
The PCSK9 gap inversion is a well-documented case (Cohen et al. 2006). Other gain-of-function genes may have similar patterns.
5. Implications
- Per-gene stop-gain-vs-missense Pathogenicity gap is 60.00 pp on average across 1,243 ClinVar-eligible genes (mean ms 38.67%, mean sg 98.67%).
- 63.80% of genes (793) have gap ≥ 50 pp — clear LoF-dominance signature.
- 20.03% of genes (249) have gap ≥ 90 pp — extreme-LoF-dominance with 0% ms / 100% sg pattern.
- The mechanism is haploinsufficiency / homozygous-LoF: partial-function missense are tolerated; truncating LoF variants cause disease.
- For variant-prioritization in LoF-dominant genes: missense variants carry a low Pathogenic prior (~5-30%); stop-gain variants carry a high prior (~95-100%) — variant-class-specific priors should be applied.
6. Limitations
- Stop-gain identified by aa.alt = X standard convention (§4.1).
- ≥ 20 ms + ≥ 10 sg threshold restricts to 1,243 well-curated genes (§4.2).
- ClinVar labels not gold-standard (§4.3).
- 0% ms Pathogenic may reflect curation gap rather than true tolerance (§4.4).
- NMD-escape C-terminal stop-gains may be tolerated (§4.5).
- Disease annotations are post-hoc (§4.6).
- Gain-of-function genes (PCSK9) are exceptions to the LoF-dominance pattern (§4.7).
7. Reproducibility
- Script: analyze.js (Node.js, ~30 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with per-gene ms/sg counts, P-fractions, gap, and the top-30 LoF-dominant gene list.
- Verification mode: 5 machine-checkable assertions: (a) mean ms < 45%; (b) mean sg > 95%; (c) ≥ 700 genes with gap ≥ 50 pp; (d) ≥ 200 genes with gap ≥ 90 pp; (e) total eligible genes > 1,000.
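The five assertions could be checked along the following lines. This is a sketch only: the summary field names are hypothetical and are not taken from result.json, and the actual --verify implementation may differ.

```javascript
// Hypothetical summary shape; field names are illustrative.
function verify(summary) {
  const checks = {
    a_meanMissenseBelow45: summary.meanMsPct < 45,
    b_meanStopGainAbove95: summary.meanSgPct > 95,
    c_atLeast700GapGe50: summary.nGapGe50 >= 700,
    d_atLeast200GapGe90: summary.nGapGe90 >= 200,
    e_moreThan1000Eligible: summary.nEligible > 1000,
  };
  const failed = Object.keys(checks).filter((name) => !checks[name]);
  return { ok: failed.length === 0, failed };
}
```

The values reported in §3 (38.67%, 98.67%, 793, 249, 1,243) satisfy all five checks.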
node analyze.js
node analyze.js --verify
8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Cohen, J. C., Boerwinkle, E., Mosley, T. H., & Hobbs, H. H. (2006). Sequence variations in PCSK9, low LDL, and protection against coronary heart disease. N. Engl. J. Med. 354, 1264–1272.
- Karczewski, K. J., et al. (2020). gnomAD constraint spectrum. Nature 581, 434–443.
- Cassa, C. A., et al. (2017). Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810.
- MacArthur, D. G., et al. (2012). A systematic survey of loss-of-function variants in human protein-coding genes. Science 335, 823–828.
- Wright, C. F., et al. (2018). Paediatric genomics: diagnosing rare disease in children. Nat. Rev. Genet. 19, 253–268.
- Adam, M. P., et al. (2022). GeneReviews. University of Washington, Seattle.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.