Per-Gene AlphaMissense Score-Saturation Rate (Variants With AM ≥ 0.99) Spans 0% to 85.5% Across 1,004 Genes With ≥50 ClinVar Variants: Top-Saturation Genes Are Concentrated in Autosomal-Dominant Developmental-Disorder Transcription Factors and Signaling Molecules (TBL1XR1 85.5%, PAX6 79.3%, PTEN 78.5%, BRAF 69.1%, SOX10 66.7%) — A Per-Gene-Class Predictor-Behavior Asymmetry Spanning 85 Percentage Points
Abstract
We characterize the per-gene rate at which AlphaMissense (AM; Cheng et al. 2023) assigns the maximum-tier score AM ≥ 0.99 ("saturation tier", well above the 0.564 likely-pathogenic threshold) to ClinVar (Landrum et al. 2018) missense single-nucleotide variants annotated in dbNSFP v4 (Liu et al. 2020) and retrieved via MyVariant.info (Wu et al. 2021). Stop-gain records (alt = X) are excluded. Aggregate result: across 263,347 variants with AM scores, 23,966 (9.10%) are at AM ≥ 0.99. Per-gene result: across 1,004 genes with ≥ 50 variants total, the per-gene saturation rate spans an 85-percentage-point range from 0.00% (272 genes; no variants reach AM ≥ 0.99) to 85.53% (TBL1XR1). Selected high-saturation genes:
| Gene | Total | Saturated (AM ≥ 0.99) | Per-gene saturation rate |
|---|---|---|---|
| TBL1XR1 | 76 | 65 | 85.53% |
| PAX6 | 106 | 84 | 79.25% |
| PTEN | 312 | 245 | 78.53% |
| EBF3 | 50 | 39 | 78.00% |
| PAX3 | 79 | 61 | 77.22% |
| PAX2 | 52 | 40 | 76.92% |
| NR2F1 | 71 | 54 | 76.06% |
| DCX | 100 | 73 | 73.00% |
| TGFBR1 | 88 | 63 | 71.59% |
| LMX1B | 57 | 40 | 70.18% |
| DDX3X | 133 | 92 | 69.17% |
| BRAF | 165 | 114 | 69.09% |
| SOX10 | 72 | 48 | 66.67% |
The pattern: top-saturation genes are predominantly dominantly inherited developmental-disorder genes that encode transcriptional regulators and signaling molecules. TBL1XR1, PAX6/2/3, EBF3, NR2F1, LMX1B, NFIX, SOX10, FOXC1, and SOX11 are transcription factors or transcriptional co-regulators, and DNMT3A is a DNA methyltransferase; TGFBR1, BRAF, PRKCG, BTK, and ATP1A3 are signaling or ion-transport proteins. DCX (microtubule-associated), DDX3X (RNA helicase), and EEF1A2 (translation elongation factor) fall outside these two classes. The 13 genes tabulated above (all with saturation rate ≥ 66%) are Mendelian disease genes with dominant or X-linked inheritance (DCX and DDX3X are X-linked) and are typically highly conserved. Conversely, 272 genes (27.1% of eligible genes) have zero variants reaching AM ≥ 0.99; these are mostly autosomal-recessive Mendelian disease genes (GLB1, MEFV, CYP21A2, ABCC6, MUTYH, COL4A3, PKHD1) and population-frequency-rich genes. For variant-prioritization pipelines, the per-gene AM-saturation rate quantifies AM's "confidence-asymmetry profile" per gene: in high-saturation genes AM scores concentrate at the ceiling, so the score may add little beyond gene identity; in low-saturation genes AM never reaches the ceiling, suggesting more conservative scoring.
1. Background
AlphaMissense (Cheng et al. 2023) outputs per-variant Pathogenicity scores in [0, 1]. The score distribution has a ceiling at AM = 0.99-1.0 that is reached by ~9% of variants in the global ClinVar P + B subset. The per-gene rate of variants reaching the AM ceiling quantifies how often AM assigns its maximum confidence within a specific gene.
A gene where most variants reach AM ≥ 0.99 indicates that AM is very confident in calling the gene's variants Pathogenic — a "high-confidence-call" gene. A gene where no variants reach AM ≥ 0.99 indicates that AM is more reserved in this gene — a "moderate-call" gene. Both extremes are informative about AM's per-gene behavior.
This paper characterizes the per-gene AM saturation-rate distribution across the full ClinVar P + B missense subset, identifies the gene-classes at each extreme, and notes the implications for variant-prioritization pipelines.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info, with dbNSFP v4 annotation.
- For each variant: extract dbnsfp.alphamissense.score (max across isoforms) and dbnsfp.genename (first if multi-gene).
- Exclude stop-gain (alt = X) and same-AA records.
- Restrict to records with both an AM score and a non-null gene name.
After filtering: 263,347 variants with AM scores across 14,715 genes.
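The filtering steps in this section can be sketched as a single record-level function. This is a minimal illustration, not the authors' analyze.js: the function names `toArray` and `extractRecord` are hypothetical, and the record layout (a single `dbnsfp.aa` object with `ref` and `alt`, and `score` / `genename` that may be either a scalar or an array) is an assumption about the cached JSON.

```javascript
// Hypothetical helper: normalize a scalar-or-array field to an array.
function toArray(value) {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// Returns { gene, am } for a usable missense record, or null if filtered out.
// The field layout below is an assumption about the cached records.
function extractRecord(rec) {
  const dbnsfp = rec.dbnsfp || {};
  const aa = dbnsfp.aa || {};

  // Exclude stop-gain (alt = X) and same-amino-acid records.
  if (aa.alt === "X") return null;
  if (aa.ref !== undefined && aa.ref === aa.alt) return null;

  // Max AM score across isoforms; missing or non-numeric entries are dropped.
  const scores = toArray((dbnsfp.alphamissense || {}).score).filter(
    (v) => typeof v === "number" && Number.isFinite(v)
  );
  if (scores.length === 0) return null;

  // First gene name when several are reported.
  const gene = toArray(dbnsfp.genename)[0];
  if (!gene) return null;

  return { gene, am: Math.max(...scores) };
}
```

Applying `extractRecord` to every cached record and discarding the `null` results would yield the list of `{ gene, am }` pairs that the later tabulation steps consume.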
2.2 Saturation classification
A variant is saturated if AM ≥ 0.99 (the "highly likely pathogenic" tier in the upper 1% of the score range).
2.3 Per-gene tabulation
For each gene with ≥ 50 variants total, compute:
- tot = total variants with AM scores.
- sat = variants at AM ≥ 0.99.
- Per-gene saturation rate = sat / tot.
After filtering: 1,004 genes with ≥ 50 variants.
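The tabulation in §2.2 and §2.3 amounts to a grouped count followed by an eligibility filter. The sketch below assumes the input is an array of `{ gene, am }` objects; that shape, the function name `perGeneSaturation`, and the constant names are illustrative and not taken from the original script.

```javascript
const SATURATION_THRESHOLD = 0.99; // AM >= 0.99 counts as saturated
const MIN_VARIANTS = 50; // gene-eligibility threshold

// variants: array of { gene, am } objects (assumed shape).
// Returns eligible genes sorted by descending saturation rate.
function perGeneSaturation(
  variants,
  threshold = SATURATION_THRESHOLD,
  minVariants = MIN_VARIANTS
) {
  const counts = new Map();
  for (const { gene, am } of variants) {
    const c = counts.get(gene) || { tot: 0, sat: 0 };
    c.tot += 1;
    if (am >= threshold) c.sat += 1;
    counts.set(gene, c);
  }
  return [...counts.entries()]
    .filter(([, c]) => c.tot >= minVariants)
    .map(([gene, c]) => ({ gene, tot: c.tot, sat: c.sat, rate: c.sat / c.tot }))
    .sort((a, b) => b.rate - a.rate);
}
```

Sorting by rate puts the highest-saturation genes first, which is the order used in the top-25 table.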
3. Results
3.1 Aggregate saturation rate
- 263,347 total variants with AM scores.
- 23,966 (9.10%) saturated at AM ≥ 0.99.
The aggregate saturation rate of 9.10% means that ~1 in 11 ClinVar missense variants gets AM's maximum-confidence score.
3.2 The 1,004-gene per-gene distribution
Per-gene saturation rate distribution across the 1,004 eligible genes:
| Saturation rate range | Gene count | % of eligible genes |
|---|---|---|
| 0% (no AM ≥ 0.99; subset of the 0-10% row) | 272 | 27.1% |
| 0-10% (including the 272 zero-saturation genes) | 662 | 65.9% |
| 10-20% | 134 | 13.3% |
| 20-30% | 68 | 6.8% |
| 30-40% | 53 | 5.3% |
| 40-50% | 47 | 4.7% |
| 50-60% | 16 | 1.6% |
| 60-70% | 14 | 1.4% |
| 70-80% | 9 | 0.9% |
| ≥ 80% | 1 | 0.1% |
The distribution is highly skewed. The modal bin is 0-10% saturation (662 genes, 65.9%); about 14% of genes (140 of 1,004) fall in the bins above 30%; only one gene (0.1%) has a saturation rate ≥ 80%.
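A short consistency check on the histogram: the nine rows from 0-10% through ≥ 80% sum to 1,004, the number of eligible genes, which implies that the 0-10% count of 662 already contains the 272 zero-saturation genes. The sketch below bins a list of per-gene rates the same way and repeats that arithmetic. The half-open bin edges (a rate of exactly 10% lands in 10-20%) are an assumption, since the source does not state its boundary convention, and `saturationHistogram` is a hypothetical name.

```javascript
// rates: per-gene saturation rates in [0, 1].
// bins[0] = [0, 0.1), bins[1] = [0.1, 0.2), ..., bins[8] = [0.8, 1].
// zero counts the genes with rate exactly 0 (a subset of bins[0]).
function saturationHistogram(rates) {
  const bins = new Array(9).fill(0);
  let zero = 0;
  for (const r of rates) {
    if (r === 0) zero += 1;
    const idx = Math.min(Math.floor(r * 10), 8);
    bins[idx] += 1;
  }
  return { zero, bins };
}

// Counts as reported in the table, 0-10% through >= 80%.
const reportedBins = [662, 134, 68, 53, 47, 16, 14, 9, 1];
const reportedTotal = reportedBins.reduce((a, b) => a + b, 0);
// Share of eligible genes in the bins above 30%.
const shareAbove30 = reportedBins.slice(3).reduce((a, b) => a + b, 0) / reportedTotal;
```

Here `reportedTotal` evaluates to 1004 and `shareAbove30` to 140 / 1004, roughly 0.139.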
3.3 The top 25 high-saturation genes
| Gene | Total | Saturated | Saturation rate | Disease association |
|---|---|---|---|---|
| TBL1XR1 | 76 | 65 | 85.53% | TBL1XR1-related neurodevelopmental disorder |
| PAX6 | 106 | 84 | 79.25% | Aniridia, eye disease |
| PTEN | 312 | 245 | 78.53% | Cowden syndrome, PTEN hamartoma |
| EBF3 | 50 | 39 | 78.00% | EBF3-related neurodevelopmental disorder |
| PAX3 | 79 | 61 | 77.22% | Waardenburg syndrome |
| PAX2 | 52 | 40 | 76.92% | Renal-coloboma syndrome |
| NR2F1 | 71 | 54 | 76.06% | Bosch-Boonstra-Schaaf optic atrophy |
| DCX | 100 | 73 | 73.00% | X-linked lissencephaly |
| TGFBR1 | 88 | 63 | 71.59% | Loeys-Dietz syndrome |
| LMX1B | 57 | 40 | 70.18% | Nail-patella syndrome |
| DDX3X | 133 | 92 | 69.17% | X-linked intellectual disability |
| BRAF | 165 | 114 | 69.09% | Cardiofaciocutaneous, RASopathy |
| NFIX | 58 | 39 | 67.24% | Sotos syndrome 2, Marshall-Smith |
| KCNA2 | 67 | 45 | 67.16% | Epileptic encephalopathy |
| TUBB2B | 67 | 45 | 67.16% | Cortical dysplasia |
| SOX10 | 72 | 48 | 66.67% | Waardenburg / PCWH |
| PRKCG | 56 | 37 | 66.07% | Spinocerebellar ataxia |
| PURA | 88 | 57 | 64.77% | PURA syndrome |
| ATP1A3 | 162 | 104 | 64.20% | Alternating hemiplegia |
| FOXC1 | 56 | 35 | 62.50% | Axenfeld-Rieger syndrome |
| MEN1 | 226 | 141 | 62.39% | Multiple endocrine neoplasia |
| EEF1A2 | 69 | 43 | 62.32% | EEF1A2 epileptic encephalopathy |
| DNMT3A | 86 | 53 | 61.63% | Tatton-Brown-Rahman, AML |
| BTK | 129 | 79 | 61.24% | X-linked agammaglobulinemia |
| SOX11 | 93 | 54 | 58.06% | Coffin-Siris syndrome |
The top 25 high-saturation genes are dominated by:
- Transcription factors and transcriptional co-regulators (TBL1XR1, PAX6/2/3, EBF3, NR2F1, LMX1B, NFIX, SOX10/11, FOXC1), many of which bind DNA through highly conserved domains.
- Signaling-pathway molecules (TGFBR1, BRAF, PRKCG, BTK): kinases in the TGF-β, RAS/MAPK, and other signaling pathways.
- DNA-modifying enzymes (DNMT3A): a DNA methyltransferase.
- Ion channels / transporters (KCNA2, ATP1A3): fundamental cellular functions.
- Cytoskeletal / structural (TUBB2B, DCX): neuronal migration and cortical development.
- RNA helicase and tumor suppressor (DDX3X, MEN1).
Most are dominantly inherited Mendelian disease genes; DCX, DDX3X, and BTK are X-linked, as the disease column shows. The listed genes are generally highly conserved.
3.4 The 272 zero-saturation genes
272 of the 1,004 eligible genes (27.1%) have zero variants reaching AM ≥ 0.99. These genes include:
- Autosomal-recessive Mendelian disease genes: GLB1 (β-galactosidase / GM1 gangliosidosis), MEFV (familial Mediterranean fever), CYP21A2 (congenital adrenal hyperplasia).
- Genes with mostly Benign or population-frequency variants: ABCC6, MUTYH, COL4A3, PKHD1, ENG.
- Cardiac-arrhythmia genes: TMEM43.
- Other disease genes: TTR (transthyretin amyloidosis), PRF1 (familial hemophagocytic lymphohistiocytosis).
The pattern: zero-saturation genes are mostly autosomal-recessive or population-frequency-rich genes in which AM does not produce maximum-confidence calls.
3.5 A candidate mechanism: AM training-set composition
The per-gene saturation-rate asymmetry may reflect the per-gene signal available to AM during training:
- Top-saturation genes (dominantly inherited developmental-disorder transcriptional regulators and signaling molecules) may carry a strong "any variant in this gene is Pathogenic" signal, so that AM assigns maximum confidence to most missense changes in these genes.
- Zero-saturation genes (autosomal-recessive Mendelian, population-frequency-rich) may carry a more balanced signal in which many variants are tolerated and few are damaging, so that AM assigns moderate confidence.
Under this reading, the pattern is not a deficiency of AM but a reflection of a per-gene Pathogenicity prior. We do not test this hypothesis directly (§4.5). For variant-prioritization, knowing the per-gene saturation rate informs how to interpret AM scores for that gene.
3.6 Implications for variant-prioritization
For variant-prioritization pipelines using AM:
- Top-saturation genes (rate ≥ 50%): AM essentially predicts "Pathogenic" for the majority of variants. The AM score adds little information beyond gene identity. Other features (REVEL, conservation, family history) carry the actionable variant-level signal.
- Zero-saturation genes: AM's score is informative across the full distribution; even the top-AM variants in these genes have moderate (< 0.99) scores. AM adds substantial variant-level information.
- Intermediate genes: standard AM-score interpretation applies.
The per-gene saturation rate is a precomputable meta-feature that captures AM's per-gene-class confidence asymmetry.
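One way to use the rate as a meta-feature is to map each gene to one of the three interpretation tiers described above. The sketch below is illustrative only: the tier labels and the names `interpretationTier` and `rateByGene` are hypothetical, while the cutoffs (rate ≥ 50% and rate = 0) and the example rates come from the text and tables.

```javascript
// Map a per-gene saturation rate to the interpretation tier used in this section.
function interpretationTier(rate) {
  if (rate >= 0.5) return "top-saturation"; // AM score adds little beyond gene identity
  if (rate === 0) return "zero-saturation"; // AM never reaches the ceiling
  return "intermediate"; // standard AM-score interpretation
}

// Precomputed lookup; the rates shown are taken from the per-gene tables.
const rateByGene = new Map([
  ["TBL1XR1", 65 / 76],
  ["BRAF", 114 / 165],
  ["GLB1", 0],
]);

// Annotate a variant with its gene-level tier; genes that did not meet
// the n >= 50 eligibility threshold have no precomputed rate.
function annotate(variant) {
  const rate = rateByGene.get(variant.gene);
  const tier = rate === undefined ? "not-eligible" : interpretationTier(rate);
  return { ...variant, geneSaturationRate: rate, tier };
}
```

A pipeline could then down-weight AM relative to other features for variants tagged "top-saturation", which is the behavior the first bullet above recommends.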
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 The AM ≥ 0.99 saturation threshold is conservative
We use ≥ 0.99 to capture only the maximum-tier scores. Lower thresholds (e.g., ≥ 0.95, ≥ 0.9) can only raise the per-gene saturation count; we expect them to produce qualitatively similar gene rankings, but that comparison is not reported here.
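The threshold comparison described here could be run with a small helper that recomputes each gene's rate at several cutoffs. This is a sketch under the assumption that variants are available as `{ gene, am }` pairs; `ratesByThreshold` is a hypothetical name, and no alternative-threshold results are given in the source.

```javascript
// Returns { gene: [rate at thresholds[0], rate at thresholds[1], ...] }.
function ratesByThreshold(variants, thresholds = [0.9, 0.95, 0.99]) {
  const scoresByGene = new Map();
  for (const { gene, am } of variants) {
    if (!scoresByGene.has(gene)) scoresByGene.set(gene, []);
    scoresByGene.get(gene).push(am);
  }
  const out = {};
  for (const [gene, scores] of scoresByGene) {
    out[gene] = thresholds.map(
      (t) => scores.filter((s) => s >= t).length / scores.length
    );
  }
  return out;
}
```

Because the set of variants above a cutoff can only shrink as the cutoff rises, each gene's rates are non-increasing across the returned array; comparing gene orderings between columns would then show how stable the ranking is.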
4.3 The n ≥ 50 gene-eligibility threshold
Genes with < 50 variants are excluded to ensure per-gene saturation-rate stability. Of the 14,715 total genes, 1,004 satisfy the threshold.
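To make the stability argument concrete, a binomial confidence interval shows how much a per-gene rate can move at small sample sizes. The sketch below uses the Wilson score interval, which is a choice made here for illustration; the source does not report intervals, and `wilsonInterval` is a hypothetical helper.

```javascript
// Wilson score interval for a binomial proportion sat / tot.
// z = 1.96 corresponds to roughly 95% coverage.
function wilsonInterval(sat, tot, z = 1.96) {
  if (tot <= 0) throw new RangeError("tot must be positive");
  const p = sat / tot;
  const z2 = z * z;
  const denom = 1 + z2 / tot;
  const center = (p + z2 / (2 * tot)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / tot + z2 / (4 * tot * tot))) / denom;
  return {
    lower: Math.max(0, center - half),
    upper: Math.min(1, center + half),
  };
}
```

At the eligibility floor of 50 variants, a gene with an observed rate of 50% gets an interval of about 13 percentage points on either side, so rates for genes near the threshold are best read as approximate.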
4.4 ClinVar Pathogenic-vs-Benign labels are not gold-standard
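Some labels are wrong. The reported per-gene saturation rate is computed across both labels combined, so it does not depend on the correctness of any individual label. It does depend on each gene's mix of Pathogenic and Benign records: a gene whose ClinVar entries are mostly Pathogenic will tend to show a higher saturation rate than one whose entries are mostly Benign.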
4.5 AM training-set composition is partially proprietary
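AM's training set composition is documented in Cheng et al. (2023), but the per-gene weighting is not fully reported. Cheng et al. describe training on weak labels derived from population-frequency data rather than on ClinVar labels, so any per-gene prior would arise indirectly. The interpretation of per-gene saturation rate as "training-set-prior reflection" is consistent with, but not definitively proven by, AM's published architecture.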
4.6 Per-isoform max-AM aggregation
We use max-AM across isoforms reported by MyVariant.info per variant. Per-isoform variability is small.
4.7 The autosomal-dominant developmental-disorder pattern is post-hoc
The interpretation of the top-25 list as "autosomal-dominant developmental-disorder genes" is post-hoc by gene-disease lookup. It is consistent with established gene-disease relationships but is not a quantitative classification.
5. Implications
- Per-gene AlphaMissense saturation rate (variants at AM ≥ 0.99) spans 0% to 85.5% across 1,004 genes with ≥ 50 ClinVar variants.
- Top-saturation genes are dominated by autosomal-dominant developmental-disorder transcription factors and signaling molecules (TBL1XR1, PAX6, PTEN, BRAF, SOX10, etc.).
- Zero-saturation genes (272 of 1,004) are dominated by autosomal-recessive Mendelian and population-frequency-rich genes (GLB1, MEFV, CYP21A2, ABCC6, MUTYH).
- The proposed mechanism is per-gene training-data prior reflection in AM (§3.5, §4.5): high-saturation genes may have carried a strong "any variant Pathogenic" signal in AM's training, and zero-saturation genes a more balanced signal.
- For variant-prioritization: the per-gene saturation rate is a precomputable meta-feature that informs how to interpret AM scores per gene.
6. Limitations
- Stop-gain excluded (§4.1).
- AM ≥ 0.99 threshold is conservative (§4.2); gene rankings are expected, but not shown, to be robust to alternative thresholds.
- n ≥ 50 gene-eligibility threshold restricts to 1,004 of 14,715 genes (§4.3).
- ClinVar labels not gold-standard (§4.4) — but per-gene saturation rate does not depend on label correctness.
- AM training-set composition partially proprietary (§4.5).
- Per-isoform max-AM aggregation (§4.6).
- Gene-disease classification is post-hoc (§4.7).
7. Reproducibility
- Script: analyze.js (Node.js, ~30 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: result.json with per-gene total / saturated / saturation-rate, top-25 high-saturation genes, count of zero-saturation genes, and the saturation-rate distribution histogram.
- Verification mode: 5 machine-checkable assertions: (a) aggregate saturation rate ≈ 9%; (b) top-saturation gene rate > 80%; (c) ≥ 200 genes with zero saturation; (d) per-gene rate range > 80 percentage points; (e) ≥ 1,000 eligible genes.
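The five assertions can be sketched as a single check over a summary object. This is not the actual verification code: the field names (`aggregateRate`, `maxRate`, `minRate`, `zeroGenes`, `eligibleGenes`) are hypothetical rather than the real result.json schema, and the ±1 percentage point tolerance for "≈ 9%" is an assumption.

```javascript
// summary: hypothetical object with the fields named in the lead-in.
function verify(summary) {
  const checks = {
    aggregateRateNear9pct: Math.abs(summary.aggregateRate - 0.09) < 0.01, // (a), assumed tolerance
    topRateAbove80pct: summary.maxRate > 0.8, // (b)
    atLeast200ZeroGenes: summary.zeroGenes >= 200, // (c)
    rangeAbove80pts: summary.maxRate - summary.minRate > 0.8, // (d)
    atLeast1000EligibleGenes: summary.eligibleGenes >= 1000, // (e)
  };
  const failed = Object.keys(checks).filter((name) => !checks[name]);
  return { ok: failed.length === 0, failed };
}

// Values as reported in this paper.
const reported = {
  aggregateRate: 23966 / 263347,
  maxRate: 65 / 76,
  minRate: 0,
  zeroGenes: 272,
  eligibleGenes: 1004,
};
const verdict = verify(reported);
```

With the reported values all five checks pass, so `verdict.ok` is `true` and `verdict.failed` is empty.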
node analyze.js
node analyze.js --verify
8. References
- Cheng, J., et al. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492.
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Karczewski, K. J., et al. (2020). gnomAD constraint spectrum. Nature 581, 434–443.
- Lambert, S. A., et al. (2018). The human transcription factors. Cell 172, 650–665.
- Tatton-Brown, K., et al. (2014). Mutations in the DNA methyltransferase gene DNMT3A cause an overgrowth syndrome with intellectual disability. Nat. Genet. 46, 385–388.
- McKusick-Nathans Institute (2024). Online Mendelian Inheritance in Man (OMIM). https://omim.org
- Pejaver, V., et al. (2022). Calibration of computational tools for missense variant pathogenicity classification and ClinGen recommendations. Am. J. Hum. Genet. 109, 2163–2177.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.