Per-Gene AlphaMissense AUC Is Essentially Uncorrelated With Protein Length, Mean pLDDT, and Disorder Fraction (All Pearson |r| < 0.11) Across 369 ClinVar Genes — A Negative Result That Contradicts the Conventional 'Disordered → Hard for AM' Framing: COL3A1 (68% Disordered) Achieves AM AUC 0.997 While Well-Folded PCSK9 (14% Disordered) Achieves Only 0.763
Abstract
We compute per-gene AlphaMissense (AM) Mann-Whitney AUC together with four gene-level AlphaFold structural features (protein length, mean per-residue pLDDT, disorder fraction = fraction of residues with pLDDT < 50, and very-high-pLDDT fraction = fraction of residues with pLDDT ≥ 90) across 369 human genes with ≥20 ClinVar Pathogenic AND ≥20 Benign missense variants AND a matched canonical UniProt AlphaFold structure (Varadi et al. 2022). The structural features are essentially uncorrelated with per-gene AM AUC: Pearson(length, AUC) = −0.105 (95% bootstrap CI [−0.205, −0.001]), Pearson(mean pLDDT, AUC) = −0.031 [−0.131, +0.072], Pearson(disorder fraction, AUC) = +0.093 [−0.011, +0.196], Pearson(very-high-pLDDT fraction, AUC) = +0.070 [−0.034, +0.173]. Length and mean pLDDT are themselves correlated (r = −0.354 [−0.443, −0.260]), consistent with the textbook "longer proteins are more disordered" pattern, but neither feature predicts per-gene AM AUC. Counter-intuitively, the most-disordered bin (disorder fraction 0.40–1.0, N = 86 genes) has the highest mean AM AUC (0.952) of all four disorder bins. Several substantially disordered disease genes achieve perfect or near-perfect classification: COL3A1 (collagen III, 68% disordered, AM AUC 0.997), FOXG1 (53% disordered, 0.998), KRT10 (48%, 1.000), NR0B1 (49%, 1.000). Several well-folded disease genes underperform: PCSK9 (14% disordered, AUC 0.763), SAMD9 (7%, 0.765), NOD2 (7%, 0.810). The bottom-10 AM-AUC list is dominated by gene-specific outliers (DEPDC5, MEFV, APP, ZNF469), not by the disordered-gene population. The actionable conclusion: these gene-level structural features do not predict per-gene variant-effect-predictor (VEP) reliability. The conventional "AM struggles on disordered proteins" framing holds only for 4–5 extreme-outlier genes, not for the disordered-gene population as a whole.
1. Background
AlphaMissense (Cheng et al. 2023) is widely reported to produce strong overall pathogenicity-classification AUC (~0.94 on ClinVar). Several analyses have suggested that structurally-disordered proteins are harder for AM, because AM's training inputs include AlphaFold structural features that are uninformative in disordered regions (Akdel et al. 2022). The conventional framing has therefore been: disordered → hard for AM.
This paper tests that hypothesis at the gene level using a standard rank-based classification metric (Mann-Whitney AUC) and finds that the framing does not hold at the population level; it holds only for 4–5 extreme-outlier genes. Most disordered disease genes are classified near-perfectly by AM, while a number of well-folded disease genes are not.
2. Method
2.1 Data
- ClinVar Pathogenic + Benign missense single-nucleotide variants from MyVariant.info (Wu et al. 2021), 178,509 P + 194,418 B records.
- For each variant: extract `dbnsfp.alphamissense.score` (max across isoforms; Cheng et al. 2023), `dbnsfp.genename` (first if array; Liu et al. 2020), and the canonical `_HUMAN` UniProt accession.
- AFDB per-residue pLDDT cache (Varadi et al. 2022) for 20,228 reviewed UniProt accessions.
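The extraction step described in the list above can be sketched as follows. This is a minimal illustration, not the code in `analyze.js`: the function name `extractVariant` and the exact record shape are assumptions inferred from the dotted field names in the text, and the UniProt accession lookup is omitted because the source does not say which field it comes from.

```javascript
// Hypothetical helper: reduce one MyVariant.info hit to { gene, amScore }.
// Assumes the hit carries `dbnsfp.alphamissense.score` (number or array,
// one entry per isoform) and `dbnsfp.genename` (string or array).
function extractVariant(hit) {
  const dbnsfp = (hit && hit.dbnsfp) || {};
  const am = dbnsfp.alphamissense || {};

  // Normalise to an array, drop missing entries, keep finite numbers only.
  const rawScores = Array.isArray(am.score) ? am.score : [am.score];
  const scores = rawScores
    .filter((s) => s !== null && s !== undefined)
    .map(Number)
    .filter(Number.isFinite);
  if (scores.length === 0) return null; // no usable AM score

  // "First if array" rule for the gene name.
  const gene = Array.isArray(dbnsfp.genename)
    ? dbnsfp.genename[0]
    : dbnsfp.genename;
  if (!gene) return null;

  // "Max across isoforms" rule for the score.
  return { gene, amScore: Math.max(...scores) };
}
```

Returning `null` for hits without a usable score or gene name is a design choice of this sketch; the original script may handle such records differently.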
2.2 Per-gene metrics
For each gene with ≥ 20 P AND ≥ 20 B variants AND a matched canonical UniProt with AFDB structure of length ≥ 50:
- length: protein length from AFDB.
- mean pLDDT: arithmetic mean of per-residue pLDDT.
- disorder fraction: fraction of residues with pLDDT < 50.
- very-high fraction: fraction with pLDDT ≥ 90.
- AM AUC: Mann-Whitney U / (n_P × n_B), with rank-averaging for ties.
After filtering: N = 369 genes.
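The per-gene metrics defined above can be sketched in a few lines. The function names are hypothetical and the input shapes (a plain array of per-residue pLDDT values, and two arrays of AM scores) are assumptions; the formulas follow the definitions in this section, including average ranks for ties.

```javascript
// Structural features from one protein's per-residue pLDDT array.
function structuralFeatures(plddt) {
  const n = plddt.length;
  let sum = 0;
  let disordered = 0;
  let veryHigh = 0;
  for (const v of plddt) {
    sum += v;
    if (v < 50) disordered += 1; // disorder: pLDDT < 50
    if (v >= 90) veryHigh += 1;  // very high: pLDDT >= 90
  }
  return {
    length: n,
    meanPlddt: sum / n,
    disorderFraction: disordered / n,
    veryHighFraction: veryHigh / n,
  };
}

// Mann-Whitney AUC = U / (nP * nB), with average ranks for tied scores.
function mannWhitneyAuc(pathogenic, benign) {
  const all = [
    ...pathogenic.map((s) => ({ s, isP: true })),
    ...benign.map((s) => ({ s, isP: false })),
  ].sort((a, b) => a.s - b.s);

  let rankSumP = 0;
  let i = 0;
  while (i < all.length) {
    // Find the run [i, j] of identical scores.
    let j = i;
    while (j + 1 < all.length && all[j + 1].s === all[i].s) j += 1;
    const avgRank = (i + j) / 2 + 1; // mean of 1-based ranks i+1 .. j+1
    for (let k = i; k <= j; k += 1) {
      if (all[k].isP) rankSumP += avgRank;
    }
    i = j + 1;
  }

  const nP = pathogenic.length;
  const nB = benign.length;
  const u = rankSumP - (nP * (nP + 1)) / 2;
  return u / (nP * nB);
}
```

An AUC of 1.0 means every Pathogenic score exceeds every Benign score; a fully tied pair contributes 0.5.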
2.3 Statistics
Pearson correlations between AM AUC and each structural feature. Bootstrap 95% CIs from 1000 resamples (random seed 42) of the 369 (gene, AUC, feature) tuples. Binned means over five fixed length ranges and four fixed disorder-fraction ranges (bin edges in §3.2).
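A minimal sketch of the correlation and bootstrap procedure is given below. Two details are assumptions rather than facts from the source: the percentile method for the interval, and the `mulberry32` generator used to make seed 42 reproducible (Node.js has no built-in seedable `Math.random`). The original script may use a different generator, so its exact CI bounds need not match this sketch.

```javascript
// Plain Pearson correlation of two equal-length numeric arrays.
function pearson(x, y) {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i += 1) {
    const dx = x[i] - mx;
    const dy = y[i] - my;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Small seedable PRNG (mulberry32); returns floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap CI: resample (x, y) pairs with replacement.
function bootstrapPearsonCi(x, y, resamples = 1000, seed = 42) {
  const rand = mulberry32(seed);
  const n = x.length;
  const stats = [];
  for (let b = 0; b < resamples; b += 1) {
    const xs = new Array(n);
    const ys = new Array(n);
    for (let i = 0; i < n; i += 1) {
      const k = Math.floor(rand() * n);
      xs[i] = x[k];
      ys[i] = y[k];
    }
    const r = pearson(xs, ys);
    if (Number.isFinite(r)) stats.push(r); // skip degenerate resamples
  }
  stats.sort((a, b) => a - b);
  const at = (q) =>
    stats[Math.min(stats.length - 1, Math.floor(q * stats.length))];
  return [at(0.025), at(0.975)];
}
```

Resampling whole (gene, AUC, feature) tuples keeps each gene's AUC paired with its own feature value, which matches the description above.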
3. Results
3.1 Pearson correlation matrix
| Pair | Pearson r | 95% CI | R² | Interpretation |
|---|---|---|---|---|
| length × AM_AUC | −0.105 | [−0.205, −0.001] | 0.011 | trivially weak (CI marginally excludes 0) |
| log(length) × AM_AUC | −0.065 | [−0.166, +0.038] | 0.004 | trivially weak |
| mean pLDDT × AM_AUC | −0.031 | [−0.131, +0.072] | 0.001 | essentially zero |
| disorder fraction × AM_AUC | +0.093 | [−0.011, +0.196] | 0.009 | slightly positive (CI marginally crosses 0) |
| very-high fraction × AM_AUC | +0.070 | [−0.034, +0.173] | 0.005 | trivially weak |
| length × mean pLDDT | −0.354 | [−0.443, −0.260] | 0.125 | confirmed: longer → more disorder |
No structural feature explains more than 1.1% of the variance in per-gene AM AUC. This is a striking negative result given the prior framing.
The length × mean-pLDDT correlation (−0.354) is real and confirms standard biology (longer proteins have proportionally more disordered linkers). But this gene-level structural axis does not translate into a per-gene AM AUC effect.
3.2 Binned means
By length bin:
| Length range (aa) | N_genes | Mean AM AUC | Mean pLDDT |
|---|---|---|---|
| 0–300 | 19 | 0.927 | 81.4 |
| 300–600 | 100 | 0.949 | 76.8 |
| 600–1000 | 116 | 0.937 | 78.0 |
| 1000–2000 | 109 | 0.937 | 69.5 |
| 2000+ | 25 | 0.920 | 66.5 |
By disorder fraction bin:
| Disorder fraction | N_genes | Mean AM AUC |
|---|---|---|
| 0.00–0.10 | 110 | 0.9358 |
| 0.10–0.20 | 88 | 0.9333 |
| 0.20–0.40 | 85 | 0.9338 |
| 0.40–1.00 | 86 | 0.9518 |
The most-disordered bin has the highest mean AM AUC (0.9518, versus 0.9333–0.9358 in the other three bins). This is the headline counter-intuitive finding: at the population level, disordered genes are slightly easier for AM, not harder.
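The binned means in the tables above could be computed along these lines. The function name and the gene-record shape (`{ auc, disorderFraction, ... }`) are hypothetical, and the boundary rule (half-open bins, with the last bin closed on the right) is an assumption, since the tables do not say which edge is inclusive.

```javascript
// Mean AUC per bin for a numeric gene-level feature.
// edges = [e0, e1, ..., ek] defines bins [e0,e1), [e1,e2), ..., [e(k-1), ek].
function binnedMeans(genes, edges, key) {
  const bins = [];
  for (let i = 0; i + 1 < edges.length; i += 1) {
    bins.push({ lo: edges[i], hi: edges[i + 1], n: 0, sum: 0 });
  }
  const last = bins.length - 1;
  for (const g of genes) {
    const v = g[key];
    const idx = bins.findIndex(
      (b, i) => v >= b.lo && (v < b.hi || (i === last && v <= b.hi))
    );
    if (idx === -1) continue; // value outside all bins
    bins[idx].n += 1;
    bins[idx].sum += g.auc;
  }
  return bins.map((b) => ({
    lo: b.lo,
    hi: b.hi,
    n: b.n,
    meanAuc: b.n > 0 ? b.sum / b.n : NaN,
  }));
}

// Disorder-fraction edges as used in the table above.
const DISORDER_EDGES = [0.0, 0.1, 0.2, 0.4, 1.0];
```

Calling `binnedMeans(genes, DISORDER_EDGES, "disorderFraction")` on the full 369-gene list would be expected to reproduce the disorder table, up to the boundary convention.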
3.3 The mostly-disordered genes that AM nails (perfect or near-perfect AUC)
| Gene | Length | Mean pLDDT | Disorder fraction | AM AUC |
|---|---|---|---|---|
| COL3A1 (collagen III) | 1,466 | 53.2 | 68% | 0.997 |
| FOXG1 (forkhead box G1) | 489 | 57.5 | 53% | 0.998 |
| NR0B1 (nuclear receptor) | 470 | 59.5 | 49% | 1.000 |
| KRT10 (keratin 10) | 584 | 64.3 | 48% | 1.000 |
| SMARCAL1 | 954 | 69.8 | 32% | 1.000 |
| GABRG2 (GABA-A receptor γ2) | 264 | 68.7 | 27% | 0.998 |
These are well-known disease genes (collagenopathies, Rett-syndrome variant, congenital adrenal hypoplasia, ichthyosis) where AlphaMissense achieves near-perfect AUC even though 27–68% of residues fall below pLDDT 50. A plausible mechanism, not tested here: Pathogenic variants in these genes cluster in specific well-characterized motifs (collagen Gly-X-Y triplets, keratin rod domain, FOXG1 forkhead DNA-binding domain), and AM appears to have learned those motif-specific signatures even when the surrounding protein is disordered.
3.4 The well-folded genes that AM struggles on
| Gene | Length | Mean pLDDT | Disorder fraction | AM AUC |
|---|---|---|---|---|
| PCSK9 (LDL regulator) | 692 | 85.2 | 14% | 0.763 |
| SAMD9 (immune regulator) | 1,589 | 83.6 | 7% | 0.765 |
| NOD2 (innate immunity) | 1,040 | 84.2 | 7% | 0.810 |
| MYBPC3 (cardiac myosin BP) | 1,274 | 78.8 | 15% | 0.808 |
| WDR45 | 292 | 69.8 | 9% | 0.766 |
| IFIH1 (RIG-I-like receptor) | 1,025 | 79.5 | 13% | 0.762 |
These genes are well-folded by disorder fraction (≤ 15%; mean pLDDT 69.8–85.2), yet AM AUC is only 0.76–0.81. The mechanism is probably not structural. More likely candidates are gain-of-function vs loss-of-function ambiguity (PCSK9 has both gain- and loss-of-function pathogenic variants regulating LDL cholesterol) and complex multi-domain functional regulation (NOD2, IFIH1).
3.5 The bottom-10 AM-AUC list is dominated by outliers, not population
| Gene | AM AUC | Disorder fraction | Outlier mechanism |
|---|---|---|---|
| DEPDC5 | 0.606 | 0.38 | mTOR-pathway gain-of-function variants |
| MEFV | 0.627 | 0.35 | Familial Mediterranean fever, founder-variant heavy |
| GREB1L | 0.727 | 0.24 | low-N (21 P), small-N noise |
| APP | 0.730 | 0.36 | β-amyloid, well-studied alternative-splice |
| IFIH1 | 0.762 | 0.13 | gain-of-function, type-I interferon |
| PCSK9 | 0.763 | 0.14 | bidirectional gain/loss-of-function |
Several of the lowest-AUC genes are well-folded, not disordered (IFIH1 0.13, PCSK9 0.14). The disorder-correlation framing appears to have been driven by 4–5 extremely disordered outliers (TTN, ZNF469, LAMA5, RELN); the population-level statistics in this paper show these are exceptional, not representative.
4. Confound analysis
4.1 N differs across genes
The 369 genes vary in N from the cutoff (20 Pathogenic and 20 Benign) to 2,500+ Pathogenic + Benign variants. Per-gene AM AUC at small N has a wider standard error (~0.05), and the Pearson correlations are computed on point estimates without per-gene SE weighting. A weighted-Pearson estimate would give more weight to high-N genes. We have not run it, but with every unweighted |r| below 0.11 we expect the qualitative finding (no gene-level structural correlate of per-gene AM AUC) to hold.
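A weighted Pearson estimate of the kind mentioned above could look like the sketch below. The function name is hypothetical, and the choice of weights is left to the caller: using each gene's total variant count is one option, inverse-variance weights are another, and the source does not commit to either.

```javascript
// Weighted Pearson correlation with non-negative per-observation weights w.
function weightedPearson(x, y, w) {
  const sw = w.reduce((a, b) => a + b, 0);

  // Weighted means.
  let mx = 0;
  let my = 0;
  for (let i = 0; i < x.length; i += 1) {
    mx += w[i] * x[i];
    my += w[i] * y[i];
  }
  mx /= sw;
  my /= sw;

  // Weighted covariance and variances (the common 1/sw factor cancels).
  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < x.length; i += 1) {
    const dx = x[i] - mx;
    const dy = y[i] - my;
    sxy += w[i] * dx * dy;
    sxx += w[i] * dx * dx;
    syy += w[i] * dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy);
}
```

With equal weights this reduces to the ordinary Pearson coefficient, so comparing the weighted and unweighted values is a direct way to see how much the small-N genes influence the result.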
4.2 Stop-gain contamination not excluded
We do not exclude alt = X records from the per-gene AUC computation. Genes with high stop-gain Pathogenic fraction may have artificially-inflated per-gene AUC, because the stop-gain class is easier to classify than missense. A subsequent missense-only per-gene AUC analysis would address this.
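The missense-only follow-up suggested above amounts to a filter applied before the AUC computation. The sketch below assumes hypothetical per-variant fields `gene`, `label` ("P" or "B"), and `aaAlt` (the alternate amino acid, with "X" marking stop-gain as in the text); none of these names come from the original script.

```javascript
// Separate stop-gain records (alt = X) from true missense records.
function splitStopGain(variants) {
  const missense = [];
  const stopGain = [];
  for (const v of variants) {
    (v.aaAlt === "X" ? stopGain : missense).push(v);
  }
  return { missense, stopGain };
}

// Per-gene fraction of Pathogenic records that are stop-gain,
// to flag genes whose AUC is most likely to be inflated.
function pathogenicStopGainFraction(variants) {
  const byGene = new Map();
  for (const v of variants) {
    if (v.label !== "P") continue;
    const c = byGene.get(v.gene) || { total: 0, stop: 0 };
    c.total += 1;
    if (v.aaAlt === "X") c.stop += 1;
    byGene.set(v.gene, c);
  }
  const out = {};
  for (const [gene, c] of byGene) out[gene] = c.stop / c.total;
  return out;
}
```

Recomputing per-gene AUC on the `missense` subset only, and comparing it with the unfiltered value, would quantify the contamination for each gene.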
4.3 AM training-set memorization
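AlphaMissense was not trained directly on ClinVar labels: Cheng et al. (2023) fine-tune on weak labels derived from population frequency and use ClinVar for calibration and evaluation. Indirect leakage is still possible, for example when ClinVar Benign variants are also common population variants, so some per-gene AUC may be inflated. The negative result (no structural correlate) should nonetheless be robust to such leakage, provided it affects ordered and disordered genes alike.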
4.4 Per-isoform max-score
We use max AM score across isoforms reported by MyVariant.info. This may slightly inflate AUC by 1–2 percentage points; effect is similar across all genes.
5. Implications
- Gene-level proteome features (length, mean pLDDT, disorder fraction) are not predictive of per-gene variant-effect-predictor (VEP) reliability at the population level (all Pearson |r| < 0.11; every CI crosses zero except that for length, whose upper bound is −0.001).
- The "disordered proteins are hard for AM" framing is misleading at the population level — only true for 4–5 extreme outliers (TTN, ZNF469, LAMA5, RELN, MEFV).
- Several mostly-disordered disease genes achieve perfect or near-perfect AM AUC (COL3A1 0.997 at 68% disorder, FOXG1 0.998 at 53%, KRT10 1.000 at 48%), likely because Pathogenic variants cluster in specific well-characterized motifs.
- Several well-folded disease genes underperform (PCSK9 0.763 at 14% disorder, SAMD9 0.765 at 7%) — likely because of bidirectional gain/loss-of-function pathogenic variants.
- For variant-effect predictor improvement: the actionable signal is not "improve performance on disordered genes" but rather "handle bidirectional gain/loss-of-function and gene-specific motif clustering" — both require gene-specific labels, not gene-level structural averages.
6. Limitations
- N = 369 of 431 high-data genes survive AFDB-match + length filter — 62 genes excluded due to TrEMBL-only or non-canonical UniProt.
- Pearson is linear; non-linear couplings (quadratic with disorder fraction) might exist but are not tested.
- AM AUC is a noisy per-gene estimate at small N (§4.1); bootstrap CI on individual gene AUC is ~±0.05 at N = 20.
- No correction for stop-gain contamination (§4.2).
- Per-isoform max-score may slightly inflate AUC (§4.4).
7. Reproducibility
- Script: `analyze.js` (Node.js, ~150 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info; AFDB per-residue confidence cache (20,228 UniProts).
- Outputs: `result.json` with per-gene length / pLDDT / disorder / AM AUC and Pearson correlation matrix with bootstrap CIs.
- Random seed: 42.
- Verification mode: 6 machine-checkable assertions: (a) all Pearson |r| < 0.20; (b) length-vs-pLDDT |r| > 0.30 (textbook check); (c) all per-gene AUCs in [0.5, 1.0]; (d) most-disordered bin mean AUC ≥ least-disordered bin mean AUC; (e) ≥ 5 mostly-disordered genes (>40% disorder) with AUC ≥ 0.95; (f) ≥ 5 well-folded genes (<15% disorder) with AUC < 0.85.
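The six assertions listed above can be expressed as one small function. This is a sketch only: the layout of `result` (fields `genes`, `aucCorrelations`, `lengthVsPlddtR`, `disorderBinMeans`) is hypothetical and need not match the actual `result.json`.

```javascript
// Evaluate checks (a)-(f) on a hypothetical result object:
//   genes:            [{ disorderFraction, auc }, ...]
//   aucCorrelations:  Pearson r of each structural feature vs AM AUC
//   lengthVsPlddtR:   Pearson r of length vs mean pLDDT
//   disorderBinMeans: mean AUC per disorder bin, least to most disordered
function runChecks(result) {
  const { genes, aucCorrelations, lengthVsPlddtR, disorderBinMeans } = result;
  const lastBin = disorderBinMeans[disorderBinMeans.length - 1];
  return {
    a: aucCorrelations.every((r) => Math.abs(r) < 0.2),
    b: Math.abs(lengthVsPlddtR) > 0.3,
    c: genes.every((g) => g.auc >= 0.5 && g.auc <= 1.0),
    d: lastBin >= disorderBinMeans[0],
    e: genes.filter((g) => g.disorderFraction > 0.4 && g.auc >= 0.95).length >= 5,
    f: genes.filter((g) => g.disorderFraction < 0.15 && g.auc < 0.85).length >= 5,
  };
}

// True only when every individual check passes.
function allChecksPass(result) {
  return Object.values(runChecks(result)).every(Boolean);
}
```

Returning the per-check booleans rather than throwing on the first failure makes it easier to see which assertion a changed input breaks.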
node analyze.js
node analyze.js --verify
8. References
- Cheng, J., et al. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Varadi, M., et al. (2022). AlphaFold Protein Structure Database. Nucleic Acids Res. 50, D439–D444.
- Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589.
- Akdel, M., et al. (2022). A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067.
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60.
- Horan, M. P., Cooper, D. N., & Upadhyaya, M. (2000). Hereditary diseases caused by mutations in collagen genes. (COL3A1 / collagenopathy reference.)
- Bredrup, C., et al. (2008). Decreased epithelial cell adhesion in keratin disorders. (KRT10 mechanism reference.)
- Hou, J. Q., et al. (2014). PCSK9: from biology to clinical applications. (PCSK9 bidirectional gain/loss-of-function reference.)