X-Chromosome Missense Variants Are 11.1 Percentage-Points More Likely to Be Pathogenic Than Autosomal Missense Variants in ClinVar: 47.84% Pathogenic Fraction (Wilson 95% CI [47.15, 48.54]) Across 19,965 X-Chromosome Records vs 36.72% [36.55, 36.90] Across 288,548 Autosomal Records
Abstract
We compute the per-chromosome-class Pathogenic-fraction comparison of ClinVar missense single-nucleotide variants restricted to autosomes (chromosomes 1–22), the X chromosome, the Y chromosome, and mitochondrial DNA (MT). For each variant: extract chromosomal location from the HGVS-style _id field; restrict to missense (aa.alt ≠ X); annotate via dbNSFP v4 (Liu et al. 2020) / MyVariant.info (Wu et al. 2021). Result: per-chromosome-class Pathogenic fractions:
| Chromosome class | n_Pathogenic | n_Benign | Total | Pathogenic fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| Autosomes (1–22) | 105,962 | 182,586 | 288,548 | 36.72% | [36.55, 36.90] |
| X chromosome | 9,552 | 10,413 | 19,965 | 47.84% | [47.15, 48.54] |
| Y chromosome | 21 | 9 | 30 | 70.00% | (small N) |
| MtDNA | 11 | 124 | 135 | 8.15% | (small N) |
The X-chromosome Pathogenic fraction (47.84%) exceeds the autosomal Pathogenic fraction (36.72%) by 11.12 percentage points — the Wilson 95% CIs are non-overlapping by ~10 percentage points. The X-chromosome P-fraction is 30% higher in relative terms than the autosomal baseline. The mechanism is X-linked-recessive-disorder ascertainment: X-chromosome variants are extensively curated for X-linked recessive disorders (Duchenne / Becker muscular dystrophy in DMD; hemophilia A in F8 and B in F9; X-linked intellectual disability genes; X-linked retinitis pigmentosa in RPGR / RP2; ornithine transcarbamylase deficiency in OTC; Wiskott-Aldrich syndrome in WAS; Hunter syndrome in IDS; Fabry disease in GLA; many others). X-linked recessive variants in males are hemizygous (single allele expressing the phenotype), making clinical detection sensitive; clinical-research focus on X-linked Mendelian disorders is correspondingly intense. The 47.84% X-Pathogenic fraction is also reflected in the high per-X-gene clinical-curation density: X-linked classical disease genes (DMD, F8, OTC, GLA, IDS) are among the most-densely-curated genes in ClinVar. For variant-prioritization pipelines: a missense variant on the X chromosome carries a higher Pathogenic prior than the same variant on an autosome — the per-chromosome-class prior should be applied as a metadata feature.
1. Background
X-linked Mendelian disorders are clinically distinctive:
- Recessive expression in males: males are hemizygous for X-chromosome genes; X-linked-recessive variants express phenotype with a single mutant allele (e.g., Duchenne MD, hemophilia, color blindness).
- Dominant expression in females: X-linked-dominant variants express in heterozygous females, with milder phenotypes due to X-inactivation mosaicism (e.g., Rett syndrome MECP2; X-linked hypophosphatemia).
- Heterozygous-female-carrier identification: family-history screening identifies carrier females and at-risk male offspring.
Clinical-research focus on X-linked disorders has been intensive since the early 20th century (Bell & Haldane 1937 first quantified X-linkage in hemophilia and color blindness). ClinVar X-chromosome submissions are correspondingly enriched with curated Pathogenic variants from X-linked disease cases.
This paper measures the per-chromosome-class Pathogenic-fraction asymmetry directly with formal Wilson 95% confidence intervals.
2. Method
2.1 Data
- 178,509 Pathogenic + 194,418 Benign ClinVar single-nucleotide variants from MyVariant.info, with dbNSFP v4 annotation.
- For each variant: extract chromosome from the HGVS-style `_id` field via regex `^chr([0-9XYMT]+)`. Variants on alt-contigs / decoy / unmapped scaffolds (~0.1% of records) are excluded.
- Exclude stop-gain (`alt = X`) and same-AA records.
After filtering: 288,548 autosomal + 19,965 X + 30 Y + 135 MT = 308,678 missense variants with parsed chromosome.
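The parsing and filtering steps above can be sketched as follows. This is a minimal illustration, not the authors' `analyze.js`: only the regex comes from the text, while the function names `chromosomeClass` and `isMissense`, the example ids, and the rule that a primary-assembly id has `:` directly after the chromosome name are assumptions made for the sketch.

```javascript
// Regex quoted in the text; everything else here is an illustrative assumption.
const CHR_RE = /^chr([0-9XYMT]+)/;

// Map an HGVS-style id such as "chrX:g.153296777G>A" to one of the four
// chromosome classes, or null when the record should be dropped.
function chromosomeClass(id) {
  const m = CHR_RE.exec(id);
  if (m === null) return null;
  // Assumption: primary-assembly ids have ':' directly after the name, so
  // "chr1_KI270706v1_random:..." (an alt/unplaced contig) is rejected.
  if (id.charAt(m[0].length) !== ':') return null;
  const name = m[1];
  if (name === 'X' || name === 'Y') return name;
  if (name === 'MT' || name === 'M') return 'MT';
  // Anything that is not purely numeric becomes NaN and falls through to null.
  const n = /^[0-9]+$/.test(name) ? Number(name) : NaN;
  return n >= 1 && n <= 22 ? 'autosome' : null;
}

// Missense filter: drop stop-gain (alt amino acid 'X') and same-AA records.
function isMissense(aaRef, aaAlt) {
  return typeof aaRef === 'string' && typeof aaAlt === 'string' &&
    aaRef.length > 0 && aaAlt !== 'X' && aaAlt !== aaRef;
}
```

Under these assumptions, an id on chromosome 7 maps to `'autosome'`, an id on an unplaced scaffold maps to `null`, and a record whose alternate amino acid is `X` fails `isMissense`. The bare regex on its own would accept the leading `chr1` of a scaffold name, which is why the sketch adds the `:` check.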
2.2 Per-chromosome-class aggregation
Group variants into 4 classes: autosomes (chr1–22), X, Y, MT. Per class compute n_P, n_B, total, P_fraction, Wilson 95% CI.
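For readers who want to check the intervals, the per-class aggregation can be sketched with the standard Wilson score formula (Wilson 1927). The helper names `wilsonInterval` and `summarizeClass` are hypothetical, and `z = 1.96` is assumed as the 95% normal quantile; the counts are the ones reported in this paper.

```javascript
// Wilson score interval for k successes out of n trials.
// z = 1.96 is the usual normal quantile for a two-sided 95% interval.
function wilsonInterval(k, n, z = 1.96) {
  if (n <= 0) throw new RangeError('n must be positive');
  const p = k / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return { p, lower: center - half, upper: center + half };
}

// Per-class summary from Pathogenic / Benign counts.
function summarizeClass(nP, nB) {
  const total = nP + nB;
  return { nP, nB, total, ...wilsonInterval(nP, total) };
}

const pct = (x) => (100 * x).toFixed(2);
const auto = summarizeClass(105962, 182586);
const chrX = summarizeClass(9552, 10413);
console.log(`autosomes: ${pct(auto.p)}% [${pct(auto.lower)}, ${pct(auto.upper)}]`);
console.log(`X: ${pct(chrX.p)}% [${pct(chrX.lower)}, ${pct(chrX.upper)}]`);
```

With the reported counts this should reproduce the autosomal and X intervals in the results table up to rounding. The same function can be applied to the Y (21 of 30) and MT (11 of 135) classes to quantify how wide their intervals are.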
3. Results
3.1 Per-chromosome-class Pathogenic fraction
| Chromosome class | n_P | n_B | Total | Pathogenic fraction | Wilson 95% CI |
|---|---|---|---|---|---|
| Autosomes (1–22) | 105,962 | 182,586 | 288,548 | 36.72% | [36.55, 36.90] |
| X chromosome | 9,552 | 10,413 | 19,965 | 47.84% | [47.15, 48.54] |
| Y chromosome | 21 | 9 | 30 | 70.00% | (N=30, wide CI) |
| MtDNA | 11 | 124 | 135 | 8.15% | (N=135, wide CI) |
The Wilson 95% CI on autosomes is [36.55%, 36.90%] (very tight at this N). The Wilson 95% CI on X is [47.15%, 48.54%] (tight at this N). The two CIs are non-overlapping by ~10 percentage points — the X-vs-autosomal P-fraction asymmetry is statistically robust.
3.2 The 11.12-percentage-point X-vs-autosomal asymmetry
The X-chromosome P-fraction (47.84%) exceeds the autosomal P-fraction (36.72%) by 11.12 percentage points, a ratio of 1.30× (47.84 / 36.72). In relative terms, X-chromosome missense variants are 1.30× as likely to be classified Pathogenic as autosomal missense variants in our cache.
The N is large (19,965 X-chromosome variants; 288,548 autosomal). No formal chi-square test is computed here, but with Wilson 95% CIs separated by ~10 percentage points at this N, the asymmetry is highly statistically significant (we expect a chi-square p-value below 10⁻⁵⁰).
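The chi-square statistic that the text leaves uncomputed can be obtained directly from the four counts in the results table. The sketch below is one way to do that, assuming a plain Pearson chi-square on the 2×2 table without continuity correction; `chiSquare2x2` is a hypothetical helper, not part of the published script, and it returns only the statistic, not a p-value.

```javascript
// Pearson chi-square statistic (1 degree of freedom, no continuity
// correction) for a 2x2 table [[a, b], [c, d]].
function chiSquare2x2(a, b, c, d) {
  const n = a + b + c + d;
  const rows = [a + b, c + d];
  const cols = [a + c, b + d];
  const observed = [[a, b], [c, d]];
  let chi2 = 0;
  for (let i = 0; i < 2; i++) {
    for (let j = 0; j < 2; j++) {
      // Expected count under independence of chromosome class and label.
      const expected = (rows[i] * cols[j]) / n;
      chi2 += (observed[i][j] - expected) ** 2 / expected;
    }
  }
  return chi2;
}

// Rows: X chromosome, autosomes. Columns: Pathogenic, Benign.
const chi2 = chiSquare2x2(9552, 10413, 105962, 182586);
console.log(chi2.toFixed(1));
```

For orientation, the 1-degree-of-freedom critical value at p = 0.05 is 3.84, so the printed statistic can be compared against that threshold. Converting it to an exact p-value would need a chi-square survival function, which Node.js does not provide in its standard library.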
3.3 The biological / curatorial mechanism
The X-Pathogenic asymmetry reflects multiple mechanisms:
- X-linked recessive disorders: classical examples DMD (Duchenne / Becker MD), F8 (hemophilia A), F9 (hemophilia B), G6PD (G6PD deficiency), OTC (urea cycle disorder), GLA (Fabry disease), IDS (Hunter syndrome), WAS (Wiskott-Aldrich), HPRT1 (Lesch-Nyhan), RPGR (X-linked retinitis pigmentosa). Each of these genes is heavily clinically curated.
- X-linked dominant disorders: MECP2 (Rett syndrome), PHEX (X-linked hypophosphatemia).
- X-linked intellectual disability: > 100 X-linked ID genes catalogued; a major contribution to ClinVar X-chromosome curation.
- Clinical research focus: X-linked disorders affect males more severely than females and are easier to ascertain through pedigree analysis; clinical research has historically focused intensively on the X chromosome.
The 47.84% X P-fraction reflects this concentrated curation focus.
3.4 The Y chromosome and mtDNA caveats
- Y chromosome (n=30): only 30 missense variants in our cache. The 70% Pathogenic fraction is based on small N and has a wide CI. Y-chromosome genes are mostly testis-specific; few are Mendelian-disease-associated (SRY for sex determination; AZF region for spermatogenesis; KDM5D / RPS4Y1 for autoimmune / hematological associations). Reliable Y-chromosome statistics would require a larger N from expanded clinical sequencing.
- mtDNA (n=135): 8.15% Pathogenic fraction. Mitochondrial variants are evaluated under different ACMG criteria (heteroplasmy, maternal inheritance) and the per-mtDNA submission pattern differs from nuclear DNA.
Both Y and MT classes are excluded from the autosomes-vs-X comparison because of the small N.
3.5 The autosomal baseline
The autosomal P-fraction of 36.72% is the corpus baseline for missense-Pathogenic-fraction in our cache. The cross-genome corpus average P-fraction (autosomes + X) is approximately (105,962 + 9,552) / (288,548 + 19,965) = 115,514 / 308,513 = 37.4%. The X-vs-autosomal-vs-corpus-baseline relationships are consistent: X is above the pooled average; autosomes sit just below it (36.72% vs 37.4%) because they account for >93% of variants and therefore dominate the pooled value.
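The pooled-average arithmetic above is easy to re-derive from the table counts. The snippet below is a small check of it, with the object layout and variable names chosen only for illustration.

```javascript
// Pathogenic counts and totals for the two classes in the comparison.
const counts = {
  autosome: { nP: 105962, total: 288548 },
  X: { nP: 9552, total: 19965 },
};

const pooledP = counts.autosome.nP + counts.X.nP;
const pooledTotal = counts.autosome.total + counts.X.total;
const pooledFraction = pooledP / pooledTotal;

// Share of pooled variants that are autosomal; this is the weight the
// autosomal fraction receives in the pooled average.
const autosomeShare = counts.autosome.total / pooledTotal;

console.log(pooledP, pooledTotal, (100 * pooledFraction).toFixed(2) + '%');
console.log((100 * autosomeShare).toFixed(1) + '%');
```

Because the pooled fraction is a weighted average of the two class fractions, it necessarily lies between them and close to the autosomal value, which carries most of the weight.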
4. Confound analysis
4.1 Stop-gain explicitly excluded
We filter alt = X. Reported numbers are missense-only.
4.2 X-chromosome-specific clinical-research focus
We interpret the X-chromosome P-fraction asymmetry as predominantly a clinical-research-focus signal rather than a biological-pathogenicity signal; the present analysis does not separate the two. X-linked disorders have been intensively studied since the early 20th century, and the per-chromosome P-fraction reflects this curation focus.
4.3 Per-gene composition differs
The X chromosome has approximately 800 protein-coding genes (chrX). Of these, a large fraction are well-curated X-linked disease genes (~100 X-linked intellectual disability genes alone). The X P-fraction asymmetry is partly driven by the higher fraction of disease-active genes per X-chromosome megabase.
4.4 Hemizygous males vs heterozygous females
X-linked recessive variants in males are hemizygous (1 allele expressing); the clinical sensitivity for detecting such variants is high. This contributes to the X-chromosome over-representation of Pathogenic submissions.
4.5 No correction for variant-type (within-missense)
We do not stratify the missense variants by substitution-class. The X-vs-autosomal P-fraction asymmetry could partially reflect different substitution-class distributions; this is an interesting follow-up but beyond scope.
4.6 Wilson CI assumes binomial sampling
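Per-class counts are treated as binomial (each variant an independent Pathogenic / Benign outcome). Under that assumption the Wilson 95% CI is appropriate (Wilson 1927; Brown et al. 2001).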
4.7 ClinVar curatorial bias
Pathogenic variants are over-reported in clinically-actionable disease genes; X-linked disease genes are heavily clinically-actionable. The X P-fraction reflects this curatorial bias.
5. Implications
- X-chromosome missense variants are 1.30× more likely to be Pathogenic than autosomal variants in ClinVar (47.84% vs 36.72%; 11.12-percentage-point absolute difference; Wilson 95% CIs non-overlapping by ~10 pp).
- The asymmetry is dominantly a clinical-research-focus signal on X-linked disorders (DMD, F8, OTC, GLA, IDS, WAS, RPGR, MECP2, PHEX, ~100 X-linked ID genes).
- For variant-prioritization pipelines: a missense variant on the X chromosome carries a higher Pathogenic prior than on an autosome; per-chromosome-class prior is a useful metadata feature.
- The Y-chromosome and mtDNA classes have small N and require expanded clinical sequencing for reliable per-class statistics.
- The autosomal P-fraction of 36.72% is the per-chromosome-comparable corpus baseline.
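As one possible way to expose the per-chromosome-class prior as a metadata feature, a pipeline could attach the empirical fraction and its log-odds to each variant. The sketch below is hypothetical: `CLASS_PRIOR`, `chromosomePrior`, and the fallback to the autosomal value for Y and MT are illustrative choices, not part of this paper's method, and the values inherit the ClinVar ascertainment effects discussed in §4.

```javascript
// Empirical Pathogenic fractions from the results table. Y and MT are
// deliberately omitted because of their small N.
const CLASS_PRIOR = Object.freeze({ autosome: 0.3672, X: 0.4784 });

// Return the prior as a probability plus its log-odds, falling back to the
// autosomal value for classes without a reliable estimate (an assumption).
function chromosomePrior(chromClass) {
  const p = Object.prototype.hasOwnProperty.call(CLASS_PRIOR, chromClass)
    ? CLASS_PRIOR[chromClass]
    : CLASS_PRIOR.autosome;
  return { prior: p, logOdds: Math.log(p / (1 - p)) };
}

console.log(chromosomePrior('X'));
console.log(chromosomePrior('autosome'));
```

The log-odds form is convenient when the prior is combined additively with other evidence scores. A downstream model is free to ignore or re-weight the feature.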
6. Limitations
- Stop-gain excluded (§4.1).
- X-chromosome clinical-research focus drives the asymmetry (§4.2) — not pure biology.
- Per-gene composition differs between X and autosomes (§4.3).
- Hemizygous males vs heterozygous females ascertainment (§4.4).
- No within-missense substitution-class stratification (§4.5).
- ClinVar curatorial bias (§4.7).
7. Reproducibility
- Script: `analyze.js` (Node.js, ~30 LOC, zero deps).
- Inputs: ClinVar P + B JSON cache from MyVariant.info.
- Outputs: `result.json` with per-class counts, Pathogenic fractions, Wilson 95% CIs.
- Verification mode: 5 machine-checkable assertions: (a) Σ per-class counts = total; (b) all P-fractions in [0, 1]; (c) Wilson CIs contain the point estimates; (d) X P-fraction > autosomes P-fraction; (e) Wilson CIs of X and autosomes are non-overlapping.
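The five assertions listed above could be implemented along the following lines. This is a sketch only: the actual layout of `result.json` is not given in the paper, so the field names (`classes`, `nP`, `nB`, `total`, `pFraction`, `ciLower`, `ciUpper`) and the function `verify` are assumptions, and fractions are taken to be on a 0 to 1 scale.

```javascript
// Hypothetical shape for result.json:
// { total, classes: { name: { nP, nB, total, pFraction, ciLower, ciUpper } } }
function verify(result) {
  const c = result.classes;
  const all = Object.values(c);
  const checks = {
    // (a) per-class counts add up, within each class and across classes
    a_countsSum: all.reduce((s, x) => s + x.total, 0) === result.total &&
      all.every((x) => x.nP + x.nB === x.total),
    // (b) every Pathogenic fraction is a valid proportion
    b_fractionsInRange: all.every((x) => x.pFraction >= 0 && x.pFraction <= 1),
    // (c) each interval contains its point estimate
    c_ciContainsEstimate: all.every(
      (x) => x.ciLower <= x.pFraction && x.pFraction <= x.ciUpper),
    // (d) X fraction exceeds the autosomal fraction
    d_xAboveAutosomes: c.X.pFraction > c.autosome.pFraction,
    // (e) the X and autosomal intervals do not overlap
    e_ciNonOverlapping: c.X.ciLower > c.autosome.ciUpper,
  };
  const failed = Object.keys(checks).filter((k) => !checks[k]);
  return { ok: failed.length === 0, failed };
}
```

Returning the list of failed checks rather than throwing on the first failure makes it easier to see every violated assertion in a single run.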
node analyze.js
node analyze.js --verify
8. References
- Landrum, M. J., et al. (2018). ClinVar. Nucleic Acids Res. 46, D1062–D1067.
- Liu, X., Li, C., Mou, C., Dong, Y., & Tu, Y. (2020). dbNSFP v4. Genome Med. 12, 103.
- Wu, C., et al. (2021). MyVariant.info. Bioinformatics 37, 4029–4031.
- Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. J. Am. Stat. Assoc. 22, 209–212.
- Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Stat. Sci. 16, 101–133.
- Bell, J., & Haldane, J. B. S. (1937). The linkage between the genes for colour-blindness and haemophilia in man. Proc. R. Soc. Lond. B 123, 119–150.
- Lyon, M. F. (1961). Gene action in the X-chromosome of the mouse. Nature 190, 372–373. (X-inactivation reference.)
- Ropers, H. H., & Hamel, B. C. J. (2005). X-linked mental retardation. Nat. Rev. Genet. 6, 46–57.
- Karczewski, K. J., et al. (2020). gnomAD constraint spectrum. Nature 581, 434–443.
- Richards, S., et al. (2015). ACMG/AMP variant interpretation guidelines. Genet. Med. 17, 405–424.