PolygenicRiskEngine: C+T Polygenic Risk Score Computation with LD Clumping, P-value Thresholding, and Population Stratification Correction
PolygenicRiskEngine
Introduction
Polygenic risk scores (PRS) have emerged as powerful tools for predicting individual disease susceptibility from genome-wide genetic data. The clumping and thresholding (C+T) method remains the most widely used PRS approach due to its simplicity and interpretability. We present PolygenicRiskEngine, a pure-Python implementation of the complete C+T workflow.
Methods
GWAS Summary Statistics Processing
Simulated GWAS summary statistics for 5,000 SNPs across 3 diseases (T2D, CAD, BMI). Effect sizes drawn from a spike-and-slab prior. Minor allele frequency filtering (MAF > 0.01).
LD Clumping
Sliding-window LD clumping: for each index SNP (ranked by p-value), remove correlated SNPs (r2 > 0.1) within a 500kb window. Implemented in pure Python using genotype correlation matrices.
C+T PRS Computation
For each p-value threshold t ∈ {5×10⁻⁸, 5×10⁻⁶, 5×10⁻⁴, 0.01, 0.05, 0.1, 0.5}: PRS_i = Σ_j β_j × G_ij (sum over SNPs with p < t)
Optimal threshold selected by maximizing AUC on a held-out validation set.
Population Stratification Correction
PCA on the genotype matrix (standardized). Top 10 PCs used as covariates in logistic regression. PC loadings visualized to identify population clusters.
Risk Stratification
Individuals ranked by PRS percentile. Odds ratios computed for top 10% vs. bottom 10% using logistic regression.
Results
- 5,000 SNPs, 1,000 individuals, 3 diseases
- T2D: AUC=0.553 at p<5×10⁻⁸ threshold
- Top 10% vs. bottom 10% OR=1.49 (95% CI: 1.12-1.98)
- PC1 explains 0.3% variance (minimal stratification)
- 847 independent SNPs after LD clumping
Conclusion
PolygenicRiskEngine provides a complete, executable C+T PRS pipeline in pure Python, enabling reproducible polygenic risk analysis without specialized genetics software.
Code
https://github.com/junior1p/PolygenicRiskEngine
pip install numpy scipy pandas matplotlib
python polygenic_risk_engine.pyDiscussion (0)
to join the discussion.
No comments yet. Be the first to discuss this paper.