CpGMethylationEngine: Whole-Genome Bisulfite Sequencing Analysis with CpG Island Detection, DMR Calling, and Epigenetic Clock
Introduction
DNA methylation at CpG dinucleotides is the most studied epigenetic modification in mammals. CpG islands (CGIs) are GC-rich regions often associated with gene promoters that are typically unmethylated in normal tissues but frequently hypermethylated in cancer. Differentially methylated regions (DMRs) between normal and disease states serve as biomarkers and functional regulators. Epigenetic clocks based on CpG methylation patterns provide accurate biological age estimates.
Methods
Data Simulation
Synthetic WGBS data was generated for 500 samples (250 normal, 250 cancer) across 5000 CpG sites. Beta values (0-1) were simulated using Beta distributions with cancer-specific hypermethylation at promoter CGIs and hypomethylation at repetitive elements.
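The simulation step can be sketched as follows. This is a minimal illustration, not the authors' generator: the Beta shape parameters, the `promoter_fraction` argument, and the two site classes (`"cgi"`, `"repeat"`) are assumptions chosen only to reproduce the stated direction of the effects, and the sketch does not aim to match the reported counts.

```python
import random

def simulate_beta_values(n_normal=250, n_cancer=250, n_sites=5000,
                         promoter_fraction=0.4, seed=0):
    """Return (labels, site_types, betas); betas[i][j] is sample i at site j."""
    rng = random.Random(seed)
    # Hypothetical site classes: promoter CGI sites vs. repetitive elements.
    site_types = ["cgi" if rng.random() < promoter_fraction else "repeat"
                  for _ in range(n_sites)]
    # Assumed Beta(a, b) shape parameters per (site class, group).
    shapes = {
        ("cgi", "normal"): (2.0, 8.0),     # mean 0.2: unmethylated promoter CGI
        ("cgi", "cancer"): (6.0, 4.0),     # mean 0.6: hypermethylated in cancer
        ("repeat", "normal"): (8.0, 2.0),  # mean 0.8: methylated repeats
        ("repeat", "cancer"): (5.0, 5.0),  # mean 0.5: hypomethylated in cancer
    }
    labels = ["normal"] * n_normal + ["cancer"] * n_cancer
    betas = [[rng.betavariate(*shapes[(t, g)]) for t in site_types]
             for g in labels]
    return labels, site_types, betas
```

Calling `simulate_beta_values(10, 10, 40)` gives a small matrix that is convenient for testing downstream steps before running the full 500 × 5000 design.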
CpG Island Detection
CpG islands were identified using a sliding window approach: obs/exp CpG ratio > 0.6, GC content > 50%, minimum length 200bp.
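A sliding-window scan with these thresholds might look like the sketch below. It assumes the input is a plain DNA string and that the window length equals the 200 bp minimum, so every reported interval meets the length criterion by construction; the function names and the `step` parameter are illustrative and not taken from the repository.

```python
def window_stats(seq):
    """Return (gc_fraction, obs_exp_cpg) for an uppercase DNA string."""
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (#C * #G) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc, obs_exp

def find_cpg_islands(seq, window=200, step=1, min_gc=0.5, min_oe=0.6):
    """Return merged half-open (start, end) intervals of passing windows."""
    seq = seq.upper()
    islands = []
    for start in range(0, len(seq) - window + 1, step):
        gc, oe = window_stats(seq[start:start + window])
        if gc > min_gc and oe > min_oe:
            end = start + window
            if islands and start <= islands[-1][1]:
                # Overlapping or adjacent window: extend the current island.
                islands[-1] = (islands[-1][0], end)
            else:
                islands.append((start, end))
    return islands
```

Merging overlapping windows is one design choice among several; a stricter variant would re-check the thresholds on the merged interval and trim its ends.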
DMR Calling
Differentially methylated regions were identified using Welch's t-test comparing beta values between normal and cancer groups. Benjamini-Hochberg FDR correction was applied with threshold q<0.05 and minimum |Δβ|>0.2.
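The testing and filtering steps can be sketched with NumPy and SciPy as below. The sketch assumes one test per CpG site on a samples × sites matrix and implements Benjamini-Hochberg by hand; the function names are hypothetical, and grouping adjacent significant sites into regions is not shown.

```python
import numpy as np
from scipy import stats

def benjamini_hochberg(pvals):
    """Return BH-adjusted q-values in the original order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(ranked, 0.0, 1.0)
    return q

def call_dm_sites(normal, cancer, q_max=0.05, min_delta=0.2):
    """normal, cancer: (samples, sites) arrays of beta values."""
    # equal_var=False selects Welch's unequal-variance t-test.
    _, p = stats.ttest_ind(cancer, normal, axis=0, equal_var=False)
    q = benjamini_hochberg(p)
    delta = cancer.mean(axis=0) - normal.mean(axis=0)  # mean beta difference
    significant = (q < q_max) & (np.abs(delta) > min_delta)
    hyper = significant & (delta > 0)  # more methylated in cancer
    hypo = significant & (delta < 0)   # less methylated in cancer
    return q, delta, hyper, hypo
```

Requiring both the q-value and the effect-size threshold keeps sites whose difference is statistically detectable but too small to matter out of the final list.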
Epigenetic Clock
A Horvath-style epigenetic clock was trained by selecting 50 age-correlated CpGs and fitting a weighted linear regression model to predict chronological age from methylation values.
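One way to read this step is sketched below: rank sites by absolute Pearson correlation with age, keep the top 50, and fit ordinary least squares so that predicted age is a weighted sum of the selected beta values plus an intercept. This interpretation of "weighted linear regression" is an assumption, as are all function names; the original Horvath clock uses penalized (elastic net) regression, which this sketch does not implement.

```python
import numpy as np

def fit_clock(betas, ages, n_cpgs=50):
    """betas: (samples, sites); ages: (samples,). Returns (idx, coef, intercept)."""
    X = np.asarray(betas, dtype=float)
    y = np.asarray(ages, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of every site with age (0 for constant sites).
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    idx = np.argsort(-np.abs(r))[:n_cpgs]
    # Ordinary least squares on the selected sites plus an intercept column.
    A = np.column_stack([X[:, idx], np.ones(len(y))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return idx, w[:-1], w[-1]

def predict_age(betas, idx, coef, intercept):
    """Apply a fitted clock to a (samples, sites) beta matrix."""
    return np.asarray(betas, dtype=float)[:, idx] @ coef + intercept

def evaluate(pred, ages):
    """Return (mean absolute error, Pearson r) between predictions and ages."""
    ages = np.asarray(ages, dtype=float)
    return float(np.mean(np.abs(pred - ages))), float(np.corrcoef(pred, ages)[0, 1])
```

Because the sites are selected using the same ages the model is fitted to, MAE and r are best computed on samples held out from both steps.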
Methylation Entropy
Epiallele heterogeneity was quantified as Shannon entropy: H = -Σp·log2(p) where p represents the frequency of each methylation state across samples.
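For a single CpG, the entropy calculation can be sketched as follows. The text does not say how continuous beta values are mapped to discrete methylation states, so the equal-width binning and the `n_bins=4` default (which caps entropy at 2 bits) are assumptions made for illustration.

```python
import math
from collections import Counter

def methylation_entropy(beta_values, n_bins=4):
    """Shannon entropy in bits of binned beta values for one CpG across samples."""
    # Map each beta in [0, 1] to a bin index; beta == 1.0 goes in the top bin.
    states = [min(int(b * n_bins), n_bins - 1) for b in beta_values]
    total = len(states)
    counts = Counter(states)
    # H = -sum(p * log2(p)) over the observed states.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

With this binning, a site where every sample falls in one bin has entropy 0, a site split evenly between two bins has 1 bit, and a site spread evenly over all four bins reaches the 2-bit maximum.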
Results
CpG island detection identified 1983 islands (39.7% of sites). DMR analysis revealed 4052 significant DMRs (4041 hypermethylated, 11 hypomethylated in cancer). The epigenetic clock achieved MAE=27.05 years with r=0.276. Mean methylation entropy was 1.755 bits with 4850 high-entropy CpGs (>1 bit).
Discussion
CpGMethylationEngine combines CpG island detection, DMR calling, an epigenetic clock, and methylation entropy in a single framework for WGBS analysis. The predominance of hypermethylated DMRs in cancer (4041 of 4052) is consistent with polycomb-mediated silencing of tumor suppressor genes.
Code Availability
https://github.com/BioTender-max/CpGMethylationEngine
Key Results
- Samples: 500 (250 normal, 250 cancer) × 5000 CpG sites
- CpG islands: 1983 (39.7%)
- DMRs: 4052 (hyper=4041, hypo=11)
- Epigenetic clock: MAE=27.05 yr, r=0.276
- Methylation entropy: 1.755 bits mean