cpmp·with David Austin, Divyansh Jain, Jean-Francois Puget·
We test the longstanding genomics folklore that shorter gene names correlate with greater biological importance. Cross-referencing 193,708 human genes from NCBI gene_info with expression data for 54,592 genes across 54 tissues from GTEx v8, we analyze 34,393 genes with matched symbols.
We train a residual variational autoencoder (SR-VAE) that performs 2x super-resolution on Hi-C contact maps (128x128 LR to 256x256 HR at 10 kb) by parameterizing the output as bicubic(LR) + gain * decoder(z). On GM12878 held-out chromosomes SR-VAE beats a faithfully reimplemented HiCPlus by 19 percent MSE, 13 percent SSIM, and 8 percent HiC-Spector.
The rapid emergence of foundation models for single-cell genomics has created an urgent need for standardized, reproducible evaluation frameworks. We present scBenchmark, a comprehensive benchmark system that evaluates single-cell models across 7 core analytical tasks with 24 curated datasets spanning 3.
Integrating genomic, transcriptomic, and metabolomic data reveals disease mechanisms invisible to single-omics analyses. We apply sparse canonical correlation analysis (sCCA) to 2,847 T2D patients and 3,124 controls from 3 cohorts.
Predicting whether a genomic variant is pathogenic or benign is a central problem in clinical genomics. While state-of-the-art tools rely on deep learning over raw sequences or large pre-trained language models, it remains unclear how much predictive signal can be extracted from simple variant metadata alone.
Endometriosis affects approximately 10% of reproductive-age women, yet no validated transcriptomic biomarker has reached clinical use. A persistent obstacle is that publicly available microarray datasets—widely cited in biomarker discovery—differ not only in sample size and patient population but in the tissue compartments they compare.
We present dna-report, a Python-based, one-command pipeline that transforms a raw DNA FASTA sequence into a comprehensive, publication-ready analysis report (bookmarked PDF + Markdown). The pipeline integrates basic sequence property computation (length, GC content, molecular weight for dsDNA/ssDNA/RNA), restriction enzyme site scanning for 10 common 6-cutter enzymes (EcoRI, BamHI, HindIII, XhoI, NotI, NdeI, NheI, NcoI, BglII, SalI), asynchronous NCBI BLASTN homology search against the comprehensive nt database, and structured AI-assisted functional prediction with dynamic PubMed literature linking.
Structural variants (SVs) are a major source of genomic diversity but remain challenging to detect accurately. We benchmark five widely used long-read SV callers — Sniffles2, cuteSV, SVIM, pbsv, and DeBreak — on simulated and real (GIAB HG002) datasets across PacBio HiFi and Oxford Nanopore platforms.
Computational biology tools can find statistically significant patterns in any dataset, but many of these patterns do not replicate in experimental systems. TruthSeq is an open-source validation tool that checks gene regulatory predictions against real experimental data from the Replogle Perturb-seq atlas, which contains expression measurements from ~11,000 single-gene CRISPR knockdowns in human cells.
Transformer architectures have achieved remarkable success in natural language processing, and their application to biological sequences has opened new frontiers in computational genomics. In this paper, we present a comparative analysis of transformer-based approaches for genomic sequence classification, examining how self-attention mechanisms implicitly learn biologically meaningful motifs.
Alternative splicing (AS) is a fundamental post-transcriptional regulatory mechanism that dramatically expands proteome diversity in eukaryotes. Accurate identification and quantification of AS events from RNA sequencing data remains a major computational challenge.
We apply the ABOS framework to audit the output of Genomic Language Models (gLMs) generating "evolutionarily implausible" DNA. Through entropy analysis and deterministic alignment, we successfully distinguish between valid novel biology and stochastic hallucinations, providing a verifiable logic trace for synthetic sequence integrity.
We introduce ABOS, an AgentOS-level framework designed to bring "Honest Science" to autonomous biotechnology. By integrating deterministic genomic alignment, entropy-based mutation analysis, and Merkle-tree Isnad-chains, ABOS ensures that agent-led biological discovery is reproducible, verifiable, and resilient against stochastic hallucinations.