Variation in coding sequence (CDS) length across prokaryotic genomes is routinely reported in comparative genomics, but it remains unclear how much of this variation reflects genuine biological signals versus systematic measurement artifacts introduced by annotation conventions. We collected 21,259 validated CDS entries from 21 phylogenetically diverse prokaryote species (16 bacteria, 5 archaea) via UniProt, cross-referenced with genomic GC content from NCBI Taxonomy.
Cross-cohort Alzheimer’s disease (AD) blood transcriptomic prediction is sensitive to cohort shift and can be misinterpreted without strict evaluation controls. We present an open reproducible study on GEO cohorts GSE63060 and GSE63061 with three design principles: leakage-safe target holdout evaluation, consistent permutation-null reporting, and explicit biological feature ablations using open AMP-AD Agora nominated targets.
Public RNA-seq reanalysis often fails for a simple reason: the repository record does not contain enough evidence to justify the requested contrast. We present `rna-seq-estimability-certificate`, an executable bioinformatics skill that decides whether a bulk RNA-seq differential-expression question is estimable from the available sample annotations and files.
Public RNA-seq repositories make reanalysis possible at large scale, but many studies fail before modeling because the contrast, replicate structure, and minimum sample metadata are underspecified. We present `rna-seq-reanalysis-triage`, a bioinformatics skill for agent-executable first-pass assessment of public bulk RNA-seq studies.
Epigenetic aging benchmarks typically assess a single chromatin axis and misclassify signatures dominated by nuisance biology. We construct a 208-gene four-pillar benchmark — the Fidelity Atlas — spanning PRC2-linked memory (30 genes), nucleosome turnover (24), nuclear architecture (25), and AP-1 reprogramming (25), with five non-overlapping confounder panels (104 genes).
Gene expression signatures are routinely dismissed as irreproducible when they fail cross-context validation — but how much of that apparent irreproducibility is a measurement artifact? We decompose Cochran's Q into within-program and between-program components across 7 MSigDB Hallmark signatures scored in 30 GEO cohorts (5 biological programs).
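The within/between split of Cochran's Q mentioned above can be illustrated with the standard fixed-effect subgroup identity, Q_total = Q_within + Q_between. This is a minimal sketch only: the function name `partition_cochran_q`, the inverse-variance weights, and the toy effect sizes are assumptions made for illustration, not the study's actual pipeline or data.

```python
from collections import defaultdict


def partition_cochran_q(effects, variances, groups):
    """Split Cochran's Q into within-group and between-group parts.

    Assumes fixed-effect inverse-variance weights w_i = 1 / v_i
    (an illustrative choice; the source does not state its weighting).
    """
    weights = [1.0 / v for v in variances]

    def weighted_mean(idx):
        # Returns the weighted mean and the total weight for the given indices.
        total_w = sum(weights[i] for i in idx)
        mean = sum(weights[i] * effects[i] for i in idx) / total_w
        return mean, total_w

    all_idx = list(range(len(effects)))
    grand_mean, _ = weighted_mean(all_idx)
    q_total = sum(weights[i] * (effects[i] - grand_mean) ** 2 for i in all_idx)

    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)

    q_within = 0.0
    q_between = 0.0
    for idx in by_group.values():
        group_mean, group_w = weighted_mean(idx)
        # Heterogeneity of studies around their own group mean.
        q_within += sum(weights[i] * (effects[i] - group_mean) ** 2 for i in idx)
        # Heterogeneity of group means around the grand mean.
        q_between += group_w * (group_mean - grand_mean) ** 2

    return q_total, q_within, q_between


# Hypothetical toy data: five cohort-level effects in two programs.
effects = [0.10, 0.20, 0.15, 0.80, 0.90]
variances = [0.01, 0.02, 0.01, 0.02, 0.01]
groups = ["A", "A", "A", "B", "B"]

q_total, q_within, q_between = partition_cochran_q(effects, variances, groups)
print(f"Q_total={q_total:.3f} Q_within={q_within:.3f} Q_between={q_between:.3f}")
```

In this toy example most of the heterogeneity sits between the two groups, which is the pattern one would expect if apparent irreproducibility mostly reflects differences between programs rather than noise within them.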
Zero-shot missense scoring with protein language models is usually treated as a residue-likelihood problem. SpectralBio tests a simpler complementary hypothesis: mutation-induced changes in the local covariance structure of ESM2 hidden states may carry pathogenicity signal that likelihood-only and eigenvalue-only summaries do not exhaust.
Apply p-curve analysis to 500 meta-analyses from Psychological Bulletin and Psychological Review (2010-2023). Under true effects, the distribution of significant p-values is expected to be right-skewed (more small p-values).
Apply 5 TI methods (Monocle3, Slingshot, PAGA, Palantir, scVelo) to 3 gold-standard datasets with known ground truth (synthetic + lineage tracing). Pairwise Kendall τ between pseudotime orderings: mean 0.
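As a rough illustration of the pairwise comparison described above, the sketch below computes Kendall's τ between every pair of pseudotime vectors and averages the result. It uses a plain tau-a (no tie correction) written from scratch; the method names and pseudotime values are hypothetical, and the study may well have used a tie-corrected variant such as the one in `scipy.stats.kendalltau`.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    No tie correction is applied; tied pairs count as neither
    concordant nor discordant.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def mean_pairwise_tau(orderings):
    """Return tau for every pair of methods and the mean over pairs."""
    names = sorted(orderings)
    taus = {
        (a, b): kendall_tau_a(orderings[a], orderings[b])
        for a, b in combinations(names, 2)
    }
    return taus, sum(taus.values()) / len(taus)


# Hypothetical pseudotime values for the same five cells under three methods.
pseudotime = {
    "method_a": [0.0, 0.2, 0.4, 0.6, 0.8],
    "method_b": [0.1, 0.3, 0.2, 0.7, 0.9],
    "method_c": [0.9, 0.7, 0.5, 0.3, 0.1],
}

taus, mean_tau = mean_pairwise_tau(pseudotime)
for pair, tau in taus.items():
    print(pair, round(tau, 3))
print("mean tau:", round(mean_tau, 3))
```

A reversed trajectory (here `method_c`) yields τ = -1 against the reference ordering, so in practice one may want to decide in advance whether root orientation should be aligned before averaging.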
Evaluate pose ranking for 285 CASF-2016 complexes using AutoDock Vina rescored with AMBER ff14SB, CHARMM36, and OPLS-AA/M force fields. The top-ranked pose agrees between force fields in only 41% of cases.
Quantify phylogenetic signal (Fritz-Purvis D statistic and Pagel's λ) across evolutionary rate classes in SARS-CoV-2, Influenza A/H3N2, and HIV-1. Signal decays exponentially with substitution rate: λ(r) = exp(-4.
Compare a neutral drift model against frequency-dependent selection as explanations for antibiotic resistance gene (ARG) frequency distributions in 3 databases (CARD, ResFinder, AMRFinderPlus) across 2,400 bacterial genomes. Neutral drift (Wright-Fisher with mutation) fits observed frequency spectra with KS p>0.
Compare CLR, ALR, ILR, and raw relative abundance on 4 published microbiome-disease association datasets (IBD, obesity, colorectal cancer, diabetes). The 'winning' method (highest number of significant associations at FDR<0.
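For readers unfamiliar with the transforms being compared, the centered log-ratio (CLR) is the simplest to write down: each component's log minus the mean log of the sample. The sketch below is illustrative only; the `pseudocount` default of 0.5 is an assumption for handling zeros and is not taken from the study, and ALR and ILR are not shown.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample.

    clr(x)_i = log(x_i) - mean_j(log(x_j)).
    The pseudocount (an illustrative assumption) avoids log(0);
    with pseudocount=0 every count must be strictly positive.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical taxon counts for one sample, including a zero.
sample = [120, 30, 0, 50]
clr_values = clr(sample)
print([round(v, 3) for v in clr_values])

# Without a pseudocount, CLR is invariant to sequencing depth:
# scaling every count by the same factor leaves the result unchanged.
shallow = clr([1, 2, 4], pseudocount=0)
deep = clr([10, 20, 40], pseudocount=0)
print(all(abs(a - b) < 1e-12 for a, b in zip(shallow, deep)))
```

CLR values always sum to zero within a sample, which is one reason results from CLR and from raw relative abundance are not directly comparable feature by feature.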