Computational prediction of protein stability changes upon mutation (ΔΔG) underpins rational protein engineering, yet these predictions have not been systematically evaluated for directional bias. We benchmarked six widely used ΔΔG predictors—FoldX, Rosetta ddg_monomer, DynaMut2, MAESTRO, PoPMuSiC, and ThermoNet—on a curated ProTherm-derived test set of 2,648 single-point mutations with experimentally measured stability changes.
Single-cell RNA sequencing has become the dominant technology for characterizing cellular heterogeneity, yet the stability of computational cell-type assignments remains poorly quantified. We systematically evaluated clustering reproducibility by running the standard Seurat pipeline (PCA dimensionality reduction, UMAP embedding, Louvain community detection) across 100 random seeds on each of 10 published scRNA-seq datasets spanning 847,000 cells total.
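Seed-to-seed reproducibility of this kind is usually summarized by the adjusted Rand index (ARI) between pairs of seed-specific clusterings. A minimal stdlib sketch of the ARI itself, on illustrative labelings rather than Seurat output:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """Chance-corrected pair-counting agreement between two labelings."""
    n = len(a)
    sum_ab = sum(comb(v, 2) for v in Counter(zip(a, b)).values())
    sum_a = sum(comb(v, 2) for v in Counter(a).values())
    sum_b = sum(comb(v, 2) for v in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)     # E[index] under random labeling
    max_index = (sum_a + sum_b) / 2
    return (sum_ab - expected) / (max_index - expected)

# Two runs that disagree on a single cell's assignment:
run1 = [0, 0, 0, 1, 1, 1]
run2 = [0, 0, 1, 1, 1, 1]
print(round(adjusted_rand_index(run1, run2), 3))  # → 0.324
```

ARI is invariant to label permutation, which matters here because cluster IDs are arbitrary across seeds.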
Mutation rates are typically reported as genome-wide averages, yet individual genes within a single bacterium experience vastly different mutational pressures. We analyzed mutation accumulation experiment data spanning five bacterial species—Escherichia coli, Staphylococcus aureus, Mycobacterium tuberculosis, Pseudomonas aeruginosa, and Bacillus subtilis—encompassing 14,287 protein-coding genes and 38,412 observed de novo mutations.
Epigenetic clocks have become the dominant molecular estimators of biological age, yet systematic comparisons across clocks and tissues within the same individuals remain sparse. We applied four established epigenetic age predictors—Horvath's multi-tissue clock, Hannum's blood-based clock, PhenoAge, and GrimAge—to 500 samples spanning blood, liver, lung, and brain tissue from the Genotype-Tissue Expression (GTEx) project, where multiple tissues were available per donor.
Whole-brain multivariate pattern analysis is widely assumed to outperform region-of-interest approaches by leveraging distributed neural representations. We tested this assumption by training linear support vector machine decoders on six fMRI task datasets—including the Human Connectome Project working memory and motor tasks, the Haxby face/object paradigm, and three additional cognitive paradigms—systematically varying the number of ANOVA-selected voxels from 10 to 5,000.
Molecular docking scoring functions remain central to computational drug discovery pipelines, yet their quantitative accuracy against experimental binding affinities is rarely audited at scale. We benchmarked four widely deployed scoring functions—AutoDock Vina, Glide SP, GOLD ChemScore, and RF-Score—against 5,316 protein-ligand complexes from the PDBbind v2020 refined set, computing Pearson correlations between predicted scores and experimental -log(Ki/Kd) values.
Gene trees frequently conflict with species trees, but the magnitude, predictors, and functional distribution of this disagreement remain poorly quantified for most clades. We reconstructed a species tree from 150 fungal genomes using ASTRAL-III and compared it against individual maximum-likelihood gene trees for 2,000 single-copy orthologs identified via OrthoFinder.
Normalization is a prerequisite for meaningful differential expression analysis of RNA-seq data, yet the choice among competing methods is typically made without quantifying its downstream impact on biological conclusions. We applied five normalization approaches—TMM, DESeq2 median-of-ratios, upper quartile, FPKM, and TPM—to 20 published RNA-seq datasets spanning cancer (n=10) and immunology (n=10) studies, then ran identical DESeq2 differential expression pipelines on each normalized dataset.
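Of the five methods, FPKM and TPM differ only in the order of the two normalization steps, which is why TPM sums to a fixed total per sample and FPKM does not. A minimal numpy sketch with illustrative counts (not the study's data):

```python
import numpy as np

def fpkm(counts, lengths_bp):
    """Fragments Per Kilobase per Million: depth-normalize, then length-normalize."""
    per_million = counts.sum() / 1e6
    return counts / per_million / (lengths_bp / 1e3)

def tpm(counts, lengths_bp):
    """Transcripts Per Million: length-normalize first, then rescale to 1e6."""
    rate = counts / (lengths_bp / 1e3)
    return rate / rate.sum() * 1e6

counts = np.array([500.0, 1000.0, 1500.0])
lengths = np.array([1000.0, 2000.0, 3000.0])
print(tpm(counts, lengths).sum())   # TPM sums to 1e6 by construction; FPKM does not
```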
The Codon Adaptation Index (CAI) remains the dominant metric for predicting gene expression from sequence data in bacterial genomics, yet its dependence on an externally supplied reference set of highly expressed genes introduces an underappreciated source of variability. We computed CAI for all protein-coding genes across 500 complete bacterial genomes using four distinct reference sets: ribosomal protein genes, RNA-seq-validated highly expressed genes, the top 5% of genes ranked by codon usage frequency, and the original Sharp and Li reference set.
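The reference-set dependence follows directly from the definition: relative adaptiveness w is recomputed from whichever reference genes are supplied, and CAI is the geometric mean of w over a gene's codons. A toy sketch with hypothetical sequences and a deliberately truncated codon table:

```python
import math
from collections import Counter

# Two synonymous families suffice to show the computation; a real
# implementation would use the full genetic code.
SYNONYMS = {"GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
            "AAA": "K", "AAG": "K"}

def relative_adaptiveness(reference_codons):
    """w(c) = f(c) / max f(c') over synonymous codons, from the reference set."""
    counts = Counter(reference_codons)
    by_aa = {}
    for codon, aa in SYNONYMS.items():
        by_aa.setdefault(aa, []).append(codon)
    w = {}
    for codons in by_aa.values():
        m = max(counts.get(c, 0) for c in codons)
        for c in codons:
            w[c] = counts.get(c, 0) / m if m else 0.0
    return w

def cai(gene_codons, w):
    """Geometric mean of w over the gene's codons (zero-w codons skipped here)."""
    vals = [w[c] for c in gene_codons if w.get(c, 0) > 0]
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

w = relative_adaptiveness(["GCC"] * 8 + ["GCT"] * 2 + ["AAA"] * 5 + ["AAG"] * 5)
print(round(cai(["GCC", "GCT", "AAA"], w), 3))  # → 0.63
```

Swapping the reference set changes every w value, and hence every CAI, which is exactly the variability the study quantifies.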
Replication studies in psychology consistently find smaller effect sizes than the originals, a pattern attributed primarily to publication bias and questionable research practices. We investigated whether the time gap between original and replication studies independently predicts effect size shrinkage, after controlling for publication bias indicators and methodological characteristics.
Stan's Hamiltonian Monte Carlo sampler relies on automatic differentiation (AD) to compute gradients of the log-posterior density. These gradients are assumed to be exact, but numerical issues in user-written models can cause the AD gradient to diverge from the true mathematical gradient.
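A standard way to detect the divergence described here is to compare the supplied gradient against central finite differences. The model below is a toy stand-in, not Stan's AD machinery:

```python
import numpy as np

def log_post(theta):
    return -0.5 * np.sum(theta ** 2)      # standard-normal log density (up to const)

def grad_log_post(theta):
    return -theta                          # the exact analytic gradient

def fd_grad(f, theta, h=1e-5):
    """Central finite-difference approximation to the gradient of f."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

theta = np.array([0.3, -1.2, 2.0])
err = np.max(np.abs(grad_log_post(theta) - fd_grad(log_post, theta)))
print(err < 1e-6)   # True when the supplied gradient matches the density
```

A large discrepancy here flags exactly the kind of model where the sampler's exactness assumption fails.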
Propensity score subclassification partitions units into strata based on estimated propensity scores, then estimates treatment effects within each stratum. The number of strata K is a critical design parameter, yet Cochran's (1968) recommendation of K=5 has persisted for decades without a formal stability analysis.
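Subclassification itself can be sketched in a few lines: cut units at propensity-score quantiles, difference treated and control means within each stratum, and weight by stratum size. The dataset below is a hand-built toy:

```python
import numpy as np

def subclassification_ate(ps, treat, y, K=5):
    """Stratify on propensity-score quantiles; pool within-stratum mean diffs."""
    edges = np.quantile(ps, np.linspace(0, 1, K + 1))
    strata = np.clip(np.searchsorted(edges, ps, side="right") - 1, 0, K - 1)
    est = 0.0
    for s in range(K):
        m = strata == s
        if m.sum() == 0 or treat[m].min() == treat[m].max():
            continue  # stratum is empty or lacks one of the two arms
        diff = y[m & (treat == 1)].mean() - y[m & (treat == 0)].mean()
        est += diff * m.sum() / len(y)
    return est

ps = np.array([0.1, 0.1, 0.9, 0.9])
treat = np.array([0, 1, 0, 1])
y = np.array([0.0, 1.0, 2.0, 3.0])
print(subclassification_ate(ps, treat, y, K=2))  # → 1.0
```

The choice of K enters only through the quantile edges, which is what makes its stability amenable to direct analysis.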
Bayesian prediction intervals for time series forecasting carry an implicit promise: a nominal 95% interval should contain the realized value 95% of the time. We audited 120 published forecasting papers that report Bayesian prediction intervals, recomputing empirical coverage on held-out data using original code and data where available (n=47) and calibrated simulation otherwise (n=73).
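The recomputation is mechanical once held-out intervals are in hand: empirical coverage is simply the fraction of realized values falling inside their nominal intervals. A synthetic-data sketch:

```python
import numpy as np

def empirical_coverage(y, lower, upper):
    """Fraction of realized values inside their prediction intervals."""
    y, lower, upper = map(np.asarray, (y, lower, upper))
    return float(np.mean((y >= lower) & (y <= upper)))

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)
lo = np.full_like(y, -1.5)   # a "95%" interval that is in fact too narrow
hi = np.full_like(y, 1.5)
print(round(empirical_coverage(y, lo, hi), 2))  # well below the nominal 0.95
```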
Standard Markov chain Monte Carlo convergence diagnostics assume that chains have mixed across the full support of the target distribution, an assumption violated whenever the posterior is multimodal. We construct 500 synthetic multimodal targets (mixtures of 2-8 Gaussians in 5-50 dimensions) and run four samplers (HMC, NUTS, Gibbs, Metropolis-Hastings) on each, then apply five convergence diagnostics: classical R-hat, split-R-hat, effective sample size, Geweke's spectral test, and visual trace-plot assessment.
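Split-R-hat, one of the five diagnostics, halves each chain so that within-chain drift registers as between-"chain" disagreement. A numpy sketch on synthetic chains, including a case where two chains are stuck on a second mode:

```python
import numpy as np

def split_rhat(chains):
    """chains: array (m, n) of m chains with n draws each (n even)."""
    m, n = chains.shape
    halves = chains.reshape(2 * m, n // 2)          # split each chain in half
    means = halves.mean(axis=1)
    W = halves.var(axis=1, ddof=1).mean()           # within-half variance
    B = (n // 2) * means.var(ddof=1)                # between-half variance
    var_hat = (n // 2 - 1) / (n // 2) * W + B / (n // 2)
    return float(np.sqrt(var_hat / W))

rng = np.random.default_rng(1)
mixed = rng.normal(size=(4, 1000))                       # all chains on one mode
stuck = mixed + np.array([[0.0], [0.0], [5.0], [5.0]])   # two chains shifted to a far mode
print(round(split_rhat(mixed), 2), round(split_rhat(stuck), 2))
```

Note that even split-R-hat only flags multimodality when chains actually land on different modes; chains that all miss a mode pass undetected, which is the failure class the study targets.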
Generalized additive models (GAMs) fitted via penalized regression splines report an effective degrees of freedom (edf) for each smooth term, a quantity that controls inference, model comparison, and residual degrees of freedom. We reanalyze 80 published GAM analyses by refitting each model in mgcv under corrected boundary penalty handling and find that 60% underreport edf by 15-40%.
We introduce the Outlier Leverage Ratio (OLR), a Cook's distance analog tailored for random-effects meta-analysis that quantifies how much each study shifts the pooled effect estimate. Applying the OLR to 200 meta-analyses drawn from the Cochrane Database of Systematic Reviews, we find that removing studies exceeding the 4/k threshold reverses the direction or statistical significance of the pooled conclusion in 29% of cases.
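The OLR itself is defined in the paper; the generic leave-one-out pattern it builds on (how far the inverse-variance pooled estimate moves when study i is dropped) can be sketched with hypothetical data:

```python
import numpy as np

def pooled(effects, variances):
    """Inverse-variance weighted pooled effect."""
    w = 1.0 / variances
    return np.sum(w * effects) / np.sum(w)

def loo_shifts(effects, variances):
    """Absolute shift in the pooled estimate when each study is dropped."""
    full = pooled(effects, variances)
    shifts = []
    for i in range(len(effects)):
        mask = np.arange(len(effects)) != i
        shifts.append(abs(pooled(effects[mask], variances[mask]) - full))
    return np.array(shifts)

eff = np.array([0.10, 0.12, 0.11, 0.80])   # one outlying study
var = np.array([0.01, 0.01, 0.01, 0.01])
print(loo_shifts(eff, var).argmax())       # → 3
```

A Cook's-distance-style statistic standardizes these shifts before applying a cutoff such as 4/k.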
The variance inflation factor (VIF) with a threshold of 10 remains the dominant heuristic for detecting multicollinearity in regression analysis, yet this threshold was derived under asymptotic assumptions without explicit dependence on sample size. Through a simulation study comprising 100,000 Monte Carlo runs across 240 design configurations varying sample size (n = 30 to 10,000), number of predictors (p = 3 to 50), and true collinearity structure, we demonstrate that the VIF > 10 rule produces a 40% false negative rate at n = 50 and a 25% false positive rate at n = 5,000.
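The VIF itself is simple to compute: regress each predictor on the others and take 1/(1 − R²). A numpy sketch with one engineered near-collinear pair (simulated data, not the study's designs):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), R_j^2 from regressing column j on the rest."""
    X = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    out = []
    for j in range(1, X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
print(np.round(vif(np.column_stack([x1, x2, x3])), 1))  # x1, x2 inflated; x3 near 1
```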
The fragility index for dichotomous outcomes quantifies how many event status changes reverse a trial's statistical significance, but no analogous metric exists for time-to-event endpoints. We define the Concordance Fragility Index (CFI) as the minimum number of patient exclusions required to reverse the conclusion of a survival analysis — either flipping the hazard ratio across 1.
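For contrast, the established fragility index for dichotomous outcomes needs nothing beyond a 2×2 Fisher exact test, flipping event statuses until significance is lost. A stdlib sketch of that baseline metric (not the CFI, which the paper defines for survival data):

```python
import math

def fisher_two_sided(a, b, c, d):
    """Exact two-sided p for a 2x2 table [[a, b], [c, d]] with fixed margins."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d
    def p_table(x):   # hypergeometric probability of top-left cell = x
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)
    p_obs = p_table(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(p_table(x) for x in range(lo, hi + 1) if p_table(x) <= p_obs + 1e-12)

def fragility_index(e1, n1, e2, n2, alpha=0.05):
    """Min event-status flips in arm 1 that make the comparison nonsignificant."""
    if fisher_two_sided(e1, n1 - e1, e2, n2 - e2) >= alpha:
        return 0
    step = 1 if e1 < e2 else -1          # move arm-1 events toward arm 2
    for k in range(1, n1):
        e = e1 + step * k
        if not 0 <= e <= n1:
            break
        if fisher_two_sided(e, n1 - e, e2, n2 - e2) >= alpha:
            return k
    return None

print(fragility_index(2, 100, 20, 100))   # flips needed to lose significance
```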
Probability calibration of clinical risk models degrades over time as patient populations shift, yet no standardized metric quantifies this deterioration rate. We introduce the Calibration Decay Index (CDI), defined as the rate parameter in a logarithmic model of expected calibration error (ECE) growth over temporal displacement.
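Assuming the stated logarithmic form, ECE(t) ≈ a + CDI · log(1 + t), the CDI falls out of a least-squares fit as the slope of ECE against log(1 + t); the paper's exact parameterization may differ. A sketch on synthetic, noiseless data:

```python
import numpy as np

def calibration_decay_index(months, ece):
    """Slope of ECE against log(1 + t): larger = faster calibration decay."""
    x = np.log1p(np.asarray(months, dtype=float))
    slope, _intercept = np.polyfit(x, np.asarray(ece, dtype=float), 1)
    return float(slope)

t = np.arange(0, 25, 3)              # months since model deployment
ece = 0.02 + 0.015 * np.log1p(t)     # synthetic decay with planted rate 0.015
print(round(calibration_decay_index(t, ece), 3))  # → 0.015
```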
We train 1,200 models spanning 5 architectures, 8 weight decay values, 6 learning rates, and 5 random seeds on CIFAR-100 and ImageNet to map the joint loss landscape of weight decay and learning rate. The optimal weight decay follows a linear relationship with the learning rate, λ* = ρη, where ρ = 0.
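The reported value of ρ is truncated in this excerpt, so the sketch below uses a hypothetical ρ; it only illustrates the claimed scaling λ* = ρη:

```python
import numpy as np

rho = 0.05                                    # hypothetical coupling constant
etas = np.array([1e-3, 3e-3, 1e-2, 3e-2])     # candidate learning rates
lambdas = rho * etas                          # optimal weight decay under the fit
for eta, lam in zip(etas, lambdas):
    print(f"eta={eta:g}  lambda*={lam:g}")
```

Under this scaling, retuning the learning rate implies a proportional retuning of weight decay rather than an independent grid search.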