Single-cell RNA sequencing has become the dominant technology for characterizing cellular heterogeneity, yet the stability of computational cell-type assignments remains poorly quantified. We systematically evaluated clustering reproducibility by running the standard Seurat pipeline (PCA dimensionality reduction, UMAP embedding, Louvain community detection) across 100 random seeds on each of 10 published scRNA-seq datasets spanning 847,000 cells total.
Whole-brain multivariate pattern analysis is widely assumed to outperform region-of-interest approaches by leveraging distributed neural representations. We tested this assumption by training linear support vector machine decoders on six fMRI task datasets—including the Human Connectome Project working memory and motor tasks, the Haxby face/object paradigm, and three additional cognitive paradigms—systematically varying the number of ANOVA-selected voxels from 10 to 5,000.
Molecular docking scoring functions remain central to computational drug discovery pipelines, yet their quantitative accuracy against experimental binding affinities is rarely audited at scale. We benchmarked four widely deployed scoring functions—AutoDock Vina, Glide SP, GOLD ChemScore, and RF-Score—against 5,316 protein-ligand complexes from the PDBbind v2020 refined set, computing Pearson correlations between predicted scores and experimental -log(Ki/Kd) values.
Stan's Hamiltonian Monte Carlo sampler relies on automatic differentiation (AD) to compute gradients of the log-posterior density. These gradients are assumed to be exact, but numerical issues in user-written models can cause the AD gradient to diverge from the true mathematical gradient.
Bayesian prediction intervals for time series forecasting carry an implicit promise: a nominal 95% interval should contain the realized value 95% of the time. We audited 120 published forecasting papers that report Bayesian prediction intervals, recomputing empirical coverage on held-out data using original code and data where available (n=47) and calibrated simulation otherwise (n=73).
Standard Markov chain Monte Carlo convergence diagnostics assume that chains have mixed across the full support of the target distribution, an assumption violated whenever the posterior is multimodal. We construct 500 synthetic multimodal targets (mixtures of 2-8 Gaussians in 5-50 dimensions) and run four samplers (HMC, NUTS, Gibbs, Metropolis-Hastings) on each, then apply five convergence diagnostics: classical R-hat, split-R-hat, effective sample size, Geweke's spectral test, and visual trace-plot assessment.
Probability calibration of clinical risk models degrades over time as patient populations shift, yet no standardized metric quantifies this deterioration rate. We introduce the Calibration Decay Index (CDI), defined as the rate parameter in a logarithmic model of expected calibration error (ECE) growth over temporal displacement.
LATAM-RX adjusts rheumatology clinical decision support for Latin American practice realities including TB burden, insurance formulary limitations (IMSS/ISSSTE), endemic infection screening, diagnostic delays, and access fragility. Four-domain composite with GLADEL/PANLAR/COPCORD references.
FLARE-BEFORE-FLARE models preclinical flare detection using wearable-derived digital biomarkers and patient-reported outcomes. Eight-domain personal z-score deviation with weighted composite scoring and pattern classification (inflammatory, musculoskeletal, fatigue-sleep).
RHEUM-POLYSHIELD aggregates retinal toxicity, glucocorticoid-induced osteoporosis, infection risk, and QT hazard flags into a unified safety profile for rheumatology patients under chronic immunomodulation. Four-domain weighted heuristic with text alerts.
LUPUS-DRIFT models systemic lupus erythematosus as a longitudinal trajectory problem integrating serologic activity, renal signals, treatment burden, and flare tendency with a Zamora-PCT bridge for infection-vs-flare differentiation. Literature-informed heuristic for transparent surveillance support.
SSc-COMPASS is a transparent multimodal risk-layering skill for systemic sclerosis integrating cutaneous subtype, serology, capillaroscopy, pulmonary physiology, HRCT burden, and cardiopulmonary markers. It classifies patients into ILD progression risk, vasculopathy risk, and PAH flag domains with weighted composite trajectory output.
We train 1200 models spanning 5 architectures, 8 weight decay values, 6 learning rates, and 5 random seeds on CIFAR-100 and ImageNet to map the joint loss landscape of weight decay and learning rate. The optimal weight decay follows a linear relationship with learning rate: lambda star equals rho times eta, where rho equals 0.
We construct the smallest known graded Artinian Gorenstein algebras whose Hilbert functions fail to be unimodal. In codimension 5 we exhibit an algebra with Hilbert function (1, 5, 15, 34, 55, 53, 55, 34, 15, 5, 1), featuring a dip at degree 5 that violates unimodality.
We train 480 models spanning 8 architectures, 6 RandAugment magnitude levels, and 10 random seeds on ImageNet-1K to measure the architecture-specific augmentation saturation point (ASP). CNNs reach saturation at magnitude 9, while Vision Transformers saturate later at magnitude 14.
Analog-to-digital converter datasheets report effective number of bits (ENOB), but this single figure conceals a nonlinear transition in how quantization noise accumulates as resolution increases. We define the Quantization Degradation Index (QDI) as the gap between ideal and measured signal-to-noise ratio and characterize it across a full factorial design of 7 converter architectures, 5 signal types, 9 resolutions (4 to 20 bits), and 9 oversampling ratios (1x to 256x), totalling 2,835 configurations tested in calibrated simulation.
Minor surface-level changes to a prompt — synonym substitution, whitespace adjustment, instruction reordering — can shift large language model accuracy by double-digit percentage points, yet no quantitative law describes how this fragility evolves with the number of in-context examples. We define the Prompt Sensitivity Index (PSI) as the standard deviation of accuracy across 50 semantically equivalent rephrasings of the same prompt template and measure it for 6 LLMs on 4 benchmarks at 7 context lengths from zero-shot to 32-shot.
Flux Balance Analysis (FBA) predicts gene essentiality by simulating single-gene knockouts in genome-scale metabolic models. We ask: how well does FBA-predicted essentiality rank antimicrobial drug targets, and when does adding flux topology improve the ranking?
The minimum dominating set problem in Kneser graphs K(n,k) is a classical question in combinatorial optimization, yet the monotonicity of the domination number gamma(K(n,k)) in n for fixed k has remained unresolved for k >= 3. We introduce the Spectral Degeneracy Index (SDI), defined as the ratio of the second-largest eigenvalue to the algebraic connectivity, and prove that non-monotonicity of gamma occurs precisely when SDI exceeds an explicitly computable threshold tau_k.
Subword tokenizers underpin every modern language model, yet their coverage characteristics across the world's languages remain poorly quantified. We introduce the Fertility-Gap Predictor (FGP), a diagnostic framework that exactly enumerates the character-to-subword mapping for every Unicode codepoint attested in 47 languages across 8 widely deployed tokenizers (GPT-4 cl100k, LLaMA-3 tiktoken, Gemma SentencePiece, Mistral SentencePiece, BLOOM BPE, mBERT WordPiece, XLM-R SentencePiece, and Qwen BPE).