2604.00791 Bayesian and Frequentist A/B Tests Disagree on 12 Percent of Decisions at N Equals 10000
Simulate 100,000 A/B tests at N=100-100000 per arm with true effect sizes from δ=0 to δ=0.3.
Statistical theory, methodology, applications, machine learning, and computation.
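The kind of simulation described for the A/B-test study can be sketched as follows. This is a minimal illustration, not the paper's procedure: the listing does not state the outcome model, the priors, or the decision rules, so everything here is an assumption. Outcomes are taken to be binary conversions, δ is read as an absolute lift in conversion rate, the frequentist rule is a two-sided pooled z-test at α = 0.05, and the Bayesian rule ships B when P(B > A) exceeds 0.95 under Beta(1, 1) priors, with that probability computed by a normal approximation to the Beta posteriors. All function names are hypothetical.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def simulate_arm(n, rate, rng):
    """Number of conversions among n users with the given true rate."""
    return sum(rng.random() < rate for _ in range(n))


def frequentist_ships(conv_a, conv_b, n, alpha=0.05):
    """Assumed rule: two-sided pooled z-test, ship B if significant and better."""
    p_a, p_b = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    z = (p_b - p_a) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return p_value < alpha and p_b > p_a


def bayesian_ships(conv_a, conv_b, n, threshold=0.95):
    """Assumed rule: ship B if P(B > A) > threshold under Beta(1, 1) priors.

    P(B > A) uses a normal approximation to the two Beta posteriors.
    """
    def beta_moments(conv):
        a, b = 1 + conv, 1 + n - conv
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1))
        return mean, var

    m_a, v_a = beta_moments(conv_a)
    m_b, v_b = beta_moments(conv_b)
    prob_b_better = normal_cdf((m_b - m_a) / math.sqrt(v_a + v_b))
    return prob_b_better > threshold


def disagreement_rate(n, base_rate, delta, n_sims, seed=0):
    """Fraction of simulated tests where the two rules decide differently."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(n_sims):
        conv_a = simulate_arm(n, base_rate, rng)
        conv_b = simulate_arm(n, base_rate + delta, rng)
        if frequentist_ships(conv_a, conv_b, n) != bayesian_ships(conv_a, conv_b, n):
            disagreements += 1
    return disagreements / n_sims


if __name__ == "__main__":
    # Illustrative settings only; not the grid used in the study.
    print(disagreement_rate(n=500, base_rate=0.10, delta=0.02, n_sims=200))
```

The disagreement rate this produces depends heavily on the assumed thresholds: a two-sided α = 0.05 test is stricter than a 0.95 posterior-probability cutoff under a flat prior, so the two rules will differ mainly on borderline results.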
Apply p-curve analysis to 500 meta-analyses from Psychological Bulletin and Psychological Review (2010-2023). Expected distribution under true effects: right-skewed (more small p-values).
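The right-skew expectation in the p-curve entry can be illustrated with a short simulation. This is a sketch under stated assumptions, not the analysis applied to the 500 meta-analyses: each study is modeled as a single two-sided z-test whose statistic is drawn from a normal distribution centered on a hypothetical true effect `true_z`, and skew is summarized by the share of significant p-values falling below α/2. Under a null effect that share should sit near one half; under a true effect it should be clearly larger.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value of a z statistic: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def share_below_half_alpha(true_z, n_studies, alpha=0.05, seed=0):
    """Among significant results, the fraction with p < alpha / 2.

    true_z is a hypothetical noncentrality: the expected z statistic
    of each simulated study (0 means no true effect).
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_studies):
        z = rng.gauss(true_z, 1.0)
        p = two_sided_p(z)
        if p < alpha:
            significant.append(p)
    if not significant:
        return float("nan")
    low = sum(p < alpha / 2 for p in significant)
    return low / len(significant)


if __name__ == "__main__":
    print("null effect:", share_below_half_alpha(0.0, 20000))
    print("true effect:", share_below_half_alpha(2.5, 20000))
```

A full p-curve analysis works with the whole distribution of significant p-values rather than a single split at α/2; the two-bin summary here is only the simplest way to make the skew visible.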
Re-examine 200 published TWFE DiD studies with staggered treatment adoption from 15 economics journals (2010-2023). Apply Callaway-Sant'Anna (CS) and Sun-Abraham (SA) estimators alongside original TWFE.
Apply 3 bandwidth selection methods (Imbens-Kalyanaraman IK, Calonico-Cattaneo-Titiunik CCT, rule-of-thumb ROT) to 50 published RD studies from top-5 economics journals. Bandwidth estimates: median IK/CCT ratio = 1.
Simulation study: generate RCT data with known CATE functions (linear, nonlinear, interaction) at N=200-20000. Apply 4 HTE estimation methods: causal forests, X-learner, R-learner, Bayesian CART.
Re-analyze 100 published synthetic control studies from top economics journals. For each, systematically vary the donor pool: remove 1, 2, or 5 donors (all combinations up to 1000 draws).
Monte Carlo simulation (10,000 replications) of first-stage F-test, Cragg-Donald, and Kleibergen-Paap statistics for IV strength at N=50-5000. At N=200, the F>10 rule rejects a truly strong instrument (first-stage R²=0.
Microsimulation using Consumer Expenditure Survey (N=24,000 households) at carbon prices $25, $50, $100/tCO₂. At $50/tCO₂: urban burden 1.
Analyze 50,000 gig workers across 5 platforms (Uber, Lyft, DoorDash, Instacart, TaskRabbit) over 24 months. Monthly churn rate follows log-normal (μ=-2.
Collect delivery fee data from 3 platforms (DoorDash, Uber Eats, Grubhub) across 200 US cities over 6 months (2.4M transactions).
Analyze 12,000 workers across 84 firms using commute distance as instrument for remote work eligibility. OLS: remote workers 12.
Apply 5 TI methods (Monocle3, Slingshot, PAGA, Palantir, scVelo) to 3 gold-standard datasets with known ground truth (synthetic + lineage tracing). Pairwise Kendall τ between pseudotime orderings: mean 0.
Downsample 5 scRNA-seq datasets (10X Chromium) from 10,000 to 500 UMIs/cell. Cell cycle classification accuracy (Seurat, cyclone) degrades from 82% to 41%.
Compare 5 CUB metrics (CAI, tAI, ENC, CBI, RSCU) against protein abundance (PaxDb) in E. coli, S.
Quantify phylogenetic signal (Fritz-Purvis D statistic and Pagel's λ) across evolutionary rate classes in SARS-CoV-2, Influenza A/H3N2, and HIV-1. Signal decays exponentially with substitution rate: λ(r) = exp(-4.
Compare neutral drift model vs frequency-dependent selection for ARG frequency distributions in 3 databases (CARD, ResFinder, AMRFinderPlus) across 2,400 bacterial genomes. Neutral drift (Wright-Fisher with mutation) fits observed frequency spectra with KS p>0.
Compare CLR, ALR, ILR, and raw relative abundance on 4 published microbiome-disease association datasets (IBD, obesity, colorectal cancer, diabetes). The 'winning' method (highest number of significant associations at FDR<0.
Benchmark ML survival models (Cox-PH, RSF, DeepSurv, Cox-nnet) on genomics/transcriptomics/proteomics features vs TNM clinical staging alone across 12 TCGA cohorts (N=5,847). Mean C-index: clinical staging 0.
Apply rigorous statistical tests (Clauset-Shalizi-Newman framework) to degree distributions of 6 PPI databases (BioGRID, STRING, IntAct, MINT, DIP, HPRD). Power-law fits are rejected (p<0.
Batch effects are a major confounder in genomics, and multiple correction methods exist. We compare ComBat, limma removeBatchEffect, Harmony, scVI, and MNN on 5 paired RNA-seq datasets where the same biological comparison was performed in two independent batches.