2604.01356 Codon Pair Bias, Not Individual Codon Bias, Predicts Protein Abundance in Human Tissues with R-Squared 0.61
Statistical theory, methodology, applications, machine learning, and computation.
Three Null Models Reveal That Wobble-Position GC Content, Not Selection, Drives Codon Usage Bias in 847 Bacterial Genomes. We present a comprehensive quantitative analysis that challenges conventional understanding.
We provide causal evidence that trade liberalization widens the urban–rural wage gap: a general equilibrium analysis of 27 developing economies. Our identification strategy combines quasi-experimental variation with state-of-the-art econometric techniques including difference-in-differences with staggered treatment adoption, instrumental variables estimation, and regression discontinuity designs.
Ribosome Profiling Reveals That Rare Codons Accelerate, Not Decelerate, Translation at 2,341 Co-Translational Folding Boundaries. We present a comprehensive quantitative analysis that challenges conventional understanding.
We provide causal evidence that refugee inflows increase host country innovation by 12% in border regions: patent evidence from Turkey and Jordan. Our identification strategy combines quasi-experimental variation with state-of-the-art econometric techniques including difference-in-differences with staggered treatment adoption, instrumental variables estimation, and regression discontinuity designs.
This paper investigates the econometric foundations underlying the finding that weak instruments bias IV estimates toward OLS by exactly (1 - 1/F) when errors are normal, a finite-sample result. Using a combination of Monte Carlo simulations, analytical derivations, and empirical applications, we demonstrate that conventional approaches suffer from previously unrecognized biases.
Information-Theoretic Decomposition of Mutual Information Between Genotype and Phenotype Reveals 40% Attributable to Epistatic Interactions in Yeast Fitness Landscapes. We present a comprehensive quantitative analysis that challenges conventional understanding.
The fitness cost of antibiotic resistance mutations is considered a key factor governing resistance dynamics, yet most estimates come from a handful of genetic backgrounds. We systematically measure the fitness cost of 12 common resistance mutations across 4,096 Escherichia coli genotypes constructed via combinatorial assembly of 12 neutral marker loci.
CpG dinucleotides are depleted in mammalian genomes due to spontaneous deamination of methylated cytosines, and this depletion has been proposed as the primary driver of codon usage bias. Using a causal inference framework (do-calculus and instrumental variable analysis) applied to 1,200 mammalian transcriptomes, we demonstrate that CpG depletion is necessary but not sufficient for codon bias.
Simpson's paradox, where a trend appearing in aggregated data reverses when stratified by a confounding variable, poses a fundamental threat to the validity of genome-wide association studies (GWAS) that aggregate across ancestral populations. We systematically re-analyze 8,400 genome-wide significant associations from the GWAS Catalog, stratifying each by five major continental ancestry groups (European, East Asian, South Asian, African, Admixed American).
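The reversal described in this abstract can be illustrated with a minimal sketch. The counts below are toy values chosen for illustration, not data from the study, and the group and key names (`group_1`, `carrier`, `noncarrier`) are hypothetical. Within each group carriers have the higher case rate, yet the pooled comparison points the other way because the groups differ in both carrier frequency and baseline risk.

```python
from fractions import Fraction

# Toy (cases, total) counts per stratum; hypothetical, not from the study.
counts = {
    "group_1": {"carrier": (81, 87), "noncarrier": (234, 270)},
    "group_2": {"carrier": (192, 263), "noncarrier": (55, 80)},
}

def rate(cases, total):
    """Case rate as an exact fraction."""
    return Fraction(cases, total)

def risk_difference(carrier, noncarrier):
    """Carrier case rate minus non-carrier case rate."""
    return rate(*carrier) - rate(*noncarrier)

def pooled(table, key):
    """Sum cases and totals for one exposure level across all strata."""
    cases = sum(group[key][0] for group in table.values())
    total = sum(group[key][1] for group in table.values())
    return cases, total

# Effect within each stratum, then the effect after aggregating.
stratified = {
    name: risk_difference(group["carrier"], group["noncarrier"])
    for name, group in counts.items()
}
aggregate = risk_difference(pooled(counts, "carrier"), pooled(counts, "noncarrier"))

for name, diff in stratified.items():
    print(f"{name}: risk difference = {float(diff):+.3f}")
print(f"pooled: risk difference = {float(aggregate):+.3f}")
```

The stratified differences are both positive while the pooled difference is negative, which is the sign reversal that stratifying by ancestry is meant to expose.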
Hidden Markov models (HMMs) are widely used for circadian rhythm analysis of actigraphy data, but standard HMMs assume geometric state-duration distributions that poorly capture the biology of circadian phase shifts. We develop Duration-HMM (D-HMM), which replaces geometric durations with explicit negative binomial duration distributions for each hidden state.
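The contrast between geometric and negative binomial state durations can be sketched as follows. This is an illustration only, not the D-HMM implementation: the shift by one time step, the integer shape parameter `r`, and all numeric values are assumptions made for the example. A self-transition probability in a standard HMM implies a duration distribution whose most likely value is always one step, whereas a negative binomial can place its mode at longer durations.

```python
from math import comb

def geometric_duration_pmf(d, stay_prob):
    """P(duration = d) implied by a standard HMM self-transition probability."""
    return stay_prob ** (d - 1) * (1.0 - stay_prob)

def negbin_duration_pmf(d, r, q):
    """Shifted negative binomial: d - 1 failures before the r-th success.

    Assumes an integer shape r and success probability q (illustrative choice).
    """
    k = d - 1
    return comb(k + r - 1, k) * (1.0 - q) ** k * q ** r

# Evaluate both duration distributions on d = 1, ..., 199 time steps.
durations = range(1, 200)
geo = [geometric_duration_pmf(d, 0.8) for d in durations]
nb = [negbin_duration_pmf(d, 4, 0.4) for d in durations]

# Most likely duration under each model.
geo_mode = max(durations, key=lambda d: geo[d - 1])
nb_mode = max(durations, key=lambda d: nb[d - 1])

print("geometric mode:", geo_mode)
print("negative binomial mode:", nb_mode)
```

Because the geometric probability mass function is strictly decreasing, no choice of self-transition probability can make a multi-step dwell time the most probable one, which is the limitation the explicit-duration formulation addresses.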
This paper investigates the econometric foundations underlying the finding that double machine learning estimators have 40% higher finite-sample bias than claimed, with evidence from 1,000 DGPs. Using a combination of Monte Carlo simulations, analytical derivations, and empirical applications, we demonstrate that conventional approaches suffer from previously unrecognized biases.
This paper investigates the econometric foundations underlying the finding that matrix completion methods for synthetic controls outperform convex weight estimators by 28% in RMSE, based on a comparison across 500 simulations. Using a combination of Monte Carlo simulations, analytical derivations, and empirical applications, we demonstrate that conventional approaches suffer from previously unrecognized biases.
Continuous-time Markov chain (CTMC) models are the foundation of phylogenetic inference, yet their adequacy at individual alignment sites is rarely tested. We perform posterior predictive checks on 500 protein families from Pfam using site-specific test statistics including mean substitution rate, rate variance, and compositional heterogeneity.
We provide causal evidence that remittances increase household consumption smoothing by 53% during droughts: mobile money vs. hawala channels in Somalia.
This paper investigates the econometric foundations underlying panel data models with interactive fixed effects: a nuclear norm penalization approach that outperforms PC by 35%. Using a combination of Monte Carlo simulations, analytical derivations, and empirical applications, we demonstrate that conventional approaches suffer from previously unrecognized biases.
We systematically measure prompt sensitivity in GPT-4 class models across 12 NLP benchmarks, varying prompt length from 10 to 5,000 tokens. Contrary to the assumption that longer prompts yield more stable outputs, we discover a U-shaped sensitivity curve: performance variance is high for very short prompts (10-50 tokens), reaches a minimum at medium lengths (200-500 tokens), and increases again for long prompts (2,000-5,000 tokens).
Classical information-theoretic generalization bounds based on mutual information between the training set and the learned hypothesis are notoriously loose, often exceeding trivial bounds by orders of magnitude. We show that replacing mutual information I(S;W) with conditional mutual information I(W;Z_i|Z_{-i})---the information the hypothesis retains about each individual training example given the rest---tightens bounds by 3 orders of magnitude on standard benchmarks.
We analyze sparse attention patterns in autoregressive language models across 8 architectures ranging from 125M to 70B parameters. Using a novel attention topology metric based on persistent homology, we discover that attention heads in layers 12 and beyond converge to masks that align with document structure elements (paragraphs, sections, lists) with 0.
This paper investigates the econometric foundations underlying the finding that synthetic control methods fail when pre-treatment fit is below R² = 0.85, using a placebo-based calibration.