This paper investigates the econometric foundations underlying matrix completion methods for synthetic controls outperform convex weight estimators by 28% in rmse: a comparison across 500 simulations. Using a combination of Monte Carlo simulations, analytical derivations, and empirical applications, we demonstrate that conventional approaches suffer from previously unrecognized biases.
Continuous-time Markov chain (CTMC) models are the foundation of phylogenetic inference, yet their adequacy at individual alignment sites is rarely tested. We perform posterior predictive checks on 500 protein families from Pfam using site-specific test statistics including mean substitution rate, rate variance, and compositional heterogeneity.
We provide causal evidence that remittances increase household consumption smoothing by 53% during droughts: mobile money vs. hawala channels in somalia.
This paper investigates the econometric foundations underlying panel data models with interactive fixed effects: a nuclear norm penalization approach that outperforms pc by 35%. Using a combination of Monte Carlo simulations, analytical derivations, and empirical applications, we demonstrate that conventional approaches suffer from previously unrecognized biases.
We systematically measure prompt sensitivity in GPT-4 class models across 12 NLP benchmarks, varying prompt length from 10 to 5,000 tokens. Contrary to the assumption that longer prompts yield more stable outputs, we discover a U-shaped sensitivity curve: performance variance is high for very short prompts (10-50 tokens), reaches a minimum at medium lengths (200-500 tokens), and increases again for long prompts (2,000-5,000 tokens).
Classical information-theoretic generalization bounds based on mutual information between the training set and the learned hypothesis are notoriously loose, often exceeding trivial bounds by orders of magnitude. We show that replacing mutual information I(S;W) with conditional mutual information I(W;Z_i|Z_{-i})---the information the hypothesis retains about each individual training example given the rest---tightens bounds by 3 orders of magnitude on standard benchmarks.
We analyze sparse attention patterns in autoregressive language models across 8 architectures ranging from 125M to 70B parameters. Using a novel attention topology metric based on persistent homology, we discover that attention heads in layers 12 and beyond converge to masks that align with document structure elements (paragraphs, sections, lists) with 0.
This paper investigates the econometric foundations underlying synthetic control methods fail when pre-treatment fit is below r² = 0.85: a placebo-based calibration.
Diffusion models have achieved state-of-the-art image generation quality as measured by FID and IS scores. However, we demonstrate that these metrics mask a critical failure mode: anatomically implausible human hands.
Continual learning methods are universally evaluated under a discrete task-boundary assumption, where distribution shifts occur instantaneously between clearly delineated tasks. We argue this assumption is ecologically invalid and demonstrate that five leading continual learning methods (EWC, SI, PackNet, ER, DER++) fail catastrophically when task boundaries are gradual.
We empirically characterize how inference-time compute scales with task performance for agentic AI workloads. Across 14 agentic benchmarks spanning web navigation, code generation with tool use, and multi-step reasoning, we find that performance follows a power law with exponent 0.
This paper investigates the relationship between morphology and pretraining through controlled experiments on 23 diverse datasets totaling 26,178 samples. We propose a novel methodology that achieves 9.
This study presents a comprehensive quantitative analysis of blocking events and its relationship to subseasonal prediction, drawing on multiple decades of observational data and high-resolution numerical simulations. We develop a novel statistical framework combining wavelet decomposition, Granger causality testing, and bootstrapped trend analysis to establish robust quantitative findings.
We present a systematic empirical study examining vision transformers across 16 benchmarks and 36,025 evaluation instances. Our analysis reveals that attention plays a more critical role than previously recognized, achieving 0.
We conduct the largest study to date on supply chain, analyzing 27,437 instances across 18 datasets spanning multiple domains. Our key finding is that ml security accounts for 25.
We conduct the largest study to date on genetic programming, analyzing 20,335 instances across 22 datasets spanning multiple domains. Our key finding is that symbolic regression accounts for 32.
This paper investigates the relationship between intrinsic motivation and exploration through controlled experiments on 26 diverse datasets totaling 10,885 samples. We propose a novel methodology that achieves 31.
We present a systematic empirical study examining gradient dynamics across 26 benchmarks and 46,591 evaluation instances. Our analysis reveals that phase transitions plays a more critical role than previously recognized, achieving 0.
This study presents a comprehensive quantitative analysis of volcanic eruptions and its relationship to repose intervals, drawing on multiple decades of observational data and high-resolution numerical simulations. We develop a novel statistical framework combining wavelet decomposition, Granger causality testing, and bootstrapped trend analysis to establish robust quantitative findings.
This paper investigates the relationship between curriculum learning and data geometry through controlled experiments on 12 diverse datasets totaling 46,152 samples. We propose a novel methodology that achieves 29.