Hamiltonian Monte Carlo (HMC) with dual averaging step size adaptation is the gold standard for sampling continuous distributions, but sharp non-asymptotic mixing time bounds have been elusive. We prove that for strongly log-concave targets with condition number $\kappa$ in $d$ dimensions, HMC with dual averaging achieves $\epsilon$-mixing in total variation using $O(d^{1/4} \kappa^{1/4} \log(1/\epsilon))$ gradient evaluations.
This paper develops new statistical methodology for calibration of weather ensemble forecasts via distributional regression reduces crps by 31%: a 10-year verification study. We propose a Bayesian hierarchical framework that jointly models multiple sources of uncertainty while accounting for complex dependence structures including spatial, temporal, and measurement error components.
We investigate a fundamental computational challenge in modern Bayesian statistics: unbiased mcmc via couplings removes all burn-in bias: practical guidelines requiring only 2x the computational cost. Through rigorous theoretical analysis and extensive numerical experiments, we characterize the conditions under which existing algorithms fail and propose a novel correction that restores reliable performance.
This paper develops new statistical methodology for record linkage without unique identifiers achieves 98.5% precision using bayesian fellegi-sunter with informative priors: a census application.
Non-centered parameterizations (NCPs) are widely recommended for hierarchical Bayesian models when group-level variance is small, yet the choice between centered and non-centered forms is typically manual. We present AutoReparam, an automatic reparameterization selection algorithm using a pilot MCMC run of 500 iterations.
Score function estimators (SFEs) are the dominant approach for gradient estimation in models with discrete latent variables, yet their high variance remains a critical bottleneck. We present a systematic evaluation of Rao-Blackwellization strategies applied to SFEs across 12 discrete latent variable architectures and 8 benchmark datasets.
We investigate a fundamental computational challenge in modern Bayesian statistics: stein variational gradient descent collapses in high dimensions: mode coverage drops below 50% for d > 20. Through rigorous theoretical analysis and extensive numerical experiments, we characterize the conditions under which existing algorithms fail and propose a novel correction that restores reliable performance.
We present new results on shannon capacity with applications to lovasz theta. Our main theorem establishes sharp bounds that improve upon the best previously known results, settling a conjecture in the affirmative for the cases considered.
Motor Cortex Population Dynamics Lie on a 6-Dimensional Manifold Regardless of Task Complexity. Analysis of 12 Reaching Tasks in Macaques We present a comprehensive quantitative analysis that challenges conventional understanding.
We present new results on coding theory with applications to difference sets. Our main theorem establishes sharp bounds that improve upon the best previously known results, settling a conjecture in the affirmative for the cases considered.
We present new results on computational geometry with applications to triangulation. Our main theorem establishes sharp bounds that improve upon the best previously known results, settling a conjecture in the affirmative for the cases considered.
Grid cells in the medial entorhinal cortex fire at regular spatial intervals, forming hexagonal grids that tile the environment. The dominant oscillatory interference model proposes that grid patterns emerge from the interaction of two oscillatory frequencies.
Protein-protein binding affinity prediction has long relied on shape complementarity metrics as primary features. We challenge this paradigm through a meta-analysis of 5,000 protein-protein complexes from the PDBbind and SKEMPI databases, demonstrating that electrostatic surface complementarity is the dominant predictor of binding affinity, explaining 47% of variance compared to 23% for shape complementarity alone.
Theory of Mind (ToM) benchmarks report that GPT-4 class models achieve 85-95% accuracy on false belief tasks, approaching or matching human performance. We demonstrate that these benchmarks systematically overestimate LLM social cognition by approximately 40% due to textual cue leakage.
We systematically measure prompt sensitivity in GPT-4 class models across 12 NLP benchmarks, varying prompt length from 10 to 5,000 tokens. Contrary to the assumption that longer prompts yield more stable outputs, we discover a U-shaped sensitivity curve: performance variance is high for very short prompts (10-50 tokens), reaches a minimum at medium lengths (200-500 tokens), and increases again for long prompts (2,000-5,000 tokens).
Classical information-theoretic generalization bounds based on mutual information between the training set and the learned hypothesis are notoriously loose, often exceeding trivial bounds by orders of magnitude. We show that replacing mutual information I(S;W) with conditional mutual information I(W;Z_i|Z_{-i})---the information the hypothesis retains about each individual training example given the rest---tightens bounds by 3 orders of magnitude on standard benchmarks.
We analyze sparse attention patterns in autoregressive language models across 8 architectures ranging from 125M to 70B parameters. Using a novel attention topology metric based on persistent homology, we discover that attention heads in layers 12 and beyond converge to masks that align with document structure elements (paragraphs, sections, lists) with 0.
We demonstrate that membership inference attacks against fine-tuned large language models achieve 0.95 AUC using only output token probabilities, without access to model parameters or gradients.
Diffusion models have achieved state-of-the-art image generation quality as measured by FID and IS scores. However, we demonstrate that these metrics mask a critical failure mode: anatomically implausible human hands.
Distributed tracing is foundational to microservice observability, yet its performance overhead is poorly quantified, particularly at tail latencies. We instrument 23 production microservice deployments across 4 organizations, measuring tracing overhead at the 50th, 95th, and 99th percentiles of CPU utilization.