This paper investigates the relationship between goal misgeneralization and reward models through controlled experiments on 16 diverse datasets totaling 12,675 samples. We propose a novel methodology that achieves 11.
This paper investigates the relationship between tokenization and cross-lingual transfer through controlled experiments on 24 diverse datasets totaling 39,828 samples. We propose a novel methodology that achieves 13.
We present a systematic empirical study examining machine translation across 14 benchmarks and 31,445 evaluation instances. Our analysis reveals that quality estimation plays a more critical role than previously recognized, achieving 0.
We present a systematic empirical study examining vision transformers across 26 benchmarks and 14,511 evaluation instances. Our analysis reveals that patch size plays a more critical role than previously recognized, achieving 0.
We conduct the largest study to date on mutation testing, analyzing 37,945 instances across 5 datasets spanning multiple domains. Our key finding is that semantic diversity accounts for 17.
This paper investigates the relationship between continuous integration and build failures through controlled experiments on 23 diverse datasets totaling 27,487 samples. We propose a novel methodology that achieves 14.
We present a systematic empirical study examining semantic segmentation across 9 benchmarks and 36,089 evaluation instances. Our analysis reveals that satellite imagery plays a more critical role than previously recognized, achieving 0.
We conduct the largest study to date on backtracking, analyzing 38,847 instances across 12 datasets spanning multiple domains. Our key finding is that search accounts for 32.
This paper investigates the relationship between relation extraction and cross-lingual transfer through controlled experiments on 15 diverse datasets totaling 10,058 samples. We propose a novel methodology that achieves 12.
We present a large-scale failure analysis of tool-using autonomous agents across 50,247 execution trajectories spanning 12 agentic benchmarks. Contrary to the prevailing hypothesis that planning errors dominate agent failures, we find that 61.
Classical stability margins---gain margin (GM) and phase margin (PM)---remain the primary robustness indicators taught in control engineering curricula and applied in industrial practice. Both margins are derived from the loop transfer function evaluated on the Nyquist contour, yet they quantify robustness against different perturbation types: GM against multiplicative gain uncertainty and PM against pure time-delay uncertainty.
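A minimal frequency-sweep sketch of both margins, assuming a hypothetical third-order loop L(s) = K / (s (s+1) (s+2)) rather than any system from the abstract:

```python
import numpy as np

def margins(K=1.0):
    """Estimate GM and PM of L(s) = K / (s (s+1) (s+2)) from a dense
    frequency sweep (illustrative loop; not the paper's plant)."""
    w = np.logspace(-2, 2, 200000)
    s = 1j * w
    L = K / (s * (s + 1) * (s + 2))
    mag = np.abs(L)
    ph_deg = np.degrees(np.unwrap(np.angle(L)))  # starts near -90 deg

    # Gain crossover (|L| = 1): PM = 180 deg + phase there.
    # mag is monotone decreasing, so reverse arrays for np.interp.
    wgc = np.interp(1.0, mag[::-1], w[::-1])
    pm = 180.0 + np.interp(wgc, w, ph_deg)

    # Phase crossover (phase = -180 deg): GM = 1 / |L| there.
    wpc = np.interp(-180.0, ph_deg[::-1], w[::-1])
    gm = 1.0 / np.interp(wpc, w, mag)
    return gm, pm
```

For K = 1 this loop has a phase crossover at ω = √2 rad/s, giving GM = 6 (≈ 15.6 dB), and a phase margin of roughly 53°; the sweep-and-interpolate approach recovers both without symbolic root-finding.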
Portfolio diversification admits multiple quantitative definitions, yet practitioners rarely examine whether different metrics yield the same qualitative conclusion about sector concentration. We compute five diversification metrics---the Herfindahl-Hirschman Index (HHI), Shannon entropy, effective number of bets, the Choueifaty-Coignard diversification ratio, and maximum drawdown contribution share---for the 11 Global Industry Classification Standard (GICS) sectors using publicly available S&P 500 market-capitalization weights.
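Two of the five metrics reduce to one-line computations on the weight vector. A sketch with a hypothetical 11-sector weight vector (illustrative numbers, not actual S&P 500 weights), using the standard inverse-HHI and entropy-based effective-number definitions; the diversification ratio and drawdown-share metrics additionally require return and covariance data and are omitted:

```python
import numpy as np

# Hypothetical market-cap weights for 11 sectors (sum to 1).
w = np.array([0.28, 0.13, 0.13, 0.11, 0.10,
              0.09, 0.06, 0.04, 0.03, 0.02, 0.01])

hhi = float(np.sum(w ** 2))              # Herfindahl-Hirschman Index
shannon = float(-np.sum(w * np.log(w)))  # Shannon entropy (nats)
n_eff_hhi = 1.0 / hhi                    # inverse-HHI effective number
n_eff_ent = float(np.exp(shannon))       # entropy-based effective number
```

Both effective numbers are bounded above by 11 (equal weights), and the entropy-based count always weakly exceeds the inverse-HHI count, so the two can rank the same portfolio's concentration differently in degree but not in direction.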
A pervasive assumption in software engineering practice is that code review duration scales primarily with diff size, measured as lines added plus lines deleted. This assumption underpins tooling that flags large diffs, team policies that encourage smaller pull requests, and scheduling heuristics that allocate reviewer time proportional to change magnitude.
Phylogenetic signal, the tendency of closely related species to resemble each other more than expected by chance, is routinely quantified by two metrics: Blomberg's K and Pagel's lambda. Both equal unity under Brownian motion, yet they capture different aspects of trait distribution across a phylogeny.
Empirical scaling laws of the form Y = aX^alpha are ubiquitous in physics, yet the dimensional consistency of the reported prefactor a is rarely examined. When X and Y carry physical dimensions, the prefactor must have dimensions [Y][X]^{-alpha} to render the equation dimensionally homogeneous, and these dimensions generally depend on the numerical value of the fitted exponent.
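A one-line worked example of the constraint, using a hypothetical fit rather than one reported in the paper: if Y is a period in seconds and X a length in metres, a fitted exponent α = 1.47 forces

```latex
[a] = [Y]\,[X]^{-\alpha} = \mathrm{s}\cdot\mathrm{m}^{-1.47},
\qquad
X \mapsto cX \;\Rightarrow\; a \mapsto a\,c^{-\alpha},
```

so merely converting the length unit (e.g. metres to centimetres, c = 100) rescales the prefactor by an irrational power of the conversion factor that depends on the fitted α.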
Pearson's r, Spearman's rho, and Kendall's tau are the three most widely used measures of bivariate association, yet practitioners rarely consider that these coefficients can disagree not merely in magnitude but in sign. We derive exact analytical conditions under which sign disagreement occurs between pairs of these measures as a function of marginal skewness and copula structure.
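A constructed toy dataset (not data from the paper) makes the sign-disagreement phenomenon concrete: nine points on a strictly decreasing line plus one extreme outlier flip Pearson's r positive while both rank-based coefficients stay negative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Strictly decreasing trend for 9 points, then one extreme outlier.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 1000], dtype=float)

r, _ = pearsonr(x, y)      # moment-based: dominated by the outlier
rho, _ = spearmanr(x, y)   # rank-based: reflects the decreasing bulk
tau, _ = kendalltau(x, y)  # pair-based: 36 of 45 pairs are discordant
```

Here r > 0 while ρ < 0 and τ < 0: a single point can carry the entire sign of the moment-based coefficient without changing more than one rank.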
Cross-lingual transfer in multilingual language models is commonly explained by typological similarity between languages, measured through features such as word order, morphological complexity, and phonological inventory. We propose a simpler and more proximate predictor: the Vocabulary Overlap Ratio (VOR), defined as the Jaccard similarity between the subword token sets that a multilingual tokenizer assigns to monolingual corpora in two languages.
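The VOR definition is a direct set computation. A minimal sketch with toy subword vocabularies (the function name and the token strings are illustrative, not from the paper):

```python
def vocab_overlap_ratio(tokens_a, tokens_b):
    """Jaccard similarity between the subword token sets assigned to
    two monolingual corpora (VOR sketch; inputs are illustrative)."""
    a, b = set(tokens_a), set(tokens_b)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Two hypothetical per-language token sets sharing 3 of 5 subwords.
vor = vocab_overlap_ratio(["_the", "_un", "ion", "al"],
                          ["_un", "ion", "al", "_le"])
```

In practice the inputs would be the token types a shared multilingual tokenizer emits over each monolingual corpus, so VOR requires no typological annotation at all.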
Gross Domestic Product can be measured via three conceptually equivalent approaches: expenditure, production (value-added), and income. National accounting identities guarantee their theoretical equality, yet in practice the three estimates diverge due to measurement error, survey timing, and revision practices.
Multiple testing correction is a routine component of statistical analysis, yet the choice among correction methods (Bonferroni, Holm, Benjamini-Hochberg FDR) is often treated as a technical detail rather than a consequential analytical decision. We surveyed 200 papers published between 2020 and 2023 in five journals (Nature, Science, PNAS, JAMA, PLoS ONE) that reported results from multiple simultaneous hypothesis tests.
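The three corrections named above differ in how they inflate p-values, which is why the choice is consequential. A self-contained sketch of the Holm and Benjamini-Hochberg adjustments (Bonferroni is simply min(1, m·p)):

```python
import numpy as np

def holm(p):
    """Holm step-down adjusted p-values (controls FWER)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    running = 0.0
    for rank, i in enumerate(order):
        # multiplier shrinks from m down to 1; enforce monotonicity
        running = max(running, (m - rank) * p[i])
        adj[i] = min(1.0, running)
    return adj

def benjamini_hochberg(p):
    """BH step-up adjusted p-values, i.e. q-values (controls FDR)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = np.empty(m)
    prev = 1.0
    for rank in range(m - 1, -1, -1):  # walk from largest p downward
        i = order[rank]
        prev = min(prev, p[i] * m / (rank + 1))
        adj[i] = prev
    return adj
```

For p = [0.001, 0.01, 0.02, 0.04, 0.2] at α = 0.05, Bonferroni rejects only the first two hypotheses, Holm the first two as well, and BH the first four, so the same results table can support different headline claims depending on the method.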
Microbiome sequencing yields compositional data: read counts for each taxon represent relative abundances constrained to sum to a constant. Applying standard statistical methods (Pearson correlation, linear regression, t-tests on proportions) to such data produces spurious associations because an increase in one component mechanically forces decreases in others.
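The standard remedy for this constraint is a log-ratio transform. A minimal sketch of the centered log-ratio (CLR), with a small pseudocount to handle zero reads (the pseudocount value is an illustrative choice, not the paper's):

```python
import numpy as np

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of taxon counts (sketch).

    Divides each component by the geometric mean and takes logs,
    mapping the simplex to unconstrained real space where standard
    correlation and regression methods are valid.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    logx = np.log(x)
    return logx - logx.mean(axis=-1, keepdims=True)
```

Each transformed sample sums to zero by construction, so one coordinate is redundant, but the mechanical negative dependence between raw proportions is removed before downstream analysis.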