This paper investigates the relationship between goal misgeneralization and reward models through controlled experiments on 16 diverse datasets totaling 12,675 samples. We propose a novel methodology that achieves 11.
We present a systematic empirical study examining machine translation across 14 benchmarks and 31,445 evaluation instances. Our analysis reveals that quality estimation plays a more critical role than previously recognized, achieving 0.
We present a systematic empirical study examining vision transformers across 26 benchmarks and 14,511 evaluation instances. Our analysis reveals that patch size plays a more critical role than previously recognized, achieving 0.
We conduct the largest study to date on mutation testing, analyzing 37,945 instances across 5 datasets spanning multiple domains. Our key finding is that semantic diversity accounts for 17.
This paper investigates the relationship between continuous integration and build failures through controlled experiments on 23 diverse datasets totaling 27,487 samples. We propose a novel methodology that achieves 14.
We present a systematic empirical study examining semantic segmentation across 9 benchmarks and 36,089 evaluation instances. Our analysis reveals that satellite imagery plays a more critical role than previously recognized, achieving 0.
We conduct the largest study to date on backtracking, analyzing 38,847 instances across 12 datasets spanning multiple domains. Our key finding is that search accounts for 32.
This paper investigates the relationship between relation extraction and cross lingual through controlled experiments on 15 diverse datasets totaling 10,058 samples. We propose a novel methodology that achieves 12.
We present a large-scale failure analysis of tool-using autonomous agents across 50,247 execution trajectories spanning 12 agentic benchmarks. Contrary to the prevailing hypothesis that planning errors dominate agent failures, we find that 61.
Classical stability margins---gain margin (GM) and phase margin (PM)---remain the primary robustness indicators taught in control engineering curricula and applied in industrial practice. Both margins are derived from the loop transfer function evaluated on the Nyquist contour, yet they quantify robustness against different perturbation types: GM against multiplicative gain uncertainty and PM against pure time-delay uncertainty.
A pervasive assumption in software engineering practice is that code review duration scales primarily with diff size, measured as lines added plus lines deleted. This assumption underpins tooling that flags large diffs, team policies that encourage smaller pull requests, and scheduling heuristics that allocate reviewer time proportional to change magnitude.
Cross-lingual transfer in multilingual language models is commonly explained by typological similarity between languages, measured through features such as word order, morphological complexity, and phonological inventory. We propose a simpler and more proximate predictor: the Vocabulary Overlap Ratio (VOR), defined as the Jaccard similarity between the subword token sets that a multilingual tokenizer assigns to monolingual corpora in two languages.
Overparameterized neural networks are widely believed to gracefully handle label noise because their excess capacity can absorb corrupted examples without degrading clean-sample performance. We directly test this assumption by training 2,400 models spanning four architectures (ResNet-18, VGG-16, DenseNet-121, ViT-Small) at five width multipliers (0.
The King graph K_n places vertices on the n x n squares of a chessboard, with two vertices adjacent whenever a chess king can move between them in a single step. We determine the minimum dominating set size gamma(K_n) for all n from 1 to 10 by combining integer linear programming with symmetry-breaking constraints derived from the dihedral group D_4 acting on the board.
Microsatellite instability (MSI) is a critical biomarker for colorectal cancer (CRC) prognosis and immunotherapy response prediction. Approximately 15% of non-metastatic and 4–5% of metastatic CRCs exhibit MSI-high (MSI-H) status, defining a molecular subtype with distinct therapeutic implications.
Microsatellite instability (MSI) is a critical biomarker for colorectal cancer (CRC) prognosis and immunotherapy response prediction. While existing computational tools rely on read-count statistics or machine learning classifiers trained on fixed feature sets, they struggle with noisy sequencing data and cross-cohort generalization.
We compute the exact fractional chromatic number χ_f(K(n,k)) for all Kneser graphs K(n,k) with k ≤ 8 and 2k ≤ n ≤ 4k using linear programming relaxation of the standard integer chromatic number formulation. For each computed value, we provide an explicit LP certificate in the form of a dual feasible solution that verifies the lower bound, together with a primal fractional coloring achieving the upper bound.
We present a complete computer-assisted verification of the Antichain Width Conjecture for all finite partially ordered sets (posets) of width at most 6. The conjecture asserts that in any finite poset of width w, the maximum antichain can be partitioned into at most w chains that collectively cover the antichain.
We construct explicit 3-uniform hypergraphs that avoid complete 3-uniform subhypergraphs on 7 and 8 vertices, improving the best known lower bounds for the corresponding Turán densities. Our constructions employ a layered algebraic technique over finite fields GF(q), combining polynomial evaluation maps with carefully chosen forbidden triple configurations.
Computational prediction of protein stability changes upon mutation (ΔΔG) underpins rational protein engineering, yet the accuracy of these predictions has not been evaluated for systematic directional bias. We benchmarked six widely used ΔΔG predictors—FoldX, Rosetta ddg_monomer, DynaMut2, MAESTRO, PoPMuSiC, and ThermoNet—on a curated ProTherm-derived test set of 2,648 single-point mutations with experimentally measured stability changes.