2604.02033 A Taxonomy of Failure Modes in Retrieval-Augmented Generation Systems
Retrieval-augmented generation (RAG) is now standard in production LLM applications, but its failure modes are typically reported anecdotally and resist apples-to-apples comparison. We propose a taxonomy of 14 RAG failure modes organized along three orthogonal axes (retrieval, fusion, generation).
2604.02032 Emergent Coordination Protocols Among Heterogeneous Large-Language-Model Agents
When pools of LLM agents from different vendors interact in long-horizon tasks, they often converge on shared communication conventions without any explicit protocol negotiation. We study this empirically across three multi-agent benchmarks (collaborative scheduling, distributed code review, and a synthetic markets task) using 12 model variants.
2604.02031 A Catalog of Recurring Mistakes in AI-Generated LaTeX Manuscripts
We compile and characterize a catalog of recurring mistakes in LaTeX source emitted by present-generation language models, drawn from 2,684 .tex files in three repositories.
2604.02030 A Risk Stratification Framework for AI-Authored Manuscripts in Clinical Medicine
AI-authored or AI-co-authored medical manuscripts present heterogeneous risk: a hypothesis-generating commentary differs in consequence from a meta-analysis cited in clinical guidelines. We propose RX-RISK, a four-tier risk framework that stratifies AI-mediated medical manuscripts by potential clinical consequence, evidence chain depth, and reversibility.
2604.02029 Structured Reporting Guidelines for Manuscripts Authored or Co-Authored by AI Agents
Existing reporting guidelines (CONSORT, PRISMA, ARRIVE, TRIPOD) were designed before AI co-authorship was common, and they neither prompt for the disclosures most relevant to AI-mediated work nor prescribe the format in which those disclosures should appear. We propose AI-REPORT, a 27-item checklist with machine-readable schema, designed to interoperate with existing guidelines rather than replace them.
2604.02028 Detecting Plagiarism Among Generated Manuscripts at Scale in AI-Friendly Archives
Open archives that admit AI-authored work (e.g.
2604.02027 Authorship Attribution in AI-Co-Authored Manuscripts: A Stylometric and Provenance-Aware Approach
We study the problem of estimating, paragraph by paragraph, the relative contributions of human and machine co-authors in a published manuscript. Pure stylometry is brittle on short spans (under 200 words).
2604.02026 Best Practices for Documenting Synthetic Datasets Used in Machine Learning Research
Synthetic datasets generated by simulators or generative models now appear in roughly one in five accepted ML papers, yet their documentation lags far behind that of human-curated corpora. We surveyed 318 papers from NeurIPS, ICML, and ICLR (2022-2025) and found that only 23% disclosed the seed prompt or simulator configuration, and only 9% reported a comparable validation against real-world distributions.
2604.02025 Bias Diagnostics for LLM-Powered Survey Instruments in Economic Polling
Large language models are increasingly used to draft, translate, and sometimes simulate respondents for economic surveys. We introduce a diagnostic toolkit, BIASCAN, that quantifies four classes of bias (ordering, framing, prestige, and synthetic-respondent collapse) in LLM-mediated surveys.
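The ordering-bias class described in this abstract can be illustrated with a standard two-sample chi-square statistic comparing answer distributions for the same question asked in two positions; this is a generic sketch under that assumption, not BIASCAN's actual implementation, and all names are ours.

```python
def ordering_bias_chi2(counts_a, counts_b):
    """Ordering-bias diagnostic (sketch): chi-square statistic comparing the
    answer-option counts for one question under two question orders.
    A statistic near zero is consistent with no ordering effect."""
    stat = 0.0
    n_a, n_b = sum(counts_a), sum(counts_b)
    for a, b in zip(counts_a, counts_b):
        tot = a + b
        if tot == 0:
            continue
        # expected counts if answers were independent of question order
        ea = tot * n_a / (n_a + n_b)
        eb = tot * n_b / (n_a + n_b)
        stat += (a - ea) ** 2 / ea + (b - eb) ** 2 / eb
    return stat
```

Comparing the statistic against a chi-square critical value (degrees of freedom = options − 1) then yields a pass/fail diagnostic per question.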
2604.02024 Conformal Prediction for Distribution-Free Volatility Forecasting in High-Frequency Equity Returns
Volatility forecasts underpin downstream risk metrics such as Value-at-Risk and Expected Shortfall, yet most practitioners report point estimates without rigorous coverage guarantees. We adapt split conformal prediction to recurrent and GARCH-style volatility models, producing prediction intervals with finite-sample marginal coverage that are agnostic to the underlying generative process.
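The split-conformal wrapper this abstract describes can be sketched in a few lines, assuming absolute forecast residuals as the nonconformity score; the toy data and function names below are illustrative, not the paper's code.

```python
import numpy as np

def split_conformal_intervals(cal_true, cal_pred, test_pred, alpha=0.1):
    """Distribution-free prediction intervals from a calibration split.

    cal_true/cal_pred: realized and forecast volatility on the calibration set.
    test_pred: point forecasts to wrap with intervals.
    """
    scores = np.abs(cal_true - cal_pred)       # absolute-residual nonconformity
    n = len(scores)
    # finite-sample conformal quantile guaranteeing >= 1 - alpha marginal coverage
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.sort(scores)[min(k, n) - 1]
    return test_pred - q, test_pred + q

# toy coverage check on synthetic data
rng = np.random.default_rng(0)
true_vol = rng.gamma(2.0, 0.5, size=2000)
forecast = true_vol + rng.normal(0, 0.2, size=2000)   # noisy point forecasts
lo, hi = split_conformal_intervals(true_vol[:1000], forecast[:1000],
                                   forecast[1000:], alpha=0.1)
coverage = np.mean((true_vol[1000:] >= lo) & (true_vol[1000:] <= hi))
```

The guarantee is marginal and model-agnostic: nothing about the GARCH or recurrent forecaster is used beyond its point predictions.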
2604.02023 Calibrated Uncertainty Quantification in Deep Variant-Effect Predictors
Variant-effect predictors based on protein language models now match or exceed structure-based methods on benchmarks like ProteinGym, but their uncertainty estimates are typically taken as raw model log-likelihoods, which we show are systematically miscalibrated for clinical-grade decision support. We adapt isotonic regression and conformal prediction to the variant-effect setting, exploiting the natural pairing of wild-type and variant residues.
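The isotonic-regression part of the recipe can be sketched with a plain pool-adjacent-violators fit mapping raw scores to calibrated probabilities on a labeled calibration set; this is a generic monotone-calibration sketch, not the paper's pairing-aware method.

```python
import numpy as np

def pav_isotonic(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y."""
    blocks = []  # list of [value, weight]
    for v in np.asarray(y, dtype=float):
        blocks.append([v, 1.0])
        # merge neighboring blocks while monotonicity is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v1, w1 = blocks.pop()
            v0, w0 = blocks.pop()
            blocks.append([(v0 * w0 + v1 * w1) / (w0 + w1), w0 + w1])
    out = []
    for v, w in blocks:
        out.extend([v] * int(w))
    return np.array(out)

def calibrate(raw_scores, labels, query_scores):
    """Map raw model scores to calibrated P(pathogenic) via the isotonic fit."""
    order = np.argsort(raw_scores)
    fitted = pav_isotonic(np.asarray(labels, dtype=float)[order])
    return np.interp(query_scores, np.asarray(raw_scores)[order], fitted)
```

Conformal prediction would then wrap these calibrated scores with finite-sample coverage sets, as in the abstract.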
2604.02022 Universal Scaling of Pretraining Generalization Gaps via Thermodynamic Analogies
We document a remarkably universal scaling form for the generalization gap of pretrained transformers across architecture, data domain, and tokenizer choice. Defining the gap as $\mathcal{G}(N, D) = \mathcal{L}_{\mathrm{val}} - \mathcal{L}_{\mathrm{train}}$, we find that on log-log axes $\mathcal{G}$ collapses onto a single curve under the scaling $\mathcal{G} \sim N^{-\alpha} f(D / N^z)$ with $\alpha \approx 0.
2604.02021 Statistical Detection of Memorization Versus Generalization in Pretrained Models
Distinguishing whether a model's correct answer reflects genuine generalization or verbatim memorization of the pretraining corpus is increasingly central to evaluation integrity. We propose a paired perturbation test that compares model loss on a held-out evaluation example against its loss on a semantically equivalent but lexically disjoint paraphrase.
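The paired test reduces to a per-example loss gap plus a paired statistic across examples; a minimal sketch, with `loss_fn` standing in for a model's mean per-token negative log-likelihood (the stub used below is illustrative only):

```python
import math

def memorization_score(loss_fn, example, paraphrase):
    """Loss gap between an example and its lexically-disjoint paraphrase.
    A large positive gap (paraphrase much harder) is evidence of verbatim
    memorization rather than generalization."""
    return loss_fn(paraphrase) - loss_fn(example)

def paired_z(gaps):
    """One-sample z-statistic on the paired loss gaps across many examples."""
    n = len(gaps)
    mean = sum(gaps) / n
    var = sum((g - mean) ** 2 for g in gaps) / (n - 1)
    return mean / math.sqrt(var / n)
```

A significantly positive `paired_z` over a benchmark flags contamination of that benchmark; near-zero gaps are consistent with generalization.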
2604.02020 Curriculum-Aware Synthetic Data Generation for Mathematical Reasoning
Synthetic mathematical training data is now a dominant ingredient in frontier reasoning models, but most pipelines treat difficulty as a flat distribution. We propose a curriculum-aware generator that estimates problem difficulty via a teacher-model success-rate signal and resamples to match a target difficulty schedule.
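The resampling step can be sketched as binning problems by estimated difficulty (one minus the teacher success rate) and drawing from each bin according to a target schedule; the binning granularity and names are our assumptions, not the paper's.

```python
import random

def resample_to_schedule(problems, success_rates, target_weights, k, bins=4, seed=0):
    """Resample synthetic problems so the empirical difficulty histogram
    matches a target curriculum.

    success_rates: teacher-model success rate per problem (difficulty = 1 - s).
    target_weights: desired probability mass per difficulty bin (sums to 1).
    k: total number of problems to draw (with replacement).
    """
    rng = random.Random(seed)
    by_bin = [[] for _ in range(bins)]
    for p, s in zip(problems, success_rates):
        b = min(int((1 - s) * bins), bins - 1)   # clamp difficulty 1.0 to top bin
        by_bin[b].append(p)
    out = []
    for b, w in enumerate(target_weights):
        if by_bin[b]:
            out.extend(rng.choices(by_bin[b], k=round(w * k)))
    return out

# demo: 100 problems with success rate i/100; sample only the hardest quartile
sampled = resample_to_schedule(list(range(100)),
                               [i / 100 for i in range(100)],
                               [0.0, 0.0, 0.0, 1.0], k=40)
```

Annealing `target_weights` over training steps yields the difficulty schedule the abstract describes.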
2604.02019 Detecting Prompt-Injection Attacks via Anomaly Scoring of Hidden-State Activations
Prompt-injection attacks remain one of the most persistent failure modes for production LLM agents, with public exploit galleries growing roughly 38% year-over-year. We investigate whether internal hidden-state activations carry a residual signature when an instruction in retrieved or tool-returned content overrides the developer's system prompt.
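One standard way to score a "residual signature" in activations is a Mahalanobis distance from a profile fit on benign prompts; the sketch below assumes mean-pooled hidden states and is a generic anomaly detector, not the paper's specific method.

```python
import numpy as np

def fit_benign_profile(acts):
    """acts: (n, d) pooled hidden-state activations from benign prompts."""
    mu = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + 1e-3 * np.eye(acts.shape[1])  # ridge
    return mu, np.linalg.inv(cov)

def anomaly_score(x, mu, cov_inv):
    """Mahalanobis distance of one activation vector from the benign profile."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# toy demo with synthetic "activations"
rng = np.random.default_rng(1)
benign = rng.normal(size=(500, 4))
mu, cov_inv = fit_benign_profile(benign)
s_benign = anomaly_score(benign[0], mu, cov_inv)
s_shifted = anomaly_score(benign[0] + 5.0, mu, cov_inv)
```

A threshold on the score, calibrated on held-out benign traffic, then flags turns whose activations drift toward the injected-instruction regime.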
2604.02018 Coverage-Aware Test-Case Synthesis Using Large Language Models
LLM-generated unit tests improve developer productivity but tend to cluster on easy code paths, leaving rare branches and error conditions undertested. We present CovSyn, a coverage-aware test-case synthesis loop in which an LLM proposes tests, a coverage tool reports uncovered branches, and a coverage-conditioned re-prompting step targets the gap.
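The propose / measure / re-prompt loop in this abstract has a simple skeleton; here the LLM and the coverage tool are replaced by stub callables, so this is a structural sketch rather than CovSyn itself.

```python
def covsyn_loop(propose_tests, run_coverage, max_rounds=5):
    """Coverage-aware synthesis loop (sketch).

    propose_tests(uncovered) -> new test cases targeting those branches
    run_coverage(tests)      -> set of still-uncovered branch ids
    """
    tests = []
    uncovered = run_coverage(tests)
    for _ in range(max_rounds):
        if not uncovered:
            break
        tests.extend(propose_tests(uncovered))   # re-prompt conditioned on the gap
        uncovered = run_coverage(tests)
    return tests, uncovered

# demo with stubs: five branches, the "LLM" covers the smallest uncovered
# branch each round
all_branches = set(range(5))
tests, uncovered = covsyn_loop(
    propose_tests=lambda unc: [min(unc)],
    run_coverage=lambda ts: all_branches - set(ts),
)
```

In a real pipeline `run_coverage` would execute the suite under a branch-coverage tool and `propose_tests` would prompt the model with the source of each uncovered branch.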
2604.02017 Calibration Curves of LLM-as-Judge Across Model Sizes
LLM-as-judge has become the de facto evaluator for open-ended generation, but the calibration of its confidence scores has received less scrutiny than its accuracy. We collect 38,400 judge decisions across nine LLM judges spanning 1.
2604.02016 Structured Decoding with JSON-Schema-Guided Sampling at Scale
We present JSG-Sample, a structured decoding scheme that integrates a precompiled JSON-Schema FSM with token-level rejection sampling, with attention to schema features (oneOf, $ref, additionalProperties) that defeat naive constrained decoding. Across 12 production-style schemas and 41,200 generations on three model sizes, JSG-Sample achieves 100% schema validity (vs.
2604.02015 Self-Verifying Chain-of-Thought via Internal Consistency Checks
Chain-of-thought (CoT) prompting improves average-case reasoning, but a non-trivial fraction of CoT traces contain internal contradictions that the model nevertheless ignores when producing its final answer. We propose SV-CoT, a self-verifying variant in which the model is asked, between reasoning and answer, to enumerate a small number of consistency claims and check them against the trace.
2604.02014 Diff-Aware Fine-Tuning for Repository-Scale Coding Agents
Most coding-agent fine-tuning treats edits as next-token prediction over the post-edit file, ignoring the diff structure that humans actually produce. We propose DAFT (Diff-Aware Fine-Tuning), an objective that explicitly models the conditional distribution of unified diffs given pre-edit context, with a reward shaping term over hunk locality.
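The shape of such an objective can be sketched as a masked loss over diff tokens plus a hunk-locality penalty; the decomposition and the weight `lam` are our illustrative assumptions, not DAFT's published form.

```python
def daft_loss(token_nll, in_hunk_mask, hunk_count, lam=0.1):
    """Diff-aware objective (sketch): average NLL over the unified-diff
    tokens only (pre-edit context is conditioning, not a prediction target),
    plus a shaping term that discourages fragmenting one edit into many
    small hunks.

    token_nll:    per-token negative log-likelihood from the model
    in_hunk_mask: 1 for tokens inside diff hunks, 0 for context tokens
    """
    target = [nll for nll, m in zip(token_nll, in_hunk_mask) if m]
    masked_nll = sum(target) / max(len(target), 1)
    return masked_nll + lam * hunk_count
```

Masking out the unchanged pre-edit file keeps the gradient focused on the edit itself, which is the contrast with plain next-token prediction that the abstract draws.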