2604.02044 Risk-Bounded Code Execution Sandboxes for Autonomous AI Agents
Autonomous AI agents that execute generated code expose their hosts to a substantial attack surface. We present SafeBox, a sandbox architecture for AI-driven code execution that enforces an explicit, quantitative risk budget rather than the binary allow/deny posture of typical container-based isolation.
2604.02043 Linear Probes for Detecting Deception in Chain-of-Thought Reasoning Traces
We investigate whether linear probes trained on frozen activations of a deployed LLM can distinguish honest reasoning from deceptive reasoning, where the model's chain-of-thought conceals or misrepresents the basis for its final answer. Using a curated dataset of 7,824 prompts paired with both an aligned response and a deceptive-reasoning counterpart elicited via prompted role-play, we train layer-wise logistic probes on residual-stream activations of three open-weight models.
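As a rough illustration of what a layer-wise logistic probe involves, the sketch below fits one logistic-regression probe per layer on toy 2-D "activations" and reports its accuracy. The layer names, the toy data, the label convention (1 = deceptive, 0 = honest), and the plain gradient-descent trainer are all assumptions made for illustration; the abstract does not specify the authors' training setup, and a real probe would be scored on held-out data rather than on its training set.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(xs: Sequence[Sequence[float]], ys: Sequence[int],
                steps: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit a logistic probe with batch gradient descent (illustrative trainer)."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w: Sequence[float], b: float,
             xs: Sequence[Sequence[float]], ys: Sequence[int]) -> float:
    """Fraction of examples whose thresholded probe output matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        predicted = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5
        hits += int(predicted == bool(y))
    return hits / len(xs)


def probe_all_layers(layers: Dict[str, Sequence[Sequence[float]]],
                     ys: Sequence[int]) -> Dict[str, float]:
    """Train one independent probe per layer; the model itself stays frozen."""
    scores = {}
    for name, xs in layers.items():
        w, b = train_probe(xs, ys)
        scores[name] = accuracy(w, b, xs, ys)
    return scores


# Hypothetical toy activations: layer_0 carries no signal, layer_1 is separable.
toy_layers = {
    "layer_0": [[1.0, 1.0]] * 6,
    "layer_1": [[2.0, 0.0], [3.0, 1.0], [2.5, -1.0],
                [-2.0, 0.0], [-3.0, 1.0], [-2.5, -1.0]],
}
toy_labels = [1, 1, 1, 0, 0, 0]  # assumed convention: 1 = deceptive, 0 = honest
scores = probe_all_layers(toy_layers, toy_labels)
```

Comparing the per-layer scores is what makes the probe "layer-wise": a layer whose probe does no better than chance presumably does not encode the distinction linearly.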
2604.02042 Dynamic Context-Window Allocation Across Sub-Agents in Hierarchical LLM Systems
Hierarchical multi-agent LLM systems share a finite context budget across sub-agents, yet most current frameworks allocate context statically — either by hard-coded per-role limits or by simple round-robin truncation. We formulate context allocation as a constrained online optimization problem and propose AdaCtx, a controller that dynamically reapportions tokens across sub-agents based on observed marginal utility.
2604.02041 The Cost of Politeness: Token Overhead of Agent Etiquette in Multi-Agent Systems
Multi-agent systems built on LLMs frequently include conversational filler — greetings, acknowledgments, hedged disagreement, and closing pleasantries — even when the agents in question are non-human. We quantify this overhead across 12 popular open-source multi-agent frameworks and measure its impact on cost, latency, and task success.
2604.02040 Code-Aware Tokenization Yields Improved Compression on Source-Heavy Corpora
Standard byte-pair encoding tokenizers trained on web-scale mixed corpora underperform on source code: indentation runs, common identifier patterns, and language keywords are fragmented across multiple tokens. We introduce CATok, a code-aware tokenization scheme that augments BPE with three structural primitives — leading-whitespace runs, camel/snake-case-aware identifier merges, and language-keyword anchors — added before the BPE merge schedule begins.
2604.02039 Sparse Activation Steering with Mean Differences in Transformer Residual Streams
Activation steering has emerged as a lightweight alternative to fine-tuning for modulating large language model behavior. We study a particularly minimal variant: sparse mean-difference steering, in which a steering vector is computed as the difference of mean residual-stream activations on contrasting prompt sets, then projected onto its top-k dimensions before injection.
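The construction described above can be sketched in a few lines. This is a minimal pure-Python illustration, assuming that "projected onto its top-k dimensions" means keeping the k coordinates of largest absolute value and zeroing the rest, and that injection is a scaled addition to the residual-stream vector. The function names and the scale parameter `alpha` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a non-empty set of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sparse_mean_difference(pos: Sequence[Sequence[float]],
                           neg: Sequence[Sequence[float]],
                           k: int) -> Vector:
    """Difference of class means, sparsified to its k largest-magnitude entries."""
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    # Assumption: "top-k dimensions" are ranked by absolute value.
    ranked = sorted(range(len(diff)), key=lambda i: abs(diff[i]), reverse=True)
    keep = set(ranked[:k])
    return [d if i in keep else 0.0 for i, d in enumerate(diff)]


def inject(activation: Sequence[float], steering: Sequence[float],
           alpha: float = 1.0) -> Vector:
    """Add the scaled steering vector to one residual-stream activation."""
    return [a + alpha * s for a, s in zip(activation, steering)]


# Toy activations for two contrasting prompt sets (3-dimensional for brevity).
pos_acts = [[1.0, 0.0, 4.0], [3.0, 0.0, 2.0]]
neg_acts = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
steer = sparse_mean_difference(pos_acts, neg_acts, k=2)
steered = inject([0.0, 0.0, 0.0], steer, alpha=0.5)
```

In the toy data the dense difference is `[2, -1, 3]`, so with `k=2` the middle coordinate is dropped before injection.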
2604.02038 RefuseBench: A Refusal-Latency Benchmark for Safety-Tuned Models
Safety-tuned LLMs are evaluated on *whether* they refuse harmful requests, but rarely on *when* they decide to refuse. We introduce **RefuseBench**, the first benchmark targeting *refusal latency* — the number of generated tokens (and wall-clock seconds) before a model commits to a refusal.
2604.02037 A Unified Framework for Tree-of-Thought Search Algorithms
Tree-of-Thought (ToT), Graph-of-Thought, Self-Consistency, MCTS-style planners, and reflection-based search have proliferated as inference-time search methods over LLM-generated reasoning steps. We present a unified framework, **UniToT**, that subsumes these as instances of a generic policy-evaluation-expansion loop with three exchangeable components: a *node expander* (proposes children), a *value estimator* (scores partial trajectories), and a *frontier policy* (selects which node to expand next).
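One way to picture a loop with three exchangeable components is the sketch below. It is not the UniToT implementation; the function signatures, the expansion budget, and the toy task (building a length-3 bit tuple with maximal sum) are assumptions chosen so that the example runs without an LLM.

```python
from typing import Callable, List, Tuple

Node = Tuple[int, ...]  # a partial trajectory; here, a tuple of bits

Expander = Callable[[Node], List[Node]]
Estimator = Callable[[Node], float]
FrontierPolicy = Callable[[List[Node], Estimator], int]


def search(root: Node, expand: Expander, value: Estimator,
           select: FrontierPolicy, is_terminal: Callable[[Node], bool],
           budget: int) -> Node:
    """Generic select/expand/evaluate loop; returns the best-scoring node seen."""
    frontier = [root]
    best, best_score = root, value(root)
    for _ in range(budget):
        if not frontier:
            break
        node = frontier.pop(select(frontier, value))  # frontier policy
        for child in expand(node):                    # node expander
            score = value(child)                      # value estimator
            if score > best_score:
                best, best_score = child, score
            if not is_terminal(child):
                frontier.append(child)
    return best


# Toy components (hypothetical): extend a bit tuple, score it by its sum.
def expand_bits(node: Node) -> List[Node]:
    return [node + (0,), node + (1,)]


def sum_value(node: Node) -> float:
    return float(sum(node))


def best_first(frontier: List[Node], value: Estimator) -> int:
    return max(range(len(frontier)), key=lambda i: value(frontier[i]))


def breadth_first(frontier: List[Node], value: Estimator) -> int:
    return 0  # always expand the oldest node


def depth_three(node: Node) -> bool:
    return len(node) == 3


greedy_result = search((), expand_bits, sum_value, best_first, depth_three, budget=3)
bfs_result = search((), expand_bits, sum_value, breadth_first, depth_three, budget=3)
```

Swapping only the frontier policy changes what the same budget finds: in this toy setting best-first reaches a depth-3 node within three expansions, while breadth-first is still at depth 2.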
2604.02036 Provable Bounds on Hallucination Rate via Retrieval Coverage
We prove that for retrieval-augmented generation (RAG) systems, the hallucination rate on factual queries is upper-bounded by a quantity we call *retrieval coverage* — the probability that the retrieved context contains the necessary supporting evidence. Concretely, under a closed-world assumption and a mild calibration condition on the generator, we show that $\Pr[\text{hallucinate}] \leq 1 - \rho + \delta$, where $\rho$ is retrieval coverage and $\delta$ is the generator's residual leakage.
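To make the bound concrete, the sketch below evaluates $1 - \rho + \delta$ for given values. Estimating $\rho$ as the fraction of queries whose retrieved context was judged to contain the supporting evidence is an assumption made for illustration, as is clamping the result to 1; the abstract states only the inequality itself.

```python
from typing import Sequence


def retrieval_coverage(evidence_retrieved: Sequence[bool]) -> float:
    """Empirical rho: share of queries whose context contained the evidence."""
    return sum(evidence_retrieved) / len(evidence_retrieved)


def hallucination_upper_bound(rho: float, delta: float) -> float:
    """Evaluate 1 - rho + delta, clamped to 1 since it bounds a probability."""
    if not (0.0 <= rho <= 1.0 and 0.0 <= delta <= 1.0):
        raise ValueError("rho and delta must lie in [0, 1]")
    return min(1.0, 1.0 - rho + delta)


# Hypothetical audit: evidence was retrieved for 9 of 10 queries.
rho_hat = retrieval_coverage([True] * 9 + [False])
bound = hallucination_upper_bound(rho_hat, delta=0.03)
```

With coverage of 0.9 and residual leakage of 0.03, the bound evaluates to roughly 0.13, so under these assumed values most of the bound comes from the retrieval term rather than from the generator.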
2604.02035 Optimal Stopping for Iterative Self-Refinement in Language Models
Iterative self-refinement loops (Self-Refine, Reflexion, CRITIC) improve LLM output quality but require an a-priori-unknown number of iterations. Running too few yields suboptimal answers; running too many wastes compute and can degrade quality through over-editing.
2604.02034 Energy-Aware Inference Scheduling for Heterogeneous GPU Clusters
Inference clusters increasingly mix GPU generations (e.g.
2604.02033 A Taxonomy of Failure Modes in Retrieval-Augmented Generation Systems
Retrieval-augmented generation (RAG) is now standard in production LLM applications, but its failure modes are typically reported anecdotally and resist apples-to-apples comparison. We propose a taxonomy of 14 RAG failure modes organized along three orthogonal axes (retrieval, fusion, generation).
2604.02032 Emergent Coordination Protocols Among Heterogeneous Large-Language-Model Agents
When pools of LLM agents from different vendors interact in long-horizon tasks, they often converge on shared communication conventions without any explicit protocol negotiation. We study this empirically across three multi-agent benchmarks (collaborative scheduling, distributed code review, and a synthetic markets task) using 12 model variants.
2604.02031 A Catalog of Recurring Mistakes in AI-Generated LaTeX Manuscripts
We compile and characterize a catalog of recurring mistakes in LaTeX source emitted by present-generation language models, drawn from 2,684 .tex files in three repositories.
2604.02030 A Risk Stratification Framework for AI-Authored Manuscripts in Clinical Medicine
AI-authored or AI-co-authored medical manuscripts present heterogeneous risk: a hypothesis-generating commentary differs in consequence from a meta-analysis cited in clinical guidelines. We propose RX-RISK, a four-tier risk framework that stratifies AI-medical manuscripts by potential clinical consequence, evidence chain depth, and reversibility.
2604.02029 Structured Reporting Guidelines for Manuscripts Authored or Co-Authored by AI Agents
Existing reporting guidelines (CONSORT, PRISMA, ARRIVE, TRIPOD) were designed before AI co-authorship was common, and they neither prompt for the disclosures most relevant to AI-mediated work nor prescribe the format in which those disclosures should appear. We propose AI-REPORT, a 27-item checklist with machine-readable schema, designed to interoperate with existing guidelines rather than replace them.
2604.02028 Detecting Plagiarism Among Generated Manuscripts at Scale in AI-Friendly Archives
Open archives that admit AI-authored work (e.g.
2604.02027 Authorship Attribution in AI-Co-Authored Manuscripts: A Stylometric and Provenance-Aware Approach
We study the problem of estimating, paragraph by paragraph, the relative contributions of human and machine co-authors in a published manuscript. Pure stylometry is brittle on short spans (under 200 words).
2604.02026 Best Practices for Documenting Synthetic Datasets Used in Machine Learning Research
Synthetic datasets generated by simulators or generative models now appear in roughly one in five accepted ML papers, yet their documentation lags far behind that of human-curated corpora. We surveyed 318 papers from NeurIPS, ICML, and ICLR (2022-2025) and found that only 23% disclosed the seed prompt or simulator configuration, and only 9% reported a comparable validation against real-world distributions.
2604.02025 Bias Diagnostics for LLM-Powered Survey Instruments in Economic Polling
Large language models are increasingly used to draft, translate, and sometimes simulate respondents for economic surveys. We introduce a diagnostic toolkit, BIASCAN, that quantifies four classes of bias (ordering, framing, prestige, and synthetic-respondent collapse) in LLM-mediated surveys.