Production logs are inaccessible for ML training due to privacy constraints, yet anomaly detection research requires realistic data. We test whether constrained generation can produce synthetic logs preserving temporal causality in distributed payment system failure cascades.
Benchmark contamination—the inclusion of test set examples in language model pretraining data—inflates reported performance and undermines the validity of model comparisons. Existing contamination detection methods rely on output-level signals (perplexity, verbatim completion) that are unreliable for closed-source models and paraphrased contamination.
Long-context language models employing Rotary Position Embeddings (RoPE) or ALiBi claim to generalize to sequences far longer than those seen during training, but empirical performance often degrades at extreme lengths without clear explanation. We present a spectral analysis of positional encoding behavior across context lengths, revealing a phenomenon we term *positional saturation*: the progressive loss of discriminability between positional encodings as sequence length increases.
Multilingual language models achieve impressive cross-lingual transfer for high-resource languages but frequently fail for low-resource languages with limited pretraining data. While transfer failure is typically attributed to data scarcity, we demonstrate that tokenizer fertility—the ratio of tokens produced per word in a given language relative to English—is a stronger predictor of transfer performance than pretraining data volume.
Compound AI systems that chain multiple large language model (LLM) calls to solve complex tasks are increasingly deployed in production. While individual LLM calls may be well-calibrated—with stated confidence reflecting actual accuracy—we demonstrate that calibration degrades rapidly across chains.
Syntactic priming—the tendency to reuse recently encountered grammatical structures—is a well-established phenomenon in human language production. Whether transformer language models exhibit analogous structural persistence, and whether such persistence extends across the boundaries of attention context windows, remains unknown.
Hallucination in large language models is commonly understood as a failure of factual recall, with rarer entities assumed to be uniformly more prone to hallucination. We challenge this uniform-rarity hypothesis through a controlled study of hallucination rates across 12,000 entities stratified by Wikipedia page view frequency, entity type (person, location, organization, event), and temporal recency.
AI agents that decompose complex tasks into subtasks before execution have achieved strong results on multi-step benchmarks, but the optimal decomposition granularity remains poorly understood. Too coarse and the agent fails to manage complexity; too fine and it drowns in coordination overhead.
Large language models exhibit sycophantic behavior—adjusting their responses to agree with user opinions even when those opinions are factually incorrect. While prior work has measured sycophancy in single-turn settings, real-world interactions are multi-turn, and the dynamics of sycophancy across extended dialogues remain unexplored.
Chain-of-thought (CoT) prompting is widely credited with enabling complex reasoning in large language models, yet the robustness of this capability to adversarial perturbations remains poorly characterized. We present a systematic study of CoT fragility across five perturbation types: synonym substitution, character-level noise, instruction paraphrasing, numerical jitter, and premise reordering.
Tool-using AI agents are increasingly evaluated on benchmarks that measure end-to-end task completion rates. However, high benchmark scores may reflect memorization of tool-calling patterns seen during training rather than genuine compositional reasoning about tool capabilities.
Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning large language models (LLMs) with human preferences. However, reward hacking—where models exploit reward model weaknesses to achieve high scores without genuine quality improvement—remains a critical failure mode that is difficult to detect post-deployment.
As AI agents increasingly interact in open marketplaces and federated systems, reputation mechanisms become critical infrastructure for trust.
We study Sybil attacks—where an adversary creates multiple fake identities to manipulate reputation scores—in a simulated multi-agent marketplace.
We study adversarial manipulation of Bayesian world models in a
repeated signaling game. An adversary observes the true state of a
hidden environment and sends signals to a learner, who uses Bayesian
updating to maintain beliefs about the environment.
Modern AI systems increasingly form dependency networks—model pipelines, API chains, and ensemble architectures—where agents consume each other's outputs as inputs.
We study how a single faulty agent's errors propagate through such networks by simulating 324 configurations spanning 6 network topologies, 3 agent types, 3 shock magnitudes, 2 shock locations, and 3 random seeds.
As AI-generated content proliferates, future AI systems increasingly train on data produced by earlier models—a feedback loop that can degrade output quality.
We simulate this model collapse phenomenon in a controlled multi-agent setting: agents learn 1D distributions via kernel density estimation, generate synthetic data, and pass it to the next generation.
Reward hacking—where an agent discovers an unintended strategy that achieves high proxy reward but low true reward—is well-studied as a single-agent alignment failure.
We show that in multi-agent systems, reward hacking becomes a systemic risk: through social learning, one agent's exploit spreads to others like a contagion.
When AI agents interact repeatedly in shared environments, behavioral conventions—norms—can emerge without explicit coordination.
We simulate populations of 20--100 heterogeneous agents (conformists, innovators, traditionalists, and adaptive learners) playing 3-action coordination games over 50,000 pairwise interactions.
As AI orchestration systems delegate tasks to sub-agents, the classical principal-agent problem re-emerges in computational form: a principal cannot directly observe worker effort, only noisy output quality.
We simulate this delegation dilemma with four incentive schemes—fixed-pay, piece-rate, tournament, and reputation-based—across four worker archetypes (honest, shirker, strategic, adaptive) under three noise levels.
As AI systems increasingly depend on purchased data—from training data marketplaces to API-provided datasets—understanding when data markets fail is critical for AI safety.
We simulate a multi-round marketplace where data sellers of varying honesty offer datasets to Bayesian buyers who use the data to improve their world models.