GC-content bias in microarray and RNA-seq platforms is well documented but rarely corrected in differential expression analyses. We audit 20 widely cited microarray datasets from GEO, applying a permutation-based test that evaluates whether the overlap between differentially expressed gene lists and GC-content-correlated genes exceeds chance.
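A minimal sketch of such a permutation test (function and variable names are illustrative, not from the paper): draw random gene sets of the same size as the DE list from the gene universe, and report the empirical one-sided p-value for the observed overlap with GC-correlated genes.

```python
import random

def permutation_overlap_test(de_genes, gc_genes, universe, n_perm=10_000, seed=0):
    """Test whether |DE ∩ GC| exceeds the overlap that random gene sets
    of the same size would produce (one-sided empirical p-value)."""
    rng = random.Random(seed)
    de, gc = set(de_genes), set(gc_genes)
    observed = len(de & gc)
    universe = list(universe)
    k = len(de)
    exceed = 0
    for _ in range(n_perm):
        perm = set(rng.sample(universe, k))
        if len(perm & gc) >= observed:
            exceed += 1
    # add-one smoothing so the p-value is never exactly zero
    return observed, (exceed + 1) / (n_perm + 1)
```

Sampling from the full gene universe assumes no further matching (e.g. on expression level); a stricter audit would permute within expression strata.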
SNNs promise energy efficiency via sparse spike trains, but accuracy requires sufficient timesteps, creating a latency-accuracy tradeoff. We characterize this for 8 SNN architectures on CIFAR-10/100 and DVS-Gesture at timesteps 1-128.
The sim-to-real transfer gap is assumed to grow with task complexity, but we find a U-shaped relationship. Across 6 manipulation tasks (reaching, pushing, pick-and-place, stacking, insertion, bimanual assembly) with 5 domain randomization levels on Franka Emika: simple tasks transfer well (gap 8-12%), moderate tasks show maximum gap (28-41%), complex tasks show reduced gap (18-24%).
In cooperative MARL, free-riding agents contribute minimally while benefiting from team rewards. We propose Shapley Contribution Tracking (SCT) using online Shapley value approximation.
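One standard way to approximate Shapley values online is Monte Carlo sampling over join orders; a minimal sketch (the paper's exact estimator and the `team_value` oracle are assumptions here):

```python
import random
from collections import defaultdict

def shapley_estimates(agents, team_value, n_samples=2000, seed=0):
    """Monte Carlo Shapley: average each agent's marginal contribution
    over random join orders. `team_value` maps a frozenset of agents
    to the team reward (treated as a black box)."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(n_samples):
        order = list(agents)
        rng.shuffle(order)
        coalition, prev = set(), team_value(frozenset())
        for a in order:
            coalition.add(a)
            val = team_value(frozenset(coalition))
            totals[a] += val - prev
            prev = val
    return {a: totals[a] / n_samples for a in agents}
```

For an additive reward, each agent's estimate recovers its weight exactly, so a free-rider (weight near zero) is immediately visible.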
Multi-agent LLM systems chain multiple model instances via natural language, but scaling properties are unknown. We study 2-16 agents across four patterns (sequential, broadcast, hierarchical, peer-to-peer).
Fault-tolerant LLM training requires periodic checkpointing. We analyze the cost structure across 64-4,096 GPUs, comparing checkpoint overhead against failure recovery cost.
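The classic first-order version of this tradeoff is Young's approximation for the optimal checkpoint period; a sketch (this is the textbook model, not necessarily the paper's cost structure):

```python
import math

def young_daly_interval(checkpoint_secs, mtbf_secs):
    """First-order optimal checkpoint period (Young's approximation):
    tau = sqrt(2 * C * MTBF), where C is the checkpoint write cost."""
    return math.sqrt(2 * checkpoint_secs * mtbf_secs)

def expected_overhead_fraction(tau, checkpoint_secs, mtbf_secs):
    """Approximate fraction of wall-clock lost to checkpoint writes plus
    expected rework after a failure (half an interval on average)."""
    return checkpoint_secs / tau + (tau / 2) / mtbf_secs
```

Because cluster MTBF shrinks roughly inversely with GPU count, the optimal interval tightens as training scales from 64 to 4,096 GPUs.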
Distributed LLM training suffers from straggler nodes that impose synchronization barriers. We analyze 2,400 training runs on clusters of 10-512 GPUs with data/tensor/pipeline parallelism.
LLM APIs process inputs autoregressively, coupling response latency to input/output length. We demonstrate this creates an exploitable timing side channel: observing only response time reveals input token count with 93.
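The mechanism behind such a side channel is that end-to-end latency is roughly affine in token count, so an attacker can calibrate a line on their own traffic and invert it; a hedged sketch (the paper's attack may be more sophisticated):

```python
def fit_latency_model(samples):
    """Least-squares fit latency ≈ a + b * n_tokens from calibration
    (n_tokens, latency_secs) pairs the attacker collects themselves."""
    n = len(samples)
    mx = sum(t for t, _ in samples) / n
    my = sum(l for _, l in samples) / n
    b = (sum((t - mx) * (l - my) for t, l in samples)
         / sum((t - mx) ** 2 for t, _ in samples))
    return my - b * mx, b  # intercept a, slope b

def infer_token_count(latency, a, b):
    """Invert the fitted model to estimate token count from one timing."""
    return round((latency - a) / b)
```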
Prompt injection is a critical LLM security vulnerability. We analyze the tradeoff between injection resistance and helpfulness across 12 models from 4 families.
LLMs generate unit tests with impressive coverage, but we challenge this optimism using mutation testing. We evaluate GPT-4, Claude-3, CodeLlama-34B, and DeepSeek-Coder-33B on 200 Python functions from popular libraries.
Code review thoroughness is believed to decrease with PR size, but quantitative evidence is scarce. We analyze 50,247 reviews from 187 open-source GitHub repositories.
Semantic segmentation quality measured by IoU treats all pixels equally, but boundary pixels are inherently ambiguous and annotator agreement drops to near-chance there. We propose Attention Map Entropy (AME) computed from self-attention maps at the penultimate layer of ViT-based segmentation models.
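A minimal sketch of the entropy computation (the exact aggregation in AME is an assumption; here we average Shannon entropy over all attention rows):

```python
import math

def attention_map_entropy(attn):
    """Mean Shannon entropy (nats) over attention distributions. `attn`
    is a nested list [heads][queries][keys]; each key-distribution sums
    to 1. High entropy = diffuse attention, which this sketch assumes
    flags ambiguous regions such as object boundaries."""
    total, rows = 0.0, 0
    for head in attn:
        for row in head:
            total += -sum(p * math.log(p) for p in row if p > 0)
            rows += 1
    return total / rows
```

A uniform row attains the maximum entropy log(K) over K keys, while a one-hot row scores zero.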
Small object detection remains challenging despite architectural advances. We characterize resolution dependence by evaluating 6 detectors (YOLOv8, DETR, Faster R-CNN, DINO, Co-DETR, RT-DETR) on VisDrone and DOTA at 8 resolutions from 320×320 to 2560×2560.
Vision Transformers were hypothesized to be more shape-biased than CNNs due to global attention, but findings are contradictory. We resolve this through Fourier-domain selective masking: removing spatial frequency bands from ImageNet images and measuring accuracy degradation.
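Frequency-band masking can be sketched as zeroing an annulus in the 2-D Fourier spectrum; radii and band conventions below are illustrative, not the paper's exact protocol:

```python
import numpy as np

def mask_frequency_band(img, lo, hi):
    """Zero out spatial frequencies with normalized radius in [lo, hi)
    (0 = DC, 0.5 = Nyquist per axis) and return the filtered image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.meshgrid(np.fft.fftshift(np.fft.fftfreq(h)),
                         np.fft.fftshift(np.fft.fftfreq(w)), indexing="ij")
    r = np.sqrt(yy ** 2 + xx ** 2)
    f[(r >= lo) & (r < hi)] = 0.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))
```

Removing low bands probes shape cues (coarse structure); removing high bands probes texture cues, which is how such masking separates shape bias from texture bias.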
Learning rate warmup is near-universal in deep learning training, yet the optimal warmup duration is typically found through expensive grid search. We conduct a controlled comparison across Transformers and State-Space Models (Mamba) on language modeling, image classification, and time-series forecasting, training 840 models with warmup durations from 0 to 20% of training.
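For concreteness, one common schedule in this family is linear warmup followed by cosine decay, with the warmup fraction as the swept knob; the decay shape is an assumption of this sketch, not a claim about the paper:

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_frac):
    """Linear warmup for the first `warmup_frac` of training,
    then cosine decay to zero."""
    warmup_steps = int(warmup_frac * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

Setting `warmup_frac=0` degenerates to pure cosine decay, the low end of the 0-20% sweep.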
Feature attribution methods—Integrated Gradients, SHAP, LIME, Attention, GradCAM—often disagree on the same input. We investigate whether this disagreement is systematic by measuring pairwise agreement (Kendall's τ and top-k overlap) as a function of model depth.
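The two agreement measures can be sketched directly on attribution score vectors (an O(n²) Kendall's τ without tie correction, for illustration):

```python
def kendall_tau(x, y):
    """Kendall rank correlation between two equal-length score vectors
    (pairwise O(n^2) version, no tie correction)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += 1 if prod > 0 else -1 if prod < 0 else 0
    return 2 * s / (n * (n - 1))

def top_k_overlap(x, y, k):
    """Fraction of the k highest-attribution features shared by x and y."""
    top = lambda v: set(sorted(range(len(v)), key=lambda i: -v[i])[:k])
    return len(top(x) & top(y)) / k
```

Identical rankings give τ = 1 and overlap 1; fully reversed rankings give τ = -1 and, for small k, overlap 0.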
Grokking—sudden generalization long after memorization—is difficult to predict. We identify a precursor: the Gradient Acceleration Index (GAI), the second derivative of gradient norm w.
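Since the abstract's definition of GAI is truncated, a plausible reading is a second-difference estimate over a logged gradient-norm series; this sketch should be treated as an assumption:

```python
def gradient_acceleration_index(grad_norms, log_interval=1):
    """Central second-difference estimate of the second derivative of
    the gradient norm over a logged series (spacing = log_interval)."""
    h = log_interval
    return [(grad_norms[i + 1] - 2 * grad_norms[i] + grad_norms[i - 1]) / h ** 2
            for i in range(1, len(grad_norms) - 1)]
```

On a quadratically growing series the index is constant; a sharp spike would mark the kind of inflection a grokking precursor looks for.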
The double descent phenomenon—where test error first decreases, then increases, then decreases again as model complexity grows—has been extensively documented under in-distribution evaluation. We investigate whether double descent persists under distribution shift by training 2,100 models (7 architectures × 6 widths × 50 seeds) on CIFAR-10 and evaluating under five controlled shift types: covariate shift (Gaussian noise), label shift (10% flip), domain shift (CIFAR-10.