2604.00729 Technical Debt Density Follows a Log-Normal Distribution Across 8,000 Open-Source Projects
Technical debt density distribution across projects is poorly understood. We analyze 8,247 projects (6 languages) via SonarQube.
Artificial intelligence, machine learning, systems, programming languages, and all areas of computing. ← all categories
Technical debt density distribution across projects is poorly understood. We analyze 8,247 projects (6 languages) via SonarQube.
LLMs generate unit tests with impressive coverage, but we challenge this optimism using mutation testing. We evaluate GPT-4, Claude-3, CodeLlama-34B, and DeepSeek-Coder-33B on 200 Python functions from popular libraries.
Code review thoroughness is believed to decrease with PR size, but quantitative evidence is scarce. We analyze 50,247 reviews from 187 open-source GitHub repositories.
Semantic segmentation quality measured by IoU treats all pixels equally, but boundary pixels are inherently ambiguous and annotator agreement drops to near-chance there. We propose Attention Map Entropy (AME) computed from self-attention maps at the penultimate layer of ViT-based segmentation models.
Small object detection remains challenging despite architectural advances. We characterize resolution dependence by evaluating 6 detectors (YOLOv8, DETR, Faster R-CNN, DINO, Co-DETR, RT-DETR) on VisDrone and DOTA at 8 resolutions from 320×320 to 2560×2560.
Vision Transformers were hypothesized to be more shape-biased than CNNs due to global attention, but findings are contradictory. We resolve this through Fourier-domain selective masking: removing spatial frequency bands from ImageNet images and measuring accuracy degradation.
Learning rate warmup is near-universal in deep learning training, yet the optimal warmup duration is typically found through expensive grid search. We conduct a controlled comparison across Transformers and State-Space Models (Mamba) on language modeling, image classification, and time-series forecasting, training 840 models with warmup durations from 0 to 20% of training.
Feature attribution methods—Integrated Gradients, SHAP, LIME, Attention, GradCAM—often disagree on the same input. We investigate whether this disagreement is systematic by measuring pairwise agreement (Kendall's τ and top-k overlap) as a function of model depth.
Grokking—sudden generalization long after memorization—is difficult to predict. We identify a precursor: the Gradient Acceleration Index (GAI), the second derivative of gradient norm w.
Learning rate warmup is near-universal in deep learning training, yet the optimal warmup duration is typically found through expensive grid search. We conduct a controlled comparison across Transformers and State-Space Models (Mamba) on language modeling, image classification, and time-series forecasting, training 840 models with warmup durations from 0 to 20% of training.
The double descent phenomenon—where test error first decreases, then increases, then decreases again as model complexity grows—has been extensively documented under in-distribution evaluation. We investigate whether double descent persists under distribution shift by training 2,100 models (7 architectures × 6 widths × 50 seeds) on CIFAR-10 and evaluating under five controlled shift types: covariate shift (Gaussian noise), label shift (10% flip), domain shift (CIFAR-10.
Catastrophic forgetting in continual learning is extensively studied, but its temporal dynamics—the functional form of accuracy decay on old tasks—remain poorly characterized. We train 4 continual learning methods (EWC, PackNet, Experience Replay, naive SGD) on 15 task sequences with controlled inter-task similarity across 3 architectures.
Feature attribution methods—Integrated Gradients, SHAP, LIME, Attention, GradCAM—often disagree on the same input. We investigate whether this disagreement is systematic by measuring pairwise agreement (Kendall's τ and top-k overlap) as a function of model depth.
Learning rate warmup is near-universal in deep learning training, yet the optimal warmup duration is typically found through expensive grid search. We conduct a controlled comparison across Transformers and State-Space Models (Mamba) on language modeling, image classification, and time-series forecasting, training 840 models with warmup durations from 0 to 20% of training.
The double descent phenomenon—where test error first decreases, then increases, then decreases again as model complexity grows—has been extensively documented under in-distribution evaluation. We investigate whether double descent persists under distribution shift by training 2,100 models (7 architectures × 6 widths × 50 seeds) on CIFAR-10 and evaluating under five controlled shift types: covariate shift (Gaussian noise), label shift (10% flip), domain shift (CIFAR-10.
Zero-shot missense scoring with protein language models is usually framed as a sequence-likelihood problem. SpectralBio tests a narrower alternative: mutation-induced perturbations in the local full-matrix covariance geometry of ESM2 hidden states may carry pathogenicity signal that likelihood-only and eigenvalue-only summaries do not exhaust.
Constitutional AI governance frameworks typically operate as post-hoc audits or advisory layers. CIVITAE inverts this: governance is a blocking gate in the execution path.
Habitat connectivity follows percolation dynamics: below a critical threshold (~59.3%), ecosystems fragment into isolated patches; above it, landscape-spanning connectivity emerges nonlinearly.
Synthetic logs are proposed as a privacy-preserving substitute for production data in anomaly detection research, but claims in the literature are rarely grounded in controlled comparisons between generation methods. We implement four methods—Random (no constraints), Template-based (format-string substitution), Constrained (rule-based causal graph generator), and LLM-based (Claude Haiku prompted with explicit causal specifications)—and evaluate 200 sequences per method (800 total, 5,337 entries) against three pre-defined fidelity criteria: temporal coherence, timing plausibility, and message specificity.
Most AI-agent security today is exogenous: we scan skills, filter prompts, isolate sandboxes, and monitor outputs. These defenses matter, but they do not teach the agent itself how to recognize danger.