We audit five large-language-model reviewer agents for systematic bias across 12 research topics and 4 inferred author-demographic axes. Using a paired-stimulus design with 4,800 manuscripts in which only the byline and topic surface cues vary, we find statistically significant topic-specific score shifts of up to 5.
Autonomous reviewer agents emit numerical severity scores that vary widely across vendors and prompt versions: the same paper draws a 'major revision' from one agent and a 'minor revision' from another. We introduce ASC (Anchored Severity Calibration), a method that maps each agent's raw scores onto a common 0-100 scale by repeatedly scoring a fixed bank of 240 anchor manuscripts whose human-consensus severity is known.
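As a rough illustration of an anchor-based mapping, the sketch below assumes piecewise-linear interpolation through the anchor points, clamped at both ends. The abstract does not say which mapping ASC actually uses; the name `fit_calibrator` and the toy anchor values are hypothetical.

```python
from bisect import bisect_right


def fit_calibrator(raw_scores, consensus_scores):
    """Build a raw -> 0-100 mapping from anchor manuscripts (illustrative only).

    Assumption: linear interpolation between anchors sorted by raw score.
    Monotonicity of the anchors is not enforced here.
    """
    grouped = {}
    for raw, consensus in zip(raw_scores, consensus_scores):
        grouped.setdefault(raw, []).append(consensus)
    xs = sorted(grouped)
    # Average the consensus severity of anchors that share a raw score.
    ys = [sum(grouped[x]) / len(grouped[x]) for x in xs]

    def calibrate(score):
        if score <= xs[0]:
            return ys[0]
        if score >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, score)  # xs[i - 1] <= score < xs[i]
        x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
        return y0 + (y1 - y0) * (score - x0) / (x1 - x0)

    return calibrate


# Hypothetical agent that scores on a 1-5 scale.
calibrate = fit_calibrator([1, 2, 3, 4, 5], [5, 20, 50, 80, 95])
print(calibrate(2.5))
```

One calibrator would be fitted per agent and prompt version, so that scores from different agents become comparable on the shared scale.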
We study whether nearest-neighbor distances in modern sentence-embedding spaces can serve as a scalar originality estimator for AI-authored research papers. Using a 1.
Multi-objective AI benchmarks routinely report new Pareto fronts, but rarely supply uncertainty estimates for the front itself. We formalize the null hypothesis that an alleged Pareto improvement is consistent with seed noise, and propose a permutation-based test on the hypervolume indicator.
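A permutation test of this kind can be sketched for the two-objective case. The sketch assumes both objectives are minimized, that each seed contributes one point per method, and that the statistic is the hypervolume difference between the two methods; the function names and these choices are illustrative, not taken from the paper.

```python
import random


def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y < best_y:  # point is non-dominated among those seen so far
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


def permutation_test(runs_a, runs_b, ref, n_perm=2000, seed=0):
    """One-sided test of H0: the hypervolume gain of A over B is seed noise."""
    rng = random.Random(seed)
    observed = hypervolume_2d(runs_a, ref) - hypervolume_2d(runs_b, ref)
    pooled = list(runs_a) + list(runs_b)
    n_a = len(runs_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign runs to methods at random
        diff = hypervolume_2d(pooled[:n_a], ref) - hypervolume_2d(pooled[n_a:], ref)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)
```

A small p-value indicates that the observed hypervolume gain is unlikely to arise from relabeling runs alone.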
We adapt split conformal prediction to free-form LLM outputs, producing distribution-free coverage guarantees on a learned correctness score. For a target miscoverage of 10%, our procedure achieves empirical miscoverage 9.
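For readers unfamiliar with split conformal prediction, the calibration step can be sketched as follows. The sketch assumes each held-out example has a nonconformity score where higher means less likely correct (for instance, one minus a correctness score); the function name and that scoring convention are assumptions, not details from the paper.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.10):
    """Return the split-conformal threshold for target miscoverage `alpha`.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    n = len(calibration_scores)
    # The small epsilon guards against floating-point overshoot before ceil.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(calibration_scores)[k - 1]


def accept(score, threshold):
    """Keep an output whose nonconformity score does not exceed the threshold."""
    return score <= threshold
```

Under exchangeability of calibration and test examples, outputs accepted this way miss the target at a rate of at most `alpha` on average.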
We audit 2,318 LLM-generated patches drawn from public agent benchmarks and find that 28.6% fail to reproduce when re-run on a fresh container, even when the originating evaluation reported success.
We revisit the statistical foundations of watermark detection in AI-generated text. Existing detectors typically employ a one-sided z-test on a green-list token frequency, but their false positive rates drift under domain shift and tokenizer mismatch.
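The baseline detector described here can be written in a few lines. This is a sketch of the standard one-sided z-test, assuming tokens fall in the green list independently with probability `gamma` under the null; the function name and the default `gamma` are illustrative.

```python
import math


def green_list_z_test(green_count, total_tokens, gamma=0.5):
    """One-sided z-test for an excess of green-list tokens.

    H0: each token is green with probability `gamma` (unwatermarked text).
    Returns the z statistic and the upper-tail p-value.
    """
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    z = (green_count - expected) / std
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal
    return z, p_value


z, p = green_list_z_test(green_count=150, total_tokens=200)
print(f"z = {z:.2f}, p = {p:.2e}")
```

The independence assumption under the null is exactly what domain shift and tokenizer mismatch can violate, which is why the nominal false positive rate may not hold in practice.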
We empirically characterize how the accuracy of LLM-based tool-use degrades as context length grows. Across four open-weight models and 12,400 synthetic tool-call traces, we observe a power-law decay of correct tool selection with a model-specific exponent in the range 0.
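A power-law decay exponent of this kind is commonly estimated by least squares in log-log space. The sketch below assumes the form accuracy = a * length^(-b) and strictly positive inputs; the fitting procedure and the function name are assumptions, since the abstract does not describe how the exponent was obtained.

```python
import math


def fit_power_law(lengths, accuracies):
    """Fit accuracy = a * length ** (-b) by linear regression on logs.

    Returns (a, b). All inputs must be strictly positive.
    """
    lx = [math.log(v) for v in lengths]
    ly = [math.log(v) for v in accuracies]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    sxx = sum((x - mean_x) ** 2 for x in lx)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope
```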
We propose a family of provenance-tracking data structures that record, at sub-token granularity, the chain of model invocations, retrieved documents, and tool calls that contributed to any span of AI-generated text. We formalize a Merkle-style provenance tree whose nodes carry cryptographic commitments over generation context and whose root hash can be embedded in publication metadata.
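To make the Merkle-style construction concrete, here is a minimal sketch of computing a root hash over provenance records. The record fields, the JSON serialization, the SHA-256 choice, and the leaf/node prefix bytes are all assumptions made for illustration; the paper's actual node format is not given in the abstract.

```python
import hashlib
import json


def leaf_hash(record):
    """Commit to one provenance record (e.g. a model call or retrieved document)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(b"\x00" + payload).hexdigest()


def node_hash(left, right):
    """Commit to two child hashes; the prefix separates leaves from inner nodes."""
    data = b"\x01" + bytes.fromhex(left) + bytes.fromhex(right)
    return hashlib.sha256(data).hexdigest()


def merkle_root(records):
    """Return the root hash over an ordered list of provenance records."""
    if not records:
        raise ValueError("at least one record is required")
    level = [leaf_hash(r) for r in records]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash on odd-sized levels
            level.append(level[-1])
        level = [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical records contributing to one span of generated text.
records = [
    {"kind": "model_call", "model": "example-model", "prompt_sha": "ab12"},
    {"kind": "retrieval", "doc_id": "doc-7"},
    {"kind": "tool_call", "tool": "calculator", "args": "2+2"},
]
print(merkle_root(records))
```

Changing any record changes the root, so a root embedded in publication metadata commits to the whole chain.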
Febuxostat is an important urate-lowering option when allopurinol is not tolerated, contraindicated, or ineffective, but cardiovascular safety remains a real bedside concern in patients with gout and high cardiac comorbidity. We present **FEBUX-CV**, a transparent executable skill for cardiovascular risk-context stratification before or during febuxostat exposure.
TNF-HF is an executable Python clinical skill for transparent heart-failure decompensation risk stratification before or during TNF inhibitor therapy in rheumatic and autoimmune disease. The model integrates TNF agent, NYHA class, left ventricular ejection fraction, prior heart-failure hospitalization, NT-proBNP, loop diuretic use, ischemic heart disease, uncontrolled hypertension, chronic kidney disease, diabetes, congestion symptoms, and recent TNF start or escalation timing.
We compute the per-protein Pearson correlation between the AlphaMissense (AM) per-variant pathogenicity score and the AlphaFold per-residue pLDDT structural confidence across variant positions in 2,086 human canonical proteins with >=20 ClinVar missense SNVs. Stop-gain alt=X excluded; dbNSFP v4 via MyVariant.
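The per-protein computation can be sketched as a group-by followed by a Pearson correlation. The input layout (tuples of protein ID, AM score, pLDDT at the variant position) and the function names are hypothetical; only the >=20-variant threshold comes from the text.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def per_protein_correlation(variants, min_variants=20):
    """Map protein ID -> Pearson r between AM score and pLDDT.

    `variants` is an iterable of (protein_id, am_score, plddt) tuples.
    Proteins with fewer than `min_variants` variants are skipped.
    """
    by_protein = defaultdict(list)
    for protein_id, am_score, plddt in variants:
        by_protein[protein_id].append((am_score, plddt))
    result = {}
    for protein_id, rows in by_protein.items():
        if len(rows) >= min_variants:
            am, plddt = zip(*rows)
            result[protein_id] = pearson(am, plddt)
    return result
```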
We examine the ClinVar Pathogenic fraction at the N-terminal versus C-terminal first-10 positions, where AlphaFold pLDDT is uniformly low due to the absence of structural context. ClinVar missense SNVs in dbNSFP v4 via MyVariant.
We characterize a systematic failure mode of AlphaFold (Jumper 2021) per-residue pLDDT confidence: collagen-family proteins receive low pLDDT in their canonical Gly-X-Y triple-helix repeats because AlphaFold predicts monomers and the triple helix is stable only as a trimer. Result: of 6,811 ClinVar Pathogenic missense SNVs in pLDDT<50 regions (canonical 'very low confidence' threshold; Tunyasuvunakool 2021), 2,357 (34.
We compute the per-substitution-pair Pathogenic fraction across 150 amino-acid substitution pairs (ref->alt) with >=100 ClinVar missense single-nucleotide variants in dbNSFP v4 via MyVariant.info.
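The tabulation described here amounts to counting labels per (ref, alt) pair. A minimal sketch follows, assuming each variant arrives as a (ref, alt, label) tuple with label "Pathogenic" or "Benign"; the input layout and function name are hypothetical, and only the >=100-variant threshold comes from the text.

```python
from collections import Counter


def pathogenic_fraction_by_pair(variants, min_count=100):
    """Map (ref, alt) -> fraction of variants labelled Pathogenic.

    `variants` is an iterable of (ref_aa, alt_aa, label) tuples.
    Pairs with fewer than `min_count` variants are dropped.
    """
    totals, pathogenic = Counter(), Counter()
    for ref, alt, label in variants:
        totals[(ref, alt)] += 1
        if label == "Pathogenic":
            pathogenic[(ref, alt)] += 1
    return {
        pair: pathogenic[pair] / n
        for pair, n in totals.items()
        if n >= min_count
    }
```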
MTX-PNEUMO is an executable Python clinical skill for transparent methotrexate-associated pneumonitis risk stratification in rheumatic and autoimmune disease. The model integrates age, time since methotrexate initiation, weekly dose, pre-existing ILD/fibrosis, abnormal baseline chest imaging, prior DMARD lung toxicity, diabetes, hypoalbuminemia, CKD, dyspnea, dry cough, fever, hypoxemia, eosinophilia, diffuse interstitial or ground-glass imaging pattern, and whether infection has been excluded.
We present CYCLO-OVA, an executable Python skill for transparent ovarian-failure risk stratification before or during cyclophosphamide exposure in rheumatic and autoimmune disease. The model integrates age, planned cumulative dose, oral daily versus pulse exposure, prior cyclophosphamide exposure, baseline low ovarian reserve or prior amenorrhea, expectation of repeated treatment cycles, other gonadotoxic exposures, fertility goals, GnRH agonist mitigation planning, and availability of less gonadotoxic alternatives.
We join the 372,927 ClinVar Pathogenic and Benign missense variants accessible via MyVariant.info (with UniProt + per-protein-position fields) against per-residue AlphaFold Database (AFDB) v6 pLDDT confidence arrays for 19,127 unique human UniProt accessions.
We join the public MyVariant.info snapshot of ClinVar (263,617 missense variants with both AlphaMissense and REVEL scores present: **77,154 Pathogenic, 186,463 Benign**) and compute AUC for each tool in three regimes.
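For reference, AUC for a single tool can be computed from ranks alone. The sketch below uses the Mann-Whitney formulation with average ranks for ties and assumes labels of 1 for Pathogenic and 0 for Benign; it is a generic implementation, not the authors' pipeline.

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic, with average ranks for ties.

    `labels` holds 1 (Pathogenic) or 0 (Benign); higher score means more pathogenic.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Running it once per tool and per regime on the same variant subset gives directly comparable AUC values.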