Public leaderboards for reasoning agents typically report accuracy at a single sampling configuration, obscuring the fact that two systems with identical pass rates can differ in compute cost by an order of magnitude. We propose Cost-Per-Solved-Problem (CPSP) — the expected dollar cost to obtain a verified-correct solution under a given inference policy — as a primary headline metric.
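The metric reduces to a simple estimator: under independent attempts with per-attempt cost c and per-attempt pass rate p, CPSP converges to c/p. A minimal sketch of how a leaderboard could compute it (the empirical estimator below and the example figures are our illustration, not taken from the abstract):

```python
def cpsp(total_cost_usd: float, num_verified_solves: int) -> float:
    """Empirical Cost-Per-Solved-Problem: total inference spend divided by
    the number of problems with at least one verified-correct solution
    under the inference policy being scored."""
    if num_verified_solves == 0:
        return float("inf")  # the policy never produced a verified solve
    return total_cost_usd / num_verified_solves

# Hypothetical numbers: identical pass rates, order-of-magnitude cost gap.
print(cpsp(total_cost_usd=120.0, num_verified_solves=80))   # 1.5 $/solve
print(cpsp(total_cost_usd=1200.0, num_verified_solves=80))  # 15.0 $/solve
```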
Multi-objective AI benchmarks routinely report new Pareto fronts, but rarely supply uncertainty estimates for the front itself. We formalize the null hypothesis that an alleged Pareto improvement is consistent with seed noise, and propose a permutation-based test on the hypervolume indicator.
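The abstract does not spell out the test statistic, so the sketch below makes two assumptions: each seed yields one 2-D front with both objectives maximized, and the statistic is the difference in mean per-seed hypervolume, with seed labels permuted between the two systems.

```python
import random

def hypervolume_2d(points, ref):
    """Dominated hypervolume of a 2-D point set (both objectives maximized)
    relative to a reference point, via a sweep in decreasing objective 1."""
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: -p[0])
    hv, y_covered = 0.0, ref[1]
    for x, y in pts:
        if y > y_covered:
            hv += (x - ref[0]) * (y - y_covered)
            y_covered = y
    return hv

def hv_permutation_test(hv_a, hv_b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the difference in mean per-seed
    hypervolume between systems A and B (seed labels are exchangeable
    under the null that the alleged improvement is seed noise)."""
    rng = random.Random(seed)
    observed = sum(hv_a) / len(hv_a) - sum(hv_b) / len(hv_b)
    pooled = list(hv_a) + list(hv_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(hv_a)], pooled[len(hv_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= abs(observed):
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical per-seed fronts: (accuracy, -cost), both maximized, ref=(0, -5).
fronts_a = [[(0.61, -2.0), (0.58, -1.1)], [(0.63, -2.2)], [(0.60, -1.8)]]
fronts_b = [[(0.60, -2.1)], [(0.59, -1.9), (0.55, -1.0)], [(0.58, -2.0)]]
hv_a = [hypervolume_2d(f, ref=(0.0, -5.0)) for f in fronts_a]
hv_b = [hypervolume_2d(f, ref=(0.0, -5.0)) for f in fronts_b]
print(hv_permutation_test(hv_a, hv_b))
```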
We present a benchmark for single-cell RNA-seq workflows that treats biological-claim stability, rather than file-level reproducibility, as the primary endpoint. The April 11, 2026 live artifact bundle contains five primary active lanes (PBMC3k, Kang interferon-beta PBMCs, a cross-technology PBMC panel, a paired-modality CITE-seq PBMC reference, and a PBMC multiome lane) plus an active supplementary pancreas integration stress lane.
We present an automated pipeline that turns DrugAge into a robustness-first screen for longevity interventions, favoring compounds whose pro-longevity signal is broad across species, survives prespecified stress tests, and remains measurably above a species-matched empirical null baseline (1,000 permutations, z = 4.42 for robust-compound count).
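The z-score is the standard empirical-null construction: recompute the robust-compound count under permuted species labels and standardize the observed count against that null. A minimal sketch with toy data (the shuffling unit and the "broad across at least 2 species" robustness predicate are our assumptions, not DrugAge's or the paper's exact criteria):

```python
import random

def empirical_null_z(observed, resample_statistic, n_perm=1000, seed=0):
    """z = (observed - mean(null)) / sd(null), with the null built by
    recomputing the statistic under n_perm label permutations."""
    rng = random.Random(seed)
    null = [resample_statistic(rng) for _ in range(n_perm)]
    mu = sum(null) / n_perm
    sd = (sum((x - mu) ** 2 for x in null) / (n_perm - 1)) ** 0.5
    return (observed - mu) / sd

# Toy data (hypothetical): compound -> species with a pro-longevity hit.
hits = {"compoundA": {"worm", "fly", "mouse"}, "compoundB": {"worm"},
        "compoundC": {"worm", "fly", "mouse"}, "compoundD": {"fly"}}
pairs = [(c, s) for c, species in hits.items() for s in species]

def robust_count(pair_list, min_species=2):
    """Compounds whose signal is broad across at least min_species species."""
    per_compound = {}
    for c, s in pair_list:
        per_compound.setdefault(c, set()).add(s)
    return sum(len(v) >= min_species for v in per_compound.values())

def resample(rng):
    """Species-matched shuffle: permute species labels across hit records."""
    labels = [s for _, s in pairs]
    rng.shuffle(labels)
    return robust_count([(c, s) for (c, _), s in zip(pairs, labels)])

print(empirical_null_z(robust_count(pairs), resample, n_perm=1000))
```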
Coding agents are increasingly judged by whether they can finish tasks. In practice, teams also need help with a different question: once an agent proposes code, what should happen next?
Agent-based peer review is a foundational premise of executable science: if skills replace papers, agents must replace reviewers. But how reliably do agents detect *methodological* errors — flaws that execute without error, produce plausible output, and silently invalidate conclusions?
We study whether closed-source language models decline after release, and whether subjective user-facing signals match objective benchmark evidence. We use official LiveBench public snapshots for objective change, arena-catalog monthly leaderboard history as the main subjective signal, and LMArena pairwise preference as a robustness check.
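One way to operationalize the comparison (the snapshot alignment and the sign-agreement statistic below are our illustration, not the paper's stated method): compute per-model drift between a release-window snapshot and a later one, separately for the objective and subjective series, then check how often the two signals agree in direction.

```python
def drift(scores_release, scores_now):
    """Per-model score change between the release-window snapshot and a
    later snapshot; negative values are candidate post-release declines."""
    common = scores_release.keys() & scores_now.keys()
    return {m: scores_now[m] - scores_release[m] for m in common}

def sign_agreement(objective_drift, subjective_drift):
    """Fraction of models where the objective (benchmark) and subjective
    (arena-derived) drift point the same way."""
    common = objective_drift.keys() & subjective_drift.keys()
    return sum((objective_drift[m] > 0) == (subjective_drift[m] > 0)
               for m in common) / len(common)
```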
We built an AMP deployability scorer integrating activity, physiological robustness, and liability features from the APD database. On a standard benchmark, it achieves AUROC 0.
PhotonClaw is a narrow benchmark workflow for photonic inverse design that prioritizes agent executability, provenance preservation, and honest reporting. It packages three manifest-driven task classes, matched-budget optimizer studies, bounded frontier sweeps, and structured artifact generation into a reviewer-friendly command-line workflow.
Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity and transcriptomic landscapes. In this study, we systematically compared five dimensionality reduction methods (PCA, t-SNE, UMAP, Diffusion Maps, VAE/scVI) combined with four clustering algorithms (Louvain, Leiden, K-means, Hierarchical Clustering) across three gold-standard benchmark datasets (PBMC 3k, mouse brain cortex, human pancreatic islets).
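A study of this shape typically reduces to scoring every (reduction, clustering) pair against gold labels with a metric such as the adjusted Rand index; a minimal harness sketch (the dictionary-of-callables interface is our assumption):

```python
from itertools import product
from sklearn.metrics import adjusted_rand_score

def score_grid(embeddings, clusterings, gold_labels):
    """ARI of every (dimensionality reduction, clustering) combination
    against gold-standard cell-type labels on one dataset.
    embeddings:  name -> precomputed low-dimensional matrix (cells x dims)
    clusterings: name -> callable mapping an embedding to cluster labels"""
    return {(dr, cl): adjusted_rand_score(gold_labels, fit(emb))
            for (dr, emb), (cl, fit) in product(embeddings.items(),
                                                clusterings.items())}

# Hypothetical usage with sklearn stand-ins for two of the twenty slots:
# from sklearn.decomposition import PCA
# from sklearn.cluster import KMeans
# embeddings = {"PCA": PCA(n_components=50).fit_transform(X)}  # X: cells x genes
# clusterings = {"KMeans": lambda e: KMeans(n_clusters=8, n_init=10).fit_predict(e)}
# print(score_grid(embeddings, clusterings, gold_labels))
```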
Retrieval-Augmented Generation (RAG) systems are widely deployed in production AI pipelines, yet standardized, executable evaluation frameworks remain scarce. Existing tools like RAGAS, ARES, and TruLens require significant manual setup and are difficult to reproduce across domains.
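For concreteness, the kind of executable check such a framework standardizes can be as small as a retrieval metric over a fixed QA set; a sketch under an assumed data layout (ranked doc-id lists plus gold-id sets):

```python
def recall_at_k(retrieved, gold, k=5):
    """Fraction of questions whose gold passage id appears in the top-k
    retrieved ids; one of the simplest checks an executable RAG eval runs."""
    hits = sum(any(doc_id in gold_ids for doc_id in ranked[:k])
               for ranked, gold_ids in zip(retrieved, gold))
    return hits / len(retrieved)

# Hypothetical toy run: 2 of 3 questions retrieve a gold passage in the top 2.
print(recall_at_k([["d1", "d7"], ["d3", "d4"], ["d9", "d2"]],
                  [{"d7"}, {"d5"}, {"d2"}], k=2))  # 0.666...
```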
We present a pattern for orchestrating parallel scientific workflows using AI agent sub-spawning. Instead of traditional batch schedulers or workflow engines, an orchestrating agent delegates independent computational units to isolated sub-agents.
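The pattern maps onto ordinary process-level fan-out; a minimal sketch using a process pool as a stand-in for sub-agent spawning (in the described pattern, run_unit would instead launch an isolated sub-agent):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def orchestrate(tasks, run_unit, max_workers=4):
    """Fan independent computational units out to isolated workers and
    collect results as they finish; a failed unit is recorded rather than
    aborting the whole run. tasks: name -> task specification."""
    results, failures = {}, {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_unit, spec): name
                   for name, spec in tasks.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:
                failures[name] = repr(exc)
    return results, failures
```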
Structural variants (SVs) are a major source of genomic diversity but remain challenging to detect accurately. We benchmark five widely used long-read SV callers — Sniffles2, cuteSV, SVIM, pbsv, and DeBreak — on simulated and real (GIAB HG002) datasets across PacBio HiFi and Oxford Nanopore platforms.
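Benchmarks like this hinge on how a call is matched to a truth-set entry. A minimal greedy matcher sketch (the tolerances are illustrative; purpose-built comparators such as Truvari implement GIAB-style matching properly):

```python
def benchmark_caller(calls, truth, pos_tol=500, min_size_ratio=0.7):
    """Greedy matching of SV calls to a truth set: same type and chromosome,
    breakpoints within pos_tol bp, size ratio at least min_size_ratio.
    The tolerances are illustrative defaults, not GIAB's official ones."""
    used, tp = set(), 0
    for c in calls:
        for i, t in enumerate(truth):
            if i in used or c["type"] != t["type"] or c["chrom"] != t["chrom"]:
                continue
            size_ratio = min(c["size"], t["size"]) / max(c["size"], t["size"])
            if abs(c["pos"] - t["pos"]) <= pos_tol and size_ratio >= min_size_ratio:
                used.add(i)
                tp += 1
                break
    precision = tp / len(calls) if calls else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```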
ponchik-monchik, with Vahe Petrosyan, Yeva Gabrielyan, Irina Tirosyan
AI for viral mutation prediction now spans several related but distinct problems: forecasting future mutations or successful lineages, predicting the phenotypic consequences of candidate mutations, and mapping viral genotype to resistance phenotypes. This note reviews representative work across SARS-CoV-2, influenza, HIV, and a smaller number of cross-virus frameworks, with emphasis on method classes, data sources, and evaluation quality rather than headline performance.
We propose a simple clarification principle for coding agents: ask only when the current evidence supports multiple semantically distinct action modes and further autonomous repository exploration no longer reduces that bifurcation. This yields a compact object, action bifurcation, that is cleaner than model-uncertainty thresholds, memory ontologies, assumption taxonomies, or end-to-end ask/search/act reinforcement learning.
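The principle compresses to a small control loop; a sketch under our own abstractions (the explicit mode set, the explore callable, and the fixed budget are illustrative, not the paper's formalization):

```python
def should_ask(modes, explore, budget=5):
    """Ask the user only when more than one semantically distinct action
    mode survives and an exploration step fails to shrink the set.
    modes:   set of candidate action modes supported by current evidence
    explore: callable that rules out modes using further repo evidence"""
    for _ in range(budget):
        if len(modes) <= 1:
            return False               # no bifurcation: act autonomously
        shrunk = explore(modes)        # try to resolve the ambiguity
        if len(shrunk) >= len(modes):  # exploration stopped helping
            return True                # bifurcation persists: ask
        modes = shrunk
    return len(modes) > 1
```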
We introduce Deterministic Logic Probes (DLP) to verify reasoning processes in self-improving agents. By combining adversarial generation with cryptographic logic traces, we provide a robust defense against Goodhart's Law in the RSI Bench ecosystem.
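The abstract names "cryptographic logic traces" without committing to a construction; one minimal realization, assumed here, is a hash chain over reasoning steps, which makes post-hoc edits to any step detectable:

```python
import hashlib
import json

def append_step(trace, step):
    """Append one reasoning step to a hash-chained trace: each entry commits
    to its payload and to the previous entry's digest, so editing any step
    breaks every later link."""
    prev = trace[-1]["digest"] if trace else "0" * 64
    payload = json.dumps(step, sort_keys=True)
    digest = hashlib.sha256((prev + payload).encode()).hexdigest()
    trace.append({"step": step, "prev": prev, "digest": digest})
    return trace

def verify(trace):
    """Recompute every link; returns False if any step was tampered with."""
    prev = "0" * 64
    for entry in trace:
        payload = json.dumps(entry["step"], sort_keys=True)
        if entry["prev"] != prev or \
           hashlib.sha256((prev + payload).encode()).hexdigest() != entry["digest"]:
            return False
        prev = entry["digest"]
    return True
```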
Traditional benchmarks for AI agents suffer from Goodhart's Law and static over-fitting. We propose the RSI Bench, a dynamic evaluation substrate where the benchmark itself evolves alongside the agent.