A Calibrated Claim-Stability Benchmark for Single-Cell RNA-seq Workflows
Problem and Thesis
Most single-cell RNA-seq workflows are reproducible at the container, file, or script level, but they are not calibrated at the level that matters scientifically: whether the biological claims survive reasonable perturbations. The central contribution of this repository is therefore no longer a PBMC3k workflow note. It is a benchmark that asks whether a workflow can distinguish robust from fragile biological conclusions across multiple public single-cell lanes.
The benchmark is built around three ideas:
- Claim stability is the primary endpoint. The benchmark asks whether canonical biological claims degrade under lane-compatible negative controls.
- External validity is lane-specific. Orthogonal or label-backed metrics are necessary, but they should be used as gates only where they are biologically appropriate.
- Reference-based calibration should be statistical, not rhetorical. Bootstrap confidence intervals and empirical-null permutation p-values quantify how far observed concordance sits above chance.
This is a claim-stability calibration contribution for scRNA-seq workflows, not a firstness claim about scRNA benchmark automation in general. The relevant novelty is the biological-claim endpoint, the frozen control-compatibility policy, and the calibration of lane-specific external evidence.
Benchmark Design
Freeze Contract
Each benchmark lane is normalized into a common freeze contract:
- canonical_input.h5ad
- dataset_manifest.json
- freeze_audit.json
- source_provenance.md
- benchmark_protocol.json
The runtime pipeline consumes only canonical_input.h5ad. Freeze metadata records the source identifier, source URL, download or publication provenance, SHA256, retained metadata columns, and observed matrix shape.
The benchmarked workflow itself is a Scanpy-centered canonical pipeline: highly variable gene flagging, PCA, nearest-neighbor graph construction, Leiden clustering over frozen resolution sweeps, UMAP for visualization, and marker-based cluster annotation.
Lane Taxonomy
The current submitted artifact bundle is built from six active lanes. A deferred Tabula Sapiens extension remains outside this evidence set.
Primary active lanes:
- pbmc3k: easy sanity lane for canonical PBMC lineage recovery
- kang_ifnb: donor and stimulation lane based on the Kang et al. interferon-beta PBMC dataset
- pbmcsca: cross-technology PBMC lane derived from the Ding et al. comparison panel
- citeseq_pbmc: orthogonal-modality lane with paired protein-backed labels
- multiome_pbmc: orthogonal-modality PBMC lane with paired chromatin-derived reference labels
Active supplementary lane:
- pancreas_integration: cross-study pancreas stress lane on the harmonized scIB panel
Deferred lane outside the submitted evidence set:
- tabula_sapiens_subset: atlas-stress extension for later expansion
Claim Families
The benchmark evaluates config-driven claim families rather than raw cluster IDs:
- lineage_presence
- structure_preservation
- stability_consistency
- protein_concordance, where protein-backed labels exist
This design lets the benchmark score whether the workflow still supports the intended biological conclusion even when exact partitions or embeddings move.
Claim-Score Definition
For a lane with claim-family set F, each family f contains a finite set of Boolean checks D_f. The family score is the fraction of those checks that pass:
s_f = (1 / |D_f|) * sum_{d in D_f} I[d passes]
The lane-level claim score is the weighted mean of family scores:
S = (sum_{f in F} w_f * s_f) / (sum_{f in F} w_f)
A run is marked overall_status = passed only if both conditions hold:
- S >= tau_overall
- every family clears its own frozen threshold: s_f >= tau_f
The primary endpoint compares canonical and compatible-control claim scores through margins m_c = S_canonical - S_control. A lane passes claim-score control separation only if every compatible control clears its configured degradation threshold delta_c. In the current active six-lane bundle, all configured active-lane controls use delta_c = 0.05.
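The score and margin definitions above can be sketched directly. The function names below are illustrative, not the repository's actual API.

```python
def family_score(checks: list[bool]) -> float:
    """s_f: fraction of Boolean claim checks in one family that pass."""
    return sum(checks) / len(checks)


def lane_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """S: weighted mean of family scores, normalized by the weight sum."""
    return sum(weights[f] * s for f, s in scores.items()) / sum(
        weights[f] for f in scores
    )


def lane_passes(scores, weights, tau_overall, tau_family) -> bool:
    """overall_status == passed: S clears tau_overall AND every s_f clears tau_f."""
    return lane_score(scores, weights) >= tau_overall and all(
        s >= tau_family[f] for f, s in scores.items()
    )


def control_degraded(s_canonical: float, s_control: float,
                     delta_c: float = 0.05) -> bool:
    """m_c = S_canonical - S_control; a compatible control counts as
    degraded only when the margin clears delta_c."""
    return (s_canonical - s_control) >= delta_c
```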
The lane-specific weights and overall thresholds are frozen before evaluation:
- pbmc3k: uniform five-family weights of 0.20 each, overall threshold 0.90
- kang_ifnb: lineage 0.30, composition 0.20, structure 0.25, stability 0.25, overall threshold 0.80
- pbmcsca, multiome_pbmc, and pancreas_integration: lineage 0.35, structure 0.35, stability 0.30, overall threshold 0.80
- citeseq_pbmc: lineage 0.25, structure 0.25, stability 0.20, protein 0.30, overall threshold 0.80
The paper bundle now also emits these parameters as a generated artifact table at outputs/benchmark_live_20260411_ceiling/paper_tables/claim_score_definition_table.md.
Negative Controls
The current control panel includes lane-compatible perturbations:
- overcluster
- undercluster
- hvg_truncation
- marker_shuffle
- protein_shuffle, for the CITE-seq lane
- chromatin_shuffle, for the multiome lane
Compatibility is explicit. The primary claim-score endpoint is evaluated only against claim-compatible controls; modality-label shuffles are reserved for the matching orthogonal metric. In the April 11 bundle this means protein_shuffle and chromatin_shuffle do not count against the primary claim-score pass/fail endpoint, while marker_shuffle and generic pipeline overrides do.
The CITE-seq lane intentionally uses a smaller claim-compatible panel than the generic PBMC lanes. Its frozen primary endpoint keeps overcluster and marker_shuffle as claim-compatible controls because the lane's main external test is already the protein-backed agreement gate; protein_shuffle probes that orthogonal gate directly, while transcript-only undercluster and hvg_truncation are not part of the frozen CITE primary claim panel.
Gating Versus Diagnostic Metrics
The benchmark does not treat every external metric as a universal gate.
- pbmc3k: no external gate
- kang_ifnb: ARI, NMI, and majority purity are diagnostic only
- pbmcsca: ARI, NMI, and majority purity are diagnostic only
- citeseq_pbmc: protein_agreement_overall is the gating external metric; ARI, NMI, and majority purity are diagnostic
- multiome_pbmc: chromatin_agreement_overall remains diagnostic in the current bundle because compatible-control separation is too weak for gate promotion
- pancreas_integration: ARI, NMI, and majority purity are diagnostic only
This policy is deliberate. The live artifact set shows that coarse reference-label concordance does not behave as a reliable gate for every lane-control combination, especially in donor-condition and cross-technology settings.
Statistical Calibration
The checked-in April 11, 2026 live bundle now uses one paper-depth calibration policy throughout:
- 2000 bootstrap replicates for canonical reference-backed metrics
- 10000 empirical-null label permutations for canonical reference-backed metrics
- 2000 compatible-control margin bootstraps and 10000 sign-flip null draws for the primary claim score
- 95% percentile confidence intervals
For reference-backed metrics, the null is defined by permuting reference labels against preserved predicted labels and cluster assignments. Control separation for those metrics is then summarized from the observed control rows rather than from separate control-run bootstraps. For the primary endpoint, claim-score uncertainty is summarized from compatible-control margins rather than from cell-level resampling.
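A percentile bootstrap of the kind used for the claim-score margins can be sketched as follows; the function is an illustrative stand-in, not the repository's calibration code.

```python
import random


def percentile_ci(values: list[float], reps: int = 2000, alpha: float = 0.05,
                  seed: int = 0) -> tuple[float, float]:
    """95% percentile bootstrap interval for the mean of `values`,
    e.g. per-control compatible-control claim-score margins."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n for _ in range(reps)
    )
    return means[int(reps * alpha / 2)], means[int(reps * (1 - alpha / 2)) - 1]
```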
Per-lane sign-flip p-values are retained in the artifact bundle for completeness, but they are discretized by construction: with k compatible controls, the minimum attainable p-value is 1 / 2^k. In the current bundle that means lanes with four compatible controls cannot go below 1/16, and the CITE-seq lane with two compatible controls cannot go below 1/4. We therefore interpret sign-flip support primarily at the pooled benchmark level, where the combined control count yields a useful null resolution.
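The discretization is easy to see in an exact sign-flip computation over all 2^k assignments; this sketch is illustrative, not the bundle's sampled-null implementation.

```python
from itertools import product


def sign_flip_p(margins: list[float]) -> float:
    """Exact sign-flip null for the mean compatible-control margin.

    With k margins there are 2^k equiprobable sign assignments, so the
    smallest attainable p-value is 1 / 2^k."""
    observed = sum(margins)
    extreme = sum(
        1 for signs in product((1, -1), repeat=len(margins))
        if sum(s * m for s, m in zip(signs, margins)) >= observed
    )
    return extreme / 2 ** len(margins)
```

With four all-positive margins the p-value bottoms out at 1/16, and with two at 1/4, matching the per-lane resolution limits described above.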
Frozen Evaluation Policy
The central policy surfaces were frozen before the April 11, 2026 evaluation pass. In particular, lane inclusion, claim-compatible controls, modality-only controls, external gate assignments, comparator families, and canonical selection bounds were fixed in config before the final bundle was regenerated.
| Dataset | Claim-compatible controls | Modality-only controls | External gate(s) | Canonical bounds |
|---|---|---|---|---|
| pbmc3k | overcluster, undercluster, hvg_truncation, marker_shuffle | - | none | 8-10 clusters; resolutions 0.4, 0.6, 0.8, 1.0, 1.2 |
| kang_ifnb | overcluster, undercluster, hvg_truncation, marker_shuffle | - | none | 6-14 clusters; resolutions 0.4, 0.6, 0.8, 1.0, 1.2 |
| pbmcsca | overcluster, undercluster, hvg_truncation, marker_shuffle | - | none | 6-14 clusters; resolutions 0.2, 0.3, 0.4, 0.5 |
| citeseq_pbmc | overcluster, marker_shuffle | protein_shuffle | protein_agreement_overall | 6-14 clusters; resolutions 0.4, 0.6, 0.8, 1.0, 1.2 |
| multiome_pbmc | overcluster, undercluster, hvg_truncation, marker_shuffle | chromatin_shuffle | none in v1 | 6-14 clusters; resolutions 0.4, 0.6, 0.8, 1.0, 1.2 |
| pancreas_integration | overcluster, undercluster, hvg_truncation, marker_shuffle | - | none in v1 | 6-12 clusters; resolutions 0.05, 0.055, 0.06, 0.07 |
Comparator coverage was frozen alongside this table: PBMC-family lanes carry CellTypist, SingleR, and Azimuth; multiome adds Azimuth ATAC; pancreas ranking claims currently rely on SingleR, while same-bundle pancreas CellTypist runs are retained but reported as inconclusive_ontology_mismatch below the mapped-fraction floor. The artifact bundle emits the full generated frozen-policy surface at outputs/benchmark_live_20260411_ceiling/paper_tables/frozen_policy_table.md.
Datasets
PBMC3k
PBMC3k remains the easiest positive-control lane. It is useful for proving that the benchmark can pass a well-behaved canonical PBMC analysis, but it is not the paper’s headline contribution.
Kang IFN-beta PBMC
The Kang lane tests whether a workflow preserves donor and interferon-response structure rather than merely coarse lineage labels. This is the lane where condition-signal diagnostics matter most.
PBMCSCA
The PBMCSCA lane tests robustness across technologies. In the current artifact bundle it uses a published canonical H5AD derived from the SCP424 matrix-plus-metadata source bundle and pinned by SHA256. The live canonical policy is intentionally lower-resolution than the original draft because the benchmark must reward stable cross-technology claims, not overfit fine label granularity that the lane does not support cleanly.
CITE-seq PBMC
The CITE-seq lane adds orthogonal protein evidence. Here exact protein-backed agreement is a legitimate gate, while transcript-only concordance metrics remain diagnostic.
Multiome PBMC
The multiome lane adds paired chromatin-derived reference labels built from gene-activity annotation on the matched ATAC modality. In the current bundle it strengthens the claim-score benchmark as a fifth primary lane, but chromatin agreement remains diagnostic because its compatible-control separation is positive but too small for gate promotion.
Pancreas Integration
The pancreas lane is an active supplementary stress lane built from the harmonized scIB panel. It broadens the paper beyond PBMC-only evidence at the claim-score level, while its label-backed metrics remain diagnostic in v1. Its canonical resolution sweep is intentionally much lower than the PBMC lanes because the cross-study endocrine and exocrine major-label structure separates at substantially coarser Leiden resolutions than PBMC immune subsets.
Primary Results
Canonical Pass Status
In the April 11, 2026 live artifact bundle, all six active canonical lanes passed claim evaluation:
- pbmc3k: overall score 1.000
- kang_ifnb: overall score 0.917
- pbmcsca: overall score 1.000
- pancreas_integration: overall score 0.912
- citeseq_pbmc: overall score 0.915
- multiome_pbmc: overall score 1.000
All six also passed no-rerun semantic verification. The pancreas canonical selection now chooses a low-resolution run that satisfies the major-label claim set inside the configured 6-12 cluster bound rather than the earlier 31-cluster path. Across the six active lanes, the mean canonical overall score was 0.957.
Claim-Score Control Separation
The primary benchmark endpoint is claim-score degradation under compatible negative controls. All active workflow lanes cleared that endpoint in the refreshed bundle:
- pbmc3k: 4/4 compatible controls degraded
- kang_ifnb: 4/4 compatible controls degraded
- pbmcsca: 4/4 compatible controls degraded
- pancreas_integration: 4/4 compatible controls degraded
- citeseq_pbmc: 2/2 compatible claim controls degraded, with protein_shuffle excluded from the primary endpoint
- multiome_pbmc: 4/4 compatible claim controls degraded, with chromatin_shuffle excluded from the primary endpoint
At the benchmark level, claim-score control separation therefore passed across the full six-lane active set. Across the five calibration-core lanes, the mean compatible-control margin was 0.338 with 95% bootstrap CI 0.298-0.380 and sign-flip empirical p-value approximately 1e-4.
The headline statistics use two distinct denominators on purpose. The active-set pass count is 6/6 lanes because it includes the supplementary pancreas lane. The benchmark-level calibrated claim-margin summary uses only the five calibration-core lanes (pbmc3k, kang_ifnb, pbmcsca, citeseq_pbmc, multiome_pbmc), because pancreas is a supplementary stress lane rather than part of the calibration-core headline set.
External Validity
Canonical external-validity summaries were:
- kang_ifnb: ARI 0.469, NMI 0.562, majority purity 0.717
- pbmcsca: ARI 0.404, NMI 0.518, majority purity 0.699
- pancreas_integration: ARI 0.303, NMI 0.485, majority purity 0.651
- citeseq_pbmc: ARI 0.484, NMI 0.626, majority purity 0.907, exact protein agreement 0.588, compatibility-aware protein agreement 0.713, broad-label protein agreement 0.646
- multiome_pbmc: ARI 0.161, NMI 0.254, majority purity 0.679, chromatin agreement 0.479
Only the CITE-seq protein-backed metric is used as a gate in the current bundle. Its control separation remained strong: canonical exact protein agreement 0.588 versus mean compatible-control agreement 0.163, with mean separation margin 0.425 and all 3/3 compatible controls ranked below canonical.
The multiome lane now passes the primary claim endpoint, but its orthogonal chromatin metric is not promoted. Canonical chromatin agreement is 0.479, yet its compatible-control mean separation margin is only 0.016, only half of compatible controls rank below canonical, and its empirical-null p-value is 1.0. In this bundle, chromatin evidence therefore remains diagnostic rather than gating. Its low transcript-to-reference ARI and NMI are expected in this setting because transcript-derived labels are being compared against chromatin-derived reference labels, so bridge noise enters before any orthogonal gate-promotion decision is made.
Kang, PBMCSCA, and pancreas external label-backed metrics are all statistically above their empirical nulls on canonical runs, but they do not separate cleanly enough under the current control panels to serve as hard gates. They remain diagnostic.
Naive Uniform Gate Ablation
The lane-specific gate policy is not a retrospective convenience rule. Under a naive ablation that promotes every available external metric to a hard gate, 0/5 externally evaluated lanes would pass control separation in the current bundle:
- kang_ifnb would fail because NMI and majority_purity do not separate cleanly even though ARI does
- pbmcsca would fail because ARI, NMI, and majority_purity all remain unstable under compatible controls
- pancreas_integration would fail because all three label-backed metrics degrade under stress controls
- citeseq_pbmc would fail because ARI and NMI fail even though the protein-backed gate passes strongly
- multiome_pbmc would fail because both transcript label metrics and chromatin agreement remain too weak for gate promotion
That ablation is exactly why the paper freezes a lane-specific gate/diagnostic policy instead of pretending every external metric is a universal benchmark gate. The generated ablation table is emitted at outputs/benchmark_live_20260411_ceiling/paper_tables/uniform_gate_ablation_table.md.
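The ablation's counting rule can be sketched in a few lines; the function and the example margins below are illustrative, not values drawn from the artifact bundle.

```python
def uniform_gate_survivors(margins_by_lane: dict[str, dict[str, float]],
                           delta: float = 0.0) -> int:
    """Count lanes that would still pass control separation if every
    available external metric were promoted to a hard gate: each metric's
    mean compatible-control margin must exceed delta."""
    return sum(
        all(margin > delta for margin in metric_margins.values())
        for metric_margins in margins_by_lane.values()
    )
```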
Comparator Surfaces
The refreshed canonical comparator bundle now includes four live method families in the same artifact set:
- CellTypist across all six active lanes
- SingleR across all six active lanes
- Azimuth RNA across the five PBMC-family primary lanes
- Azimuth ATAC on multiome_pbmc
The cross-method picture is heterogeneous rather than rhetorical.
- CellTypist mapped confidently on the five PBMC-family lanes, with mapped fractions between 0.897 and 0.998, but remained inconclusive on pancreas at 0.524, below the locked 0.70 floor. Its claim-score control separation still passed on all five primary lanes, but every primary canonical comparator run already failed claim evaluation at baseline, with canonical scores ranging from 0.400 to 0.5625, and CITE exact protein agreement collapsed to 0.0316. This is the clearest negative result in the current bundle: a transcript-only reference mapper can be consistently degraded by controls while still failing the biologically grounded endpoint before perturbation.
- SingleR mapped conclusively on all six active lanes, including pancreas at mapped fraction 0.989. It passed comparator control separation on pbmc3k, pbmcsca, citeseq_pbmc, and multiome_pbmc, but failed on kang_ifnb with mean margin 0.00625 and on pancreas with mean margin -0.0802. Pancreas is therefore no longer excluded from ranking claims because of ontology mismatch; it is now a real comparator lane whose control behavior is simply weak.
- Azimuth RNA mapped conclusively on all five PBMC-family primary lanes, all at mapped fraction 1.000. It passed comparator control separation on pbmc3k, kang_ifnb, citeseq_pbmc, and multiome_pbmc, while pbmcsca remained borderline with mean margin 0.0481 and only 3/4 degrading controls. On the orthogonal lanes, Azimuth RNA achieved CITE exact protein agreement 0.8318 and multiome chromatin agreement 0.5816, both higher than the workflow's own transcript-derived concordance summaries.
- Azimuth ATAC landed as a real multiome-only comparator on the frozen 10x fragments and exact canonical barcode set. Its canonical mapped fraction was 1.000, its comparator claim score was 0.9417, and its chromatin agreement was 0.5761. Claim-score control separation passed across the multiome control panel. However, the bridge/chromatin agreement itself only moved under chromatin_shuffle in this slice, so we keep it as comparator evidence rather than claiming a new hard orthogonal gate.
Bootstrap Intervals and Empirical Nulls
Canonical reference-backed metrics remained well above the empirical null for Kang, PBMCSCA, pancreas, and the CITE-seq gate, with empirical p-values approximately 1e-4 for those canonical runs. The refreshed 95% bootstrap intervals were:
- kang_ifnb: ARI 0.463-0.475, NMI 0.557-0.567
- pbmcsca: ARI 0.394-0.414, NMI 0.511-0.525
- pancreas_integration: ARI 0.296-0.310, NMI 0.480-0.491
- citeseq_pbmc: ARI 0.461-0.509, NMI 0.604-0.655, protein agreement 0.566-0.610
- multiome_pbmc: ARI 0.130-0.197, NMI 0.220-0.300
The primary claim-score endpoint was calibrated separately from compatible-control margins. Its benchmark-level 95% interval was 0.298-0.380 around an observed mean margin of 0.338. Multiome chromatin agreement did not exceed its empirical null under the current diagnostic path.
Case Studies and Weak Spots
Kang: Why ARI and NMI Are Diagnostic Only
The Kang lane is fundamentally about donor and interferon-response biology, not about treating coarse cell-type labels as immutable cluster ground truth. The live condition-signal analysis makes this clear. Under the overcluster control, interferon-signal retention falls to 0.627 of canonical in CD4 T cells and 0.522 in CD8 T cells, while the undercluster control stays near canonical (0.998 and 1.012, respectively). This is a more biologically informative failure mode than insisting that ARI or NMI must always act as the primary gate.
CITE-seq: FCGR3A Mono Versus NK
The main weak spot in the CITE-seq lane is not hidden. FCGR3A Mono has exact agreement 0.000, and its dominant predicted label is NK in 89.4% of protein-backed reference cells. This is why the paper must report both exact and compatibility-aware summaries. The lane still clears its main gate because overall protein-backed agreement separates canonical from controls, but the FCGR3A Mono versus NK axis remains a diagnostic ambiguity that should be discussed explicitly rather than buried.
PBMCSCA: Why Reference Metrics Are Diagnostic Only
PBMCSCA still succeeds on the benchmark’s primary endpoint: the canonical run passed claim evaluation, and all 4/4 negative controls degraded the claim score. However, ARI and NMI do not separate cleanly under the current cross-technology control panel; some compatible controls improve coarse label concordance. That makes those metrics useful diagnostics but poor hard gates for this lane in the current version of the benchmark.
Reproducible Artifact Set
The paper-ready live bundle is pinned under outputs/benchmark_live_20260411_ceiling/ and includes:
- benchmark_summary.json
- control_separation.json
- external_validity_summary.json
- bootstrap_intervals.csv
- empirical_null_summary.json
- comparators/comparator_summary.json
- comparators/comparator_scores.csv
- paper_tables/*.md, including claim_score_definition_table.md, frozen_policy_table.md, and uniform_gate_ablation_table.md
- paper_figures/*.png
- paper_figures/figure_data/*.csv
- paper_figures/release_manifest.json
The release manifest records source URLs, SHA256 values, benchmark commands, and artifact hashes so that the figures and tables are tied to one exact live bundle.
Discussion
The current six-lane bundle supports a clearer methodological conclusion than the original draft. Claim-score degradation is the most reliable benchmark endpoint across heterogeneous lanes, while external evidence must stay lane-specific. CITE-seq shows the positive case: a genuinely orthogonal modality can support a hard gate. Kang, PBMCSCA, pancreas, and multiome show the negative case: coarse transcript-reference concordance or noisy bridge labels can be informative without being trustworthy as universal gates.
The comparator layer is now scientifically useful rather than decorative. No single external method dominates the benchmark. CellTypist is the strongest example of why this benchmark exists: it shows stable control-separation behavior on the five primary lanes, yet its canonical runs already fail claim evaluation and it collapses on the CITE protein-backed metric. SingleR is stronger on pancreas and CITE but still fails Kang and pancreas control separation. Azimuth RNA is strongest on the PBMC-family lanes and improves CITE orthogonal agreement substantially, yet it remains borderline on PBMCSCA. The benchmark is therefore distinguishing method behavior, not just workflow self-consistency.
The calibration outputs add another layer of discipline. The benchmark-level claim-margin CI excludes zero, the sign-flip null supports non-random separation, and the emitted sensitivity curve shows the expected monotone increase in scaled degradation as perturbation strength rises. That curve is especially informative for near-boundary lanes such as Kang, where weaker observed margins only cross the 0.05 degradation threshold under moderate or stronger synthetic corruption, matching the lane’s smaller live compatible-control margins.
Multiome is also more informative as a diagnostic non-promotion than it would be as a forced second gate. The lane passes the primary endpoint and supports same-bundle chromatin-backed comparisons, but the current gene-activity and bridge-style orthogonal paths do not separate cleanly enough under compatible controls to justify promotion. Reporting that empirical non-promotion is more credible than silently upgrading chromatin evidence to match the narrative.
Limitations
This version is materially stronger than the original PBMC3k-only note, but it is not yet the absolute ceiling.
- multiome_pbmc and pancreas_integration are now active, but multiome chromatin evidence remains diagnostic rather than promoted to a hard gate
- same-bundle comparator evidence now spans CellTypist, SingleR, Azimuth RNA, and Azimuth ATAC, but pancreas still has only one conclusive comparator (SingleR) while CellTypist remains inconclusive_ontology_mismatch
- no external comparator dominates the benchmark: CellTypist fails canonical claim evaluation on all five primary lanes, SingleR fails control separation on Kang and pancreas, and Azimuth RNA fails control separation on PBMCSCA
- tabula_sapiens_subset remains planned
- the CITE FCGR3A Mono ambiguity remains diagnostic rather than fully resolved
- PBMCSCA external label-backed metrics are not yet strong enough to serve as hard gates
- the active set is still PBMC-heavy despite the addition of pancreas
These are real limits and should be stated plainly.
Conclusion
The strongest current version of this repository is no longer “we ran Scanpy on PBMC3k.” It is that a six-lane, same-bundle benchmark with four live comparator families can distinguish robust from fragile single-cell conclusions by combining calibrated claim-score control separation, empirically justified lane-specific external gates, and explicit cross-method failure analysis. The most important negative result is not a crashed workflow but a stable-but-biologically-failing comparator: CellTypist remains degradable under controls while its canonical runs already miss the claim endpoint and the CITE protein-backed signal. Together with the 0/5 naive uniform-gate ablation, that makes the benchmark’s main contribution methodological rather than rhetorical.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: scrna-claim-stability-benchmark
description: Regenerate the pinned scRNA-seq claim-stability benchmark bundle from existing outputs, or reproduce it from cold start by fetching freezes, running the benchmark, refreshing comparators, and rebuilding paper artifacts.
allowed-tools: Bash(uv *, python *, ls *, test *, shasum *, Rscript *, tectonic *)
requires_python: "3.12.x"
package_manager: uv
repo_root: .
canonical_output_dir: outputs/benchmark_live_20260411_ceiling
---
# Calibrated scRNA-seq Claim-Stability Benchmark
This skill has two honest reproducibility modes:
- `Path A`: regenerate calibration, comparators, tables, figures, and manifest from the existing pinned bundle at `outputs/benchmark_live_20260411_ceiling/`
- `Path B`: reproduce the benchmark from cold start by fetching freeze data, validating freezes, running the benchmark without `--skip-run`, then rebuilding the same paper-facing artifact set
The canonical PBMC3k run remains the easiest sanity lane, not the repository headline.
## Runtime Expectations
- Platform: CPU-only
- Python: 3.12.x
- R: 4.5.3 with `Rscript`
- Package manager: `uv`
- Canonical benchmark protocol: `config/benchmark_protocol.yaml`
- Canonical paper bundle root: `outputs/benchmark_live_20260411_ceiling`
- Comparator runtime lock: `config/r_comparator_runtime_lock.json`
- Comparator reference registry: `config/comparator_references.yaml`
- Azimuth reference manifest: `data/benchmark/comparator_references/azimuth/reference_manifest.json`
- Network access is required for `Path B` and for comparator reference materialization
- Fresh clones should assume `outputs/` is absent and `data/benchmark/freeze/` may need to be fetched or rebuilt
- Cold-start reproduction is multi-hour and requires multi-GB local disk for frozen inputs, comparator references, and run outputs
## Reproducibility Modes
### Path A: Regenerate From the Existing Pinned Bundle
Use this path when the precomputed canonical runs and control runs already exist locally. This is the correct path for refreshing calibration summaries, comparator outputs, paper tables, figures, and the release manifest from the pinned April 11 bundle. The `--skip-run` benchmark command below is only valid if the default run directories already exist under `outputs/`.
Preconditions:
- `outputs/benchmark_live_20260411_ceiling/benchmark_summary.json` already exists, or the canonical and control run directories referenced by the protocol already exist under `outputs/`
- `data/benchmark/freeze/` already contains the frozen benchmark inputs
### Path B: Full Cold-Start Reproduction
Use this path from a fresh clone or any environment where `outputs/` is empty. This path fetches the real benchmark inputs, validates the freeze contracts, runs the benchmark without `--skip-run`, then regenerates the paper-facing bundle. This is the correct end-to-end reproducibility claim; it is slower and requires network, disk, and installed R comparators.
## Shared Step 1: Install the Locked Environment
```bash
uv sync --frozen
Rscript scripts/install_r_comparators.R
uv run --frozen --no-sync python scripts/materialize_azimuth_references.py
```
Success condition:
- `uv` completes without changing the lockfile
- `config/r_comparator_runtime_lock.json` exists
- `data/benchmark/comparator_references/azimuth/reference_manifest.json` exists
## Path A Step 2: Refresh the Pinned Paper Bundle
```bash
uv run --frozen --no-sync scrna-skill run-benchmark \
--protocol config/benchmark_protocol.yaml \
--out outputs/benchmark_live_20260411_ceiling \
--freeze-root data/benchmark/freeze \
--skip-run \
--include-negative-controls \
--include-calibration \
--no-verification-rerun \
--bootstrap-reps 2000 \
--null-reps 10000
```
Success condition:
- `outputs/benchmark_live_20260411_ceiling/benchmark_summary.json` exists
- `outputs/benchmark_live_20260411_ceiling/control_separation.json` exists
- `outputs/benchmark_live_20260411_ceiling/calibration_summary.json` exists
- `outputs/benchmark_live_20260411_ceiling/claim_score_empirical_null_summary.json` exists
## Path A Step 3: Refresh the Same-Bundle Comparator Surface
```bash
uv run --frozen --no-sync scrna-skill run-comparators \
--protocol config/benchmark_protocol.yaml \
--benchmark-summary outputs/benchmark_live_20260411_ceiling/benchmark_summary.json \
--out outputs/benchmark_live_20260411_ceiling/comparators_merge_test \
--methods celltypist singler azimuth
uv run --frozen --no-sync scrna-skill run-comparators \
--protocol config/benchmark_protocol.yaml \
--benchmark-summary outputs/benchmark_live_20260411_ceiling/benchmark_summary.json \
--out outputs/benchmark_live_20260411_ceiling/comparators_azimuth_atac_probe \
--methods azimuth_atac \
--dataset multiome_pbmc
uv run --frozen --no-sync python scripts/merge_comparator_bundles.py \
--protocol config/benchmark_protocol.yaml \
--benchmark-summary outputs/benchmark_live_20260411_ceiling/benchmark_summary.json \
--out outputs/benchmark_live_20260411_ceiling/comparators \
--bundle outputs/benchmark_live_20260411_ceiling/comparators_merge_test/comparator_summary.json \
--bundle outputs/benchmark_live_20260411_ceiling/comparators_azimuth_atac_probe/comparator_summary.json
```
Success condition:
- `outputs/benchmark_live_20260411_ceiling/comparators/comparator_summary.json` exists
- `outputs/benchmark_live_20260411_ceiling/comparators/comparator_scores.csv` exists
- the final paper-facing comparator surface is only the canonical bundle-root `comparators/` directory
- `comparators_*probe`, `comparators_merge_test`, `paper_tables_merge_test`, `paper_figures_merge_test`, and `comparator_summaries/` are scratch surfaces and must not be cited by the paper
## Path A Step 4: Build Paper-Facing Artifacts
```bash
uv run --frozen --no-sync scrna-skill build-paper-tables \
--summary outputs/benchmark_live_20260411_ceiling/benchmark_summary.json \
--out outputs/benchmark_live_20260411_ceiling/paper_tables
uv run --frozen --no-sync scrna-skill build-benchmark-figures \
--summary outputs/benchmark_live_20260411_ceiling/benchmark_summary.json \
--out outputs/benchmark_live_20260411_ceiling/paper_figures
cd paper && tectonic main.tex
```
Success condition:
- `outputs/benchmark_live_20260411_ceiling/paper_tables` exists
- `outputs/benchmark_live_20260411_ceiling/paper_figures` exists
- `outputs/benchmark_live_20260411_ceiling/paper_figures/release_manifest.json` exists
- `paper/main.pdf` is rebuilt from `paper/main.tex`
## Path A Step 5: Verify the Final Manifest-Backed Bundle
```bash
uv run --frozen --no-sync python - <<'PY'
import hashlib
import json
from pathlib import Path

repo_root = Path.cwd()
manifest_path = repo_root / "outputs/benchmark_live_20260411_ceiling/paper_figures/release_manifest.json"
manifest = json.loads(manifest_path.read_text())
for item in manifest["artifact_hashes"]:
    path = repo_root / item["path"]
    observed = hashlib.sha256(path.read_bytes()).hexdigest()
    if observed != item["sha256"]:
        raise SystemExit(f"hash mismatch: {item['path']}")
print(f"verified {len(manifest['artifact_hashes'])} artifacts")
PY
```
Success condition:
- every hashed artifact in the release manifest verifies cleanly
- only `comparators/`, `paper_tables/`, and `paper_figures/` are treated as canonical paper-facing bundle surfaces
## Path B Step 2: Fetch and Freeze the Benchmark Inputs
```bash
uv run --frozen --no-sync scrna-skill build-freeze-data \
--protocol config/benchmark_protocol.yaml \
--dataset all \
--out data/benchmark/freeze
uv run --frozen --no-sync scrna-skill build-freeze \
--protocol config/benchmark_protocol.yaml \
--freeze-root data/benchmark/freeze
```
Success condition:
- `data/benchmark/freeze/*/canonical_input.h5ad` exists for every active lane
- each freeze directory contains `freeze_audit.json`
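The freeze success condition can be checked mechanically. A minimal sketch, assuming the lane layout `data/benchmark/freeze/<lane>/` implied above; `incomplete_lanes` is a hypothetical helper, not part of `scrna-skill`:

```python
from pathlib import Path

# Files the freeze contract requires in every lane directory
# (from the success condition above).
REQUIRED_FREEZE_FILES = ("canonical_input.h5ad", "freeze_audit.json")

def incomplete_lanes(freeze_root):
    """Return lane directory names under freeze_root missing a required file."""
    root = Path(freeze_root)
    bad = []
    for lane_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not all((lane_dir / name).is_file() for name in REQUIRED_FREEZE_FILES):
            bad.append(lane_dir.name)
    return bad
```

Running this against `data/benchmark/freeze` after `build-freeze` should return an empty list for a healthy freeze.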
## Path B Step 3: Run the Benchmark From Scratch
```bash
uv run --frozen --no-sync scrna-skill run-benchmark \
--protocol config/benchmark_protocol.yaml \
--out outputs/benchmark_live_20260411_ceiling \
--freeze-root data/benchmark/freeze \
--include-negative-controls \
--include-calibration \
--bootstrap-reps 2000 \
--null-reps 10000
```
Success condition:
- canonical run directories are created for every active dataset
- `outputs/benchmark_live_20260411_ceiling/benchmark_summary.json` exists
- `outputs/benchmark_live_20260411_ceiling/control_separation.json` exists
- `outputs/benchmark_live_20260411_ceiling/calibration_summary.json` exists
- `outputs/benchmark_live_20260411_ceiling/claim_score_empirical_null_summary.json` exists
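The four bundle-root summaries above are the cold-start smoke test. A minimal sketch of that check, assuming only that the files sit directly under the bundle root; `missing_summaries` is a hypothetical helper name:

```python
from pathlib import Path

# Bundle-root files the success condition expects after run-benchmark
# (transcribed from the bullet list above).
EXPECTED_SUMMARIES = [
    "benchmark_summary.json",
    "control_separation.json",
    "calibration_summary.json",
    "claim_score_empirical_null_summary.json",
]

def missing_summaries(bundle_root):
    """Return the expected summary files absent from the bundle root."""
    root = Path(bundle_root)
    return [name for name in EXPECTED_SUMMARIES if not (root / name).is_file()]
```

An empty return value means the run is ready for the comparator and paper-build steps; a non-empty one names exactly what to regenerate.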
## Path B Step 4: Refresh the Same-Bundle Comparator Surface
Use the same comparator commands as `Path A Step 3` after the cold-start benchmark run has finished.
## Path B Step 5: Build Paper-Facing Artifacts
Use the same paper-build commands as `Path A Step 4`.
## Path B Step 6: Verify the Final Manifest-Backed Bundle
Use the same manifest-verification command as `Path A Step 5`.
## Optional PBMC3k Sanity Lane
```bash
test -f data/pbmc3k_raw.h5ad
shasum -a 256 data/pbmc3k_raw.h5ad
uv run --frozen --no-sync scrna-skill run --config config/canonical_pbmc3k.yaml --out outputs/canonical
uv run --frozen --no-sync scrna-skill verify --run-dir outputs/canonical
```
Expected PBMC3k SHA256:
```text
89a96f1beaa2dd83a687666d3f19a4513ac27a2a2d12581fcd77afed7ea653a1
```
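The `test -f` and `shasum` commands above can be folded into one check. This is a hedged sketch, not repository code: `sha256_of` and `pbmc3k_is_frozen` are hypothetical helper names, and the expected digest is the one quoted above.

```python
import hashlib
from pathlib import Path

# Expected digest quoted in the runbook for data/pbmc3k_raw.h5ad.
PBMC3K_SHA256 = "89a96f1beaa2dd83a687666d3f19a4513ac27a2a2d12581fcd77afed7ea653a1"

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file through SHA256 so large .h5ad inputs fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def pbmc3k_is_frozen(path="data/pbmc3k_raw.h5ad"):
    """True only if the sanity-lane input matches the frozen digest."""
    return Path(path).is_file() and sha256_of(path) == PBMC3K_SHA256
```

A mismatch here means the sanity-lane input has drifted and the lane should not be cited.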
## Required Benchmark Artifacts
- `outputs/canonical/manifest.json`
- `outputs/canonical/qc_summary.json`
- `outputs/canonical/resolution_sweep.csv`
- `outputs/canonical/cluster_markers.csv`
- `outputs/canonical/cluster_annotations.csv`
- `outputs/canonical/umap_clusters.png`
- `outputs/canonical/umap_annotations.png`
- `outputs/canonical/marker_dotplot.png`
- `outputs/canonical/pbmc3k_annotated.h5ad`
- `outputs/canonical/verification.json`
- `outputs/benchmark_live_20260411_ceiling/benchmark_summary.json`
- `outputs/benchmark_live_20260411_ceiling/control_separation.json`
- `outputs/benchmark_live_20260411_ceiling/external_validity_summary.json`
- `outputs/benchmark_live_20260411_ceiling/calibration_summary.json`
- `outputs/benchmark_live_20260411_ceiling/claim_score_bootstrap_intervals.csv`
- `outputs/benchmark_live_20260411_ceiling/claim_score_empirical_null_summary.json`
- `outputs/benchmark_live_20260411_ceiling/claim_score_sensitivity_curve.csv`
- `outputs/benchmark_live_20260411_ceiling/comparators/comparator_summary.json`
- `outputs/benchmark_live_20260411_ceiling/comparators/comparator_scores.csv`
- `outputs/benchmark_live_20260411_ceiling/paper_tables/claim_score_definition_table.md`
- `outputs/benchmark_live_20260411_ceiling/paper_tables/frozen_policy_table.md`
- `outputs/benchmark_live_20260411_ceiling/paper_tables/uniform_gate_ablation_table.md`
- `outputs/benchmark_live_20260411_ceiling/paper_figures/release_manifest.json`
- `config/r_comparator_runtime_lock.json`
- `config/comparator_references.yaml`
- `data/benchmark/comparator_references/azimuth/reference_manifest.json`
## Success Criteria
The benchmark path is successful only if:
- the benchmark command finishes successfully
- negative-control summaries and calibration outputs are written
- same-bundle comparator outputs are regenerated under `outputs/benchmark_live_20260411_ceiling/comparators/`
- paper tables and figures are generated from the same canonical artifact bundle
- the release manifest records the exact benchmark, comparator, merge, and paper-build commands used for the bundle
- the optional PBMC3k sanity lane still verifies cleanly
For the avoidance of doubt:
- `Path A` is a pinned-bundle regeneration path, not a full from-zero workflow execution path
- `Path B` is the full cold-start reproduction path