Program-Conditioned Reproducibility of Transcriptomic Signatures Is Underestimated by Cross-Context Benchmarks
Abstract
Gene expression signatures are routinely dismissed as irreproducible when they fail cross-context validation, but how much of that apparent irreproducibility is a measurement artifact? We decompose Cochran's Q into within-program and between-program components across 7 MSigDB Hallmark signatures scored in 30 GEO cohorts (5 biological programs). Between-program heterogeneity accounts for a mean of 39% of total Q (median 42%; all p < 10^-5), reaching 61% for interferon-gamma. IFN-gamma exemplifies the phenomenon: within-program Hedges' g = +1.0 (HKSJ-Bonferroni p = 0.003, 100% leave-one-out (LOO) accuracy), while outside-program g = +0.18 (NS). The benchmark, decomposition framework, and all 30 frozen cohorts are released as a public resource.
Introduction
Gene expression signatures routinely fail validation outside their discovery context. These failures are interpreted as irreproducibility. But a signature replicating across 6 interferon-response cohorts yet showing no effect in breast cancer studies is context-specific, not broken — cross-context testing conflates the two. The standard diagnostic, I-squared, quantifies total heterogeneity without distinguishing within-program instability from between-program context differences. We decompose Cochran's Q into within-program (Q_W) and between-program (Q_B) components across 7 Hallmark signatures in 30 GEO cohorts organized into 5 biological programs, quantifying for the first time how much apparent heterogeneity is a measurement artifact of context mixing.
Data
30 frozen GEO cohorts (5,451 samples) across 5 programs: inflammation (k=7, N=1,976), interferon (k=6, N=1,991), proliferation (k=5, N=847), hypoxia (k=6, N=239), EMT (k=6, N=398). Platforms: 20 Affymetrix, 4 Agilent, 6 Illumina. 29 signatures (22 primary, 7 blind): 7 MSigDB Hallmarks; 4 brittle; 3 mixed-program; 5 confounded (including 2 stealth); 3 insufficient-coverage; 7 blind holdouts.
Method
Effect size: Hedges' g (small-sample-corrected Cohen's d). I-squared decomposition: Q_tot = Q_W + Q_B, with Q_B tested against chi-squared(K-1), where K = 5 is the number of programs. Within-program meta-analysis: DerSimonian-Laird (DL) random-effects. Bonferroni correction (alpha = 0.05/9). Hartung-Knapp-Sidik-Jonkman (HKSJ) t-distribution reported as robustness check.
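The effect size and the Q partition described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's own code: the function names (`hedges_g`, `cochran_q`, `decompose_q`) are hypothetical, and it assumes fixed-effect inverse-variance weights, under which Q_tot = Q_W + Q_B holds exactly.

```python
import math
from collections import defaultdict


def hedges_g(m1, s1, n1, m0, s0, n0):
    """Return (g, var_g) from group means, SDs, and sample sizes."""
    df = n1 + n0 - 2
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n0 - 1) * s0**2) / df)
    d = (m1 - m0) / s_pooled                      # Cohen's d
    j = 1.0 - 3.0 / (4.0 * df - 1.0)              # small-sample correction
    var_d = (n1 + n0) / (n1 * n0) + d**2 / (2.0 * (n1 + n0))
    return j * d, j**2 * var_d


def _weighted_mean(effects, weights):
    return sum(w * g for w, g in zip(weights, effects)) / sum(weights)


def cochran_q(effects, variances):
    """Cochran's Q with inverse-variance (fixed-effect) weights."""
    weights = [1.0 / v for v in variances]
    center = _weighted_mean(effects, weights)
    return sum(w * (g - center) ** 2 for w, g in zip(weights, effects))


def decompose_q(effects, variances, programs):
    """Return (Q_tot, Q_W, Q_B) for cohorts grouped by program label."""
    groups = defaultdict(lambda: ([], []))
    for g, v, p in zip(effects, variances, programs):
        groups[p][0].append(g)
        groups[p][1].append(v)
    q_total = cochran_q(effects, variances)
    # Q_W: sum of Cochran's Q computed separately inside each program.
    q_within = sum(cochran_q(gs, vs) for gs, vs in groups.values())
    # Q_B: Cochran's Q over the program-level pooled means.
    prog_means = [_weighted_mean(gs, [1.0 / v for v in vs])
                  for gs, vs in groups.values()]
    prog_vars = [1.0 / sum(1.0 / v for v in vs) for _, vs in groups.values()]
    q_between = cochran_q(prog_means, prog_vars)
    return q_total, q_within, q_between
```

Computing Q_B directly from the program-level means, rather than as Q_tot minus Q_W, makes the additive identity something that can be checked numerically instead of being true by construction.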
Results
I-squared Decomposition: Context Explains 39% of Heterogeneity
| Signature | Q_tot | Q_W | Q_B | Q_B/Q_tot | I²_tot | I²_W | p_B |
|---|---|---|---|---|---|---|---|
| IFN-alpha | 393.5 | 145.3 | 248.3 | 0.63 | 0.93 | 0.83 | <10^-6 |
| IFN-gamma | 429.7 | 169.6 | 260.1 | 0.61 | 0.93 | 0.85 | <10^-6 |
| TNFa/NFkB | 335.9 | 155.8 | 180.1 | 0.54 | 0.91 | 0.84 | <10^-6 |
| Inflammatory | 387.1 | 225.1 | 162.0 | 0.42 | 0.93 | 0.89 | <10^-6 |
| EMT | 241.5 | 176.7 | 64.7 | 0.27 | 0.88 | 0.86 | <10^-13 |
| Hypoxia | 574.4 | 461.7 | 112.7 | 0.20 | 0.95 | 0.95 | <10^-6 |
| E2F Targets | 448.4 | 416.6 | 31.7 | 0.07 | 0.94 | 0.94 | 2.2x10^-6 |
| Mean/Median | | | | 0.39/0.42 | | | |
Across all 7 Hallmarks, 39% of total Q (median 42%) is between-program heterogeneity — context differences, not signature instability. For IFN signatures, 60-63% of what I-squared reports as "irreproducibility" is context mixing.
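The summary statistics in the table can be re-derived from its Q columns. The sketch below is a consistency check under stated assumptions, not the authors' pipeline: it assumes Q_B has K - 1 = 4 degrees of freedom (5 programs), Q_tot has 29 (30 cohorts), and Q_W has 30 - 5 = 25. The helper names are hypothetical.

```python
import math
from statistics import mean, median

# (signature, Q_tot, Q_W, Q_B) copied from the table above.
ROWS = [
    ("IFN-alpha", 393.5, 145.3, 248.3),
    ("IFN-gamma", 429.7, 169.6, 260.1),
    ("TNFa/NFkB", 335.9, 155.8, 180.1),
    ("Inflammatory", 387.1, 225.1, 162.0),
    ("EMT", 241.5, 176.7, 64.7),
    ("Hypoxia", 574.4, 461.7, 112.7),
    ("E2F Targets", 448.4, 416.6, 31.7),
]
N_COHORTS, N_PROGRAMS = 30, 5  # assumed degrees-of-freedom inputs


def chi2_sf_df4(x):
    """Upper-tail probability of chi-squared with 4 df (closed form)."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def i_squared(q, df):
    """Higgins' I-squared: share of Q in excess of its degrees of freedom."""
    return max(0.0, (q - df) / q)


fractions = [q_b / q_tot for _, q_tot, _, q_b in ROWS]
mean_fraction = round(mean(fractions), 2)
median_fraction = round(median(fractions), 2)

for name, q_tot, q_w, q_b in ROWS:
    print(
        name,
        round(q_b / q_tot, 2),
        round(i_squared(q_tot, N_COHORTS - 1), 2),
        round(i_squared(q_w, N_COHORTS - N_PROGRAMS), 2),
        f"{chi2_sf_df4(q_b):.1e}",
    )
print("mean/median Q_B/Q_tot:", mean_fraction, median_fraction)
```

The closed-form tail probability is used only because the degrees of freedom are even; a general implementation would call a chi-squared survival function from a statistics library.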
Within-Program Durability
| Signature | Within g | p | p_Bonf | I² | k | Outside g | Outside p | k |
|---|---|---|---|---|---|---|---|---|
| IFN-gamma | +1.003 | <.001 | <.001 | 0.72 | 6 | +0.177 | .245 | 24 |
| IFN-alpha | +1.189 | <.001 | <.001 | 0.88 | 6 | +0.228 | .070 | 24 |
| Hypoxia | +3.545 | .0003 | .003 | 0.95 | 6 | +0.706 | <.001 | 24 |
| EMT | +2.508 | .0002 | .002 | 0.90 | 6 | +0.457 | <.001 | 24 |
| Inflammatory | +0.746 | .007 | .064 | 0.91 | 7 | +0.092 | NS | 23 |
| TNFa/NFkB | +0.548 | .009 | .085 | 0.85 | 7 | +0.123 | NS | 23 |
| E2F Targets | +0.627 | .295 | 1.000 | 0.98 | 5 | +0.454 | .001 | 25 |
IFN-gamma: the cleanest exemplar. Within interferon: g = +1.003, I² = 0.72. Outside: g = +0.177 (NS). LOO predicts held-out direction 6/6 (100%). Under the HKSJ t-distribution with Bonferroni correction, p_Bonf = 0.003, so the result survives the most conservative inference. IFN-alpha survives HKSJ at nominal significance (p = 0.001) but not after Bonferroni (p_Bonf = 0.012), placing it one tier below.
Hypoxia and EMT: DL-Bonferroni-significant but do not survive HKSJ due to high within-program I². One cohort (GSE47533) produces g = 15.1 from near-zero variance; Winsorizing preserves the DL estimate.
E2F: Fails significance (I² = 0.98) because "proliferation" lumps tumor-vs-normal (g = +2.3) with subtype contrasts (g = -1.2).
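The contrast between DL and HKSJ inference in this section comes down to how the standard error of the pooled estimate is computed. The following is a minimal sketch under assumptions: the function names are hypothetical, the HKSJ standard error is returned for use in a t-test with k - 1 degrees of freedom (the p-value step is omitted), and the leave-one-out rule shown, taking the sign of the DL estimate from the remaining cohorts, is an assumed reading of "predicts held-out direction" rather than the documented procedure.

```python
import math


def dersimonian_laird(effects, variances):
    """DL random-effects pooling with DL and HKSJ standard errors."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * gi for wi, gi in zip(w, effects)) / sw
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, effects))
    c = sw - sum(wi**2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)            # between-cohort variance
    w_re = [1.0 / (v + tau2) for v in variances]
    sw_re = sum(w_re)
    mu = sum(wi * gi for wi, gi in zip(w_re, effects)) / sw_re
    se_dl = math.sqrt(1.0 / sw_re)
    # HKSJ: rescale the variance by the weighted residual mean square.
    q_re = sum(wi * (gi - mu) ** 2 for wi, gi in zip(w_re, effects))
    se_hksj = math.sqrt(q_re / ((k - 1) * sw_re))
    return {"mu": mu, "tau2": tau2, "se_dl": se_dl,
            "se_hksj": se_hksj, "df": k - 1}


def loo_direction_accuracy(effects, variances):
    """Share of cohorts whose sign matches the pooled sign of the rest."""
    hits = 0
    for i in range(len(effects)):
        rest_g = effects[:i] + effects[i + 1:]
        rest_v = variances[:i] + variances[i + 1:]
        pooled = dersimonian_laird(rest_g, rest_v)["mu"]
        if (pooled > 0) == (effects[i] > 0):
            hits += 1
    return hits / len(effects)
```

With only 5 to 7 cohorts per program, the HKSJ reference distribution has very few degrees of freedom, which is one reason a result can clear the DL threshold and still fail under HKSJ.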
Biological Cross-Talk
Inflammatory paradox: The Inflammatory Response and TNFa/NFkB Hallmarks produce their largest effects not in inflammation cohorts but in EMT cohorts, exceeding on-program effects. The IPF lung cohorts — where EMT and inflammatory remodeling co-occur — drive this pattern.
Program structure is real, not imposed: When program labels were permuted 10,000 times, the observed Q_B/Q_tot exceeded the 99th percentile of the permutation distribution for 3/7 Hallmarks: IFN-gamma (p = 0.003), IFN-alpha (p = 0.003), TNFa/NFkB (p = 0.003).
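A label-permutation test of this kind can be sketched as follows. This is an illustration under assumptions, not the benchmark's implementation: the function names are hypothetical, Q uses fixed-effect inverse-variance weights, and the p-value uses the common add-one convention.

```python
import random
from collections import defaultdict


def _cochran_q(effects, variances):
    w = [1.0 / v for v in variances]
    center = sum(wi * gi for wi, gi in zip(w, effects)) / sum(w)
    return sum(wi * (gi - center) ** 2 for wi, gi in zip(w, effects))


def between_fraction(effects, variances, labels):
    """Q_B / Q_tot for a given assignment of cohorts to programs."""
    q_total = _cochran_q(effects, variances)
    if q_total == 0.0:
        return 0.0
    groups = defaultdict(lambda: ([], []))
    for g, v, lab in zip(effects, variances, labels):
        groups[lab][0].append(g)
        groups[lab][1].append(v)
    q_within = sum(_cochran_q(gs, vs) for gs, vs in groups.values())
    return (q_total - q_within) / q_total


def permutation_p(effects, variances, labels, n_perm=10_000, seed=0):
    """One-sided p-value: how often shuffled labels match or beat observed."""
    rng = random.Random(seed)
    observed = between_fraction(effects, variances, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                     # keeps program sizes fixed
        if between_fraction(effects, variances, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Shuffling the label vector preserves the number of cohorts per program, so the null distribution reflects arbitrary groupings of the same sizes rather than arbitrary partitions.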
Venet null calibration: For 200 random signatures, the single-cohort false-positive rate (FPR) is 44.6%; within-program DL meta-analysis reduces it to 23.3%. Real Hallmarks show within-program |g| = 0.55-3.55 vs a null ceiling of |g| = 1.88.
Limitations
Cohort-to-program assignments were fixed before effect computation; the I² decomposition is independent of ground-truth labels. Under HKSJ, only IFN-gamma retains Bonferroni significance. Hypoxia and EMT lose significance under both HKSJ and N<20 exclusion. One cohort (GSE47533) produces g = 15.1; Winsorized results reported. All 30 cohorts are microarray. E2F requires sub-program stratification.
Conclusion
I-squared in gene signature meta-analysis is substantially inflated by context mixing: 39% of total Q (median 42%) is between-program heterogeneity. IFN-gamma survives the most conservative inference (HKSJ-Bonferroni p = 0.003) within interferon cohorts while appearing null cross-context. Reported I-squared values for gene signatures should be accompanied by program-conditioned decomposition; signatures dismissed as irreproducible by cross-context benchmarks may warrant re-evaluation within their biological context. Program labels themselves are approximations; the decomposition exposes where they break down.
References
- Venet D, et al. PLoS Comput Biol. 2011;7(10):e1002240. doi:10.1371/journal.pcbi.1002240
- Fan C, et al. N Engl J Med. 2006;355(6):560-569. doi:10.1056/NEJMoa052933
- Chibon F. Eur J Cancer. 2013;49(8):2000-2009. doi:10.1016/j.ejca.2013.02.021
- Wirapati P, et al. Breast Cancer Res. 2008;10(4):R65. doi:10.1186/bcr2124
- Subramanian A, et al. Proc Natl Acad Sci. 2005;102(43):15545-15550. doi:10.1073/pnas.0506580102
- Barbie DA, et al. Nature. 2009;462:108-112. doi:10.1038/nature08460
- Liberzon A, et al. Cell Syst. 2015;1(6):417-425. doi:10.1016/j.cels.2015.12.004
- Borenstein M, et al. Introduction to Meta-Analysis. Wiley; 2009. doi:10.1002/9780470743386
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: signature-durability-benchmark
description: Score human gene signatures against frozen real GEO cohorts to determine cross-cohort transcriptomic durability with self-verification and confounder rejection.
allowed-tools: Bash(uv *, python *, python3 *, ls *, test *, shasum *, tectonic *)
requires_python: "3.12.x"
package_manager: uv
repo_root: .
canonical_output_dir: outputs/canonical
---

# Signature Durability Benchmark

This skill scores published gene signatures against 22 frozen real GEO expression cohorts (4,730 samples, 3 microarray platforms) to determine whether each signature is durable, brittle, mixed, confounded, or insufficiently covered across independent cohorts. The full model is compared against 4 baselines with a success rule.

## Runtime Expectations

- Platform: CPU-only
- Python: 3.12.x
- Package manager: uv
- Offline after initial clone (all GEO data pre-frozen)

## Step 1: Install the Locked Environment

```bash
uv sync --frozen
```

## Step 2: Build Freeze (Validate Frozen Assets)

```bash
uv run --frozen --no-sync signature-durability-benchmark build-freeze --config config/benchmark_config.yaml --out data/freeze
```

Success condition: freeze_audit.json shows valid=true

## Step 3: Run the Canonical Benchmark

```bash
uv run --frozen --no-sync signature-durability-benchmark run --config config/benchmark_config.yaml --out outputs/canonical
```

Success condition: outputs/canonical/manifest.json exists

## Step 4: Verify the Run

```bash
uv run --frozen --no-sync signature-durability-benchmark verify --config config/benchmark_config.yaml --run-dir outputs/canonical
```

Success condition: verification status is passed

## Step 5: Confirm Required Artifacts

Required files in outputs/canonical/:

- manifest.json
- normalization_audit.json
- cohort_overlap_summary.csv
- per_cohort_effects.csv
- aggregate_durability_scores.csv
- matched_null_summary.csv
- leave_one_cohort_out.csv
- platform_holdout_summary.csv
- durability_certificate.json
- platform_transfer_certificate.json
- confounder_rejection_certificate.json
- coverage_certificate.json
- benchmark_protocol.json
- verification.json
- public_summary.md
- within_program_durability.csv
- forest_plot.png
- null_separation_plot.png
- stability_heatmap.png
- platform_transfer_panel.png

## Scope Rules

- Human bulk transcriptomic signatures only
- No live data fetching in scored path
- Frozen GEO cohorts from real public data
- Blind panel never influences thresholds
- Source leakage between signature sources and cohort sources is forbidden
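
The artifact check in Step 5 could be automated with a short script. This is a hypothetical helper, not part of the benchmark CLI: the function name `missing_artifacts` is an assumption, and the file list is copied from the required-files list in the skill file.

```python
from pathlib import Path

# File names copied from the Step 5 list in the skill file.
REQUIRED = [
    "manifest.json", "normalization_audit.json",
    "cohort_overlap_summary.csv", "per_cohort_effects.csv",
    "aggregate_durability_scores.csv", "matched_null_summary.csv",
    "leave_one_cohort_out.csv", "platform_holdout_summary.csv",
    "durability_certificate.json", "platform_transfer_certificate.json",
    "confounder_rejection_certificate.json", "coverage_certificate.json",
    "benchmark_protocol.json", "verification.json", "public_summary.md",
    "within_program_durability.csv", "forest_plot.png",
    "null_separation_plot.png", "stability_heatmap.png",
    "platform_transfer_panel.png",
]


def missing_artifacts(run_dir, required=REQUIRED):
    """Return the required file names that are absent from run_dir."""
    root = Path(run_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_artifacts("outputs/canonical")
    print("all artifacts present" if not missing else f"missing: {missing}")
```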