Program-Conditioned Reproducibility of Transcriptomic Signatures Is Underestimated by Cross-Context Benchmarks
Abstract
Gene expression signatures are routinely dismissed as irreproducible when they fail cross-context validation, but how much of that apparent irreproducibility is a measurement artifact? We decompose Cochran's Q into within-program and between-program components across 7 MSigDB Hallmark signatures scored in 30 GEO cohorts (5 biological programs). Between-program heterogeneity accounts for a mean of 39% of total Q (median 42%; all p < 10^-5), rising to 61% for interferon-γ (IFN-γ). IFN-γ exemplifies the phenomenon: within-program Hedges' g = +1.0 (HKSJ-Bonferroni p = 0.003, 100% leave-one-out (LOO) accuracy), while outside-program g = +0.18 (not significant). The benchmark, decomposition framework, and all 30 frozen cohorts are released as a public resource.
Introduction
Gene expression signatures routinely fail validation outside their discovery context [venet2011, fan2006, chibon2013, wirapati2008]. These failures are interpreted as irreproducibility. But a signature that replicates across 6 interferon-response cohorts yet shows no effect in breast cancer studies is context-specific, not broken; cross-context testing conflates the two. The standard diagnostic, I^2 [borenstein2009], quantifies total heterogeneity without distinguishing within-program instability from between-program context differences. When cohorts from unrelated programs are pooled, I^2 measures both, generating systematic false negatives. We decompose Cochran's Q into within-program (Q_W) and between-program (Q_B) components [borenstein2009] across 7 Hallmark signatures in 30 GEO cohorts organized into 5 biological programs, quantifying for the first time how much apparent heterogeneity is a measurement artifact of context mixing.
Data
Cohort panel. 30 frozen GEO cohorts (5,451 samples) across 5 programs: inflammation (k=7, N=1,976), interferon (k=6, N=1,991), proliferation (k=5, N=847), hypoxia (k=6, N=239), EMT (k=6, N=398). Platforms: 20 Affymetrix, 4 Agilent, 6 Illumina. All GEO accessions and GPL IDs are listed in the supplement. GSE3494 was removed (MKI67 median-split circularity with E2F).
Signatures. 29 signatures (22 primary, 7 blind): 7 MSigDB Hallmarks [liberzon2015]; 4 brittle; 3 mixed-program; 5 confounded (including 2 stealth); 3 insufficient-coverage; 7 blind holdouts.
Cohort Panel (30 GEO cohorts)
| Cohort | GEO | Platform | Program | N |
|---|---|---|---|---|
| Sepsis Mortality Gse65682 | GSE65682 | Affymetrix (GPL13667) | inflammation | 479 |
| Breast Prognosis Gse2034 | GSE2034 | Affymetrix (GPL96) | proliferation | 286 |
| Breast Er Gse7390 | GSE7390 | Affymetrix (GPL96) | proliferation | 198 |
| Hypoxia Cellline Gse53012 | GSE53012 | Affymetrix (GPL570) | hypoxia | 27 |
| Sepsis Blood Gse28750 | GSE28750 | Affymetrix (GPL570) | inflammation | 30 |
| Tb Blood Gse19491 | GSE19491 | Illumina (GPL6947) | interferon | 155 |
| Ipf Lung Gse47460 | GSE47460 | Agilent (GPL14550) | emt | 284 |
| Copd Lung Gse47460 | GSE47460 | Agilent (GPL6480) | inflammation | 92 |
| Influenza Pbmc Gse101702 | GSE101702 | Agilent (GPL21185) | interferon | 159 |
| Trauma Blood Gse36809 | GSE36809 | Affymetrix (GPL570) | inflammation | 857 |
| Hcc Liver Gse6764 | GSE6764 | Affymetrix (GPL570) | proliferation | 48 |
| Crohn Intestine Gse112366 | GSE112366 | Affymetrix (GPL13158) | inflammation | 388 |
| Rsv Blood Gse34205 | GSE34205 | Affymetrix (GPL570) | interferon | 73 |
| Viral Challenge Gse73072 | GSE73072 | Affymetrix (GPL14604) | interferon | 1133 |
| Breast Relapse Gse1456 | GSE1456 | Affymetrix (GPL96) | proliferation | 159 |
| Hypoxia Mcf7 Gse3188 | GSE3188 | Affymetrix (GPL570) | hypoxia | 12 |
| Hypoxia Timecourse Gse47533 | GSE47533 | Illumina (GPL6884) | hypoxia | 12 |
| Hypoxia Multicell Gse18494 | GSE18494 | Affymetrix (GPL9419) | hypoxia | 36 |
| Emt Tgfb Gse17708 | GSE17708 | Affymetrix (GPL570) | emt | 15 |
| Emt Hmle Gse24202 | GSE24202 | Affymetrix (GPL3921) | emt | 21 |
| Emt Mammary Gse43495 | GSE43495 | Illumina (GPL6883) | emt | 15 |
| Sepsis Shock Gse95233 | GSE95233 | Affymetrix (GPL570) | inflammation | 73 |
| Melioidosis Blood Gse69528 | GSE69528 | Illumina (GPL10558) | inflammation | 57 |
| Lung Tumor Gse19188 | GSE19188 | Affymetrix (GPL570) | proliferation | 156 |
| Influenza Challenge Gse68310 | GSE68310 | Illumina (GPL10558) | interferon | 282 |
| Ccrcc Kidney Gse36895 | GSE36895 | Affymetrix (GPL570) | hypoxia | 52 |
| Gbm Brain Gse4290 | GSE4290 | Affymetrix (GPL570) | hypoxia | 100 |
| Emt Arpe19 Gse12548 | GSE12548 | Affymetrix (GPL570) | emt | 15 |
| Ipf Lung Gse53845 | GSE53845 | Agilent (GPL6480) | emt | 48 |
| Influenza Severe Gse111368 | GSE111368 | Illumina (GPL10558) | interferon | 189 |
Method
Per-cohort scoring. Weighted signed mean (ssGSEA [barbie2009, subramanian2005]). Effect size: Hedges' g (small-sample-corrected Cohen's d; J = 1 - 3/[4(n_1+n_2)-9]) with Var(g) = J^2[(n_1+n_2)/(n_1 n_2) + g^2/(2(n_1+n_2))].
I^2 decomposition. For each Hallmark, the 30 per-cohort effects are partitioned by program (K=5). Q_total = Q_W + Q_B, where Q_W = \sum_{k=1}^{K} Q_k is the sum of the Cochran's Q values computed separately within each program and Q_B = Q_total - Q_W [borenstein2009]. Q_B is tested against a χ^2 distribution with K-1 degrees of freedom.
Within-program meta-analysis. DerSimonian-Laird (DL) random-effects models are fitted within the matched program (k cohorts) and outside it (30-k cohorts). Program assignments were fixed before computing effects. Bonferroni correction is applied (α = 0.05/9). The Hartung-Knapp-Sidik-Jonkman (HKSJ) t-distribution is reported as a robustness check for the primary exemplars.
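The scoring and decomposition steps above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's implementation: it assumes each cohort is reduced to a two-group contrast with summary statistics, uses inverse-variance weights for Cochran's Q, and the function names (`hedges_g`, `cochran_q`, `decompose_q`) and the toy numbers are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def hedges_g(mean1, mean2, sd1, sd2, n1, n2):
    """Return (g, Var(g)) for a two-group contrast, using the formulas in the text."""
    # Pooled SD and Cohen's d (assumption: two groups summarised by mean/SD/n).
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / pooled_sd
    j = 1 - 3 / (4 * (n1 + n2) - 9)  # small-sample correction J
    g = j * d
    var_g = j**2 * ((n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2)))
    return g, var_g


def cochran_q(effects, variances):
    """Cochran's Q = sum_i w_i (g_i - g_bar)^2 with weights w_i = 1 / Var(g_i)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * g for w, g in zip(weights, effects)) / sum(weights)
    return sum(w * (g - pooled) ** 2 for w, g in zip(weights, effects))


def decompose_q(effects, variances, programs):
    """Split Q_total into Q_W (sum of per-program Q_k) and Q_B = Q_total - Q_W."""
    q_total = cochran_q(effects, variances)
    by_program = defaultdict(lambda: ([], []))
    for g, v, program in zip(effects, variances, programs):
        by_program[program][0].append(g)
        by_program[program][1].append(v)
    q_within = sum(cochran_q(gs, vs) for gs, vs in by_program.values())
    return q_total, q_within, q_total - q_within


# Toy data (illustrative only, not from the benchmark): 6 cohorts in 2 programs.
effects = [1.0, 1.2, 0.9, 0.1, 0.2, 0.0]
variances = [0.04] * 6
programs = ["interferon"] * 3 + ["other"] * 3
q_total, q_within, q_between = decompose_q(effects, variances, programs)
print(f"Q_total={q_total:.2f}  Q_W={q_within:.2f}  Q_B={q_between:.2f}")
```

In the toy example almost all of Q_total sits in Q_B, which is the pattern the decomposition is meant to expose when cohorts from different programs are pooled.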
Results
I^2 Decomposition: Context Explains 39% of Heterogeneity

| Signature | Q_tot | Q_W | Q_B | Q_B / Q_tot | I^2_tot | I^2_W | p_B |
|---|---|---|---|---|---|---|---|
| IFN-α | 393.5 | 145.3 | 248.3 | 0.63 | 0.93 | 0.83 | <10^-6 |
| IFN-γ | 429.7 | 169.6 | 260.1 | 0.61 | 0.93 | 0.85 | <10^-6 |
| TNFα/NFκB | 335.9 | 155.8 | 180.1 | 0.54 | 0.91 | 0.84 | <10^-6 |
| Inflammatory | 387.1 | 225.1 | 162.0 | 0.42 | 0.93 | 0.89 | <10^-6 |
| EMT | 241.5 | 176.7 | 64.7 | 0.27 | 0.88 | 0.86 | <10^-13 |
| Hypoxia | 574.4 | 461.7 | 112.7 | 0.20 | 0.95 | 0.95 | <10^-6 |
| E2F Targets | 448.4 | 416.6 | 31.7 | 0.07 | 0.94 | 0.94 | 2.2x10^-6 |
| Mean / Median | | | | 0.39 / 0.42 | | | |

Table: I^2 decomposition. All p_B < 10^-5. IFN signatures: 60–63% of heterogeneity is context-driven.

Across all 7 Hallmarks, a mean of 39% of total Q (median 42%) is between-program heterogeneity, reflecting context differences rather than signature instability. For IFN signatures, 60–63% of what I^2 reports as "irreproducibility" is context mixing. The I^2 of 0.93 for IFN-γ drops to 0.85 once between-program heterogeneity is removed (I^2_W, pooled across programs), and to 0.72 within the interferon cohorts alone.
Within-Program Durability

| Signature | Within g | Within p | Within p_Bonf | Within I^2 | Within k | Outside g | Outside p | Outside k |
|---|---|---|---|---|---|---|---|---|
| IFN-γ | +1.003 | <.001 | **<.001** | 0.72 | 6 | +0.177 | .245 | 24 |
| IFN-α | +1.189 | <.001 | **<.001** | 0.88 | 6 | +0.228 | .070 | 24 |
| Hypoxia | +3.545 | .0003 | **.003** | 0.95 | 6 | +0.706 | <.001 | 24 |
| EMT | +2.508 | .0002 | **.002** | 0.90 | 6 | +0.457 | <.001 | 24 |
| Inflammatory | +0.746 | .007 | .064 | 0.91 | 7 | +0.092 | NS | 23 |
| TNFα/NFκB | +0.548 | .009 | .085 | 0.85 | 7 | +0.123 | NS | 23 |
| E2F Targets | +0.627 | .295 | 1.000 | 0.98 | 5 | +0.454 | .001 | 25 |

Table: Within- vs. outside-program DL random-effects (Hedges' g). Bold: DL-Bonferroni-significant (α = 0.0056).

**IFN-γ: the cleanest exemplar.** Within interferon: g = +1.003, I^2 = 0.72. Outside: g = +0.177 (NS). Leave-one-out (LOO) analysis predicts the held-out direction in 6/6 cohorts (100%). Under the HKSJ t-distribution with Bonferroni correction, p_Bonf = 0.003, so the effect survives the most conservative inference. IFN-α survives HKSJ at nominal significance (p = 0.001) but not after Bonferroni (p_Bonf = 0.012 > 0.0056), placing it one tier below IFN-γ.
**Hypoxia and EMT.** Both are DL-Bonferroni-significant (p_Bonf = 0.003, 0.002) with 100% LOO, but neither survives HKSJ (p_Bonf = 0.81, 0.31) because of high within-program I^2. Both depend on small cell-line cohorts; one (GSE47533) produces g = 15.1 from near-zero variance. Winsorizing preserves the DL estimate (g: 3.54 → 3.27, still significant).
**E2F.** Fails significance (p = 0.30, I^2 = 0.98): "proliferation" lumps tumor-vs-normal contrasts (g = +2.3) with subtype contrasts (g = -1.2), a genuine biological heterogeneity that requires sub-program stratification.
Biological Cross-Talk
Inflammatory paradox. The Inflammatory Response and TNFα/NFκB Hallmarks produce their largest effects not in inflammation cohorts but in EMT cohorts (|g| = 1.50 and 0.87), exceeding on-program effects (|g| = 0.78 and 0.57). The IPF lung cohorts, where EMT and inflammatory remodeling co-occur, drive this pattern. This manifests in the decomposition as moderate Q_B/Q_tot (0.42, 0.54), revealing genuine cross-talk rather than pure context specificity.
Program structure is real, not imposed. When program labels were permuted 10,000 times with group sizes preserved, the observed Q_B/Q_tot exceeded the 99th percentile of the null for 3/7 Hallmarks: IFN-γ (observed 0.60 vs. null 99th percentile 0.54, p = 0.003), IFN-α (p = 0.003), TNFα (p = 0.003). Inflammatory reaches p = 0.011. E2F, Hypoxia, and EMT do not reject the null, consistent with their high within-program I^2.
Venet null calibration. For 200 random signatures, the single-cohort false-positive rate (FPR) is 44.6% (matching Venet et al. [venet2011]); within-program DL meta-analysis reduces this to 23.3%. Real Hallmarks show within-program |g| = 0.55–3.55 vs. a null ceiling of |g| = 1.88, so the framework discriminates by effect size, not p-value alone.
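The label-permutation check described in this section can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: Q is computed with inverse-variance weights, the p-value uses the common (hits + 1) / (n_perm + 1) convention, and the names `q_between_fraction` and `permutation_p` plus the toy inputs are hypothetical.

```python
import random


def q_between_fraction(effects, variances, labels):
    """Return Q_B / Q_total for one assignment of cohorts to programs."""
    weights = [1.0 / v for v in variances]

    def q(indices):
        # Cochran's Q over the cohorts selected by `indices`.
        w_sum = sum(weights[i] for i in indices)
        mean = sum(weights[i] * effects[i] for i in indices) / w_sum
        return sum(weights[i] * (effects[i] - mean) ** 2 for i in indices)

    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, []).append(i)
    q_total = q(range(len(effects)))  # assumes effects are not all identical
    q_within = sum(q(indices) for indices in groups.values())
    return (q_total - q_within) / q_total


def permutation_p(effects, variances, labels, n_perm=10_000, seed=0):
    """One-sided permutation p-value for the observed Q_B / Q_total."""
    rng = random.Random(seed)
    observed = q_between_fraction(effects, variances, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffling labels keeps every group size fixed
        # Small tolerance so exact ties are not lost to floating-point noise.
        if q_between_fraction(effects, variances, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy data (illustrative only): 6 cohorts, 2 programs of 3 cohorts each.
effects = [1.0, 1.2, 0.9, 0.1, 0.2, 0.0]
variances = [0.04] * 6
labels = ["interferon"] * 3 + ["other"] * 3
observed, p_value = permutation_p(effects, variances, labels)
print(f"observed Q_B/Q_total = {observed:.3f}, permutation p = {p_value:.3f}")
```

With only 6 toy cohorts there are 20 distinct label assignments, so the smallest attainable p-value is about 0.1; the 30-cohort panel allows a much finer null distribution.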
Limitations
Cohort-to-program assignments were fixed before effect computation; the I^2 decomposition (see the I^2 decomposition table in Results) is independent of ground-truth labels. Under HKSJ, only IFN-γ retains Bonferroni significance; IFN-α survives at nominal p but not after correction. Hypoxia and EMT lose significance under both HKSJ and exclusion of cohorts with N < 20. One cohort (GSE47533) produces g = 15.1; Winsorized results are reported. All 30 cohorts are microarray. E2F requires sub-program stratification.
Conclusion
I^2 in gene signature meta-analysis is substantially inflated by context mixing: 39% of total Q (median 42%) is between-program heterogeneity. IFN-γ survives the most conservative inference (HKSJ-Bonferroni p = 0.003) within interferon cohorts while appearing null cross-context. Reported I^2 values for gene signatures should be accompanied by program-conditioned decomposition; signatures dismissed as irreproducible by cross-context benchmarks may warrant re-evaluation within their biological context. Program labels themselves are approximations; the decomposition exposes where they break down.
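As an illustration of the recommended reporting, the sketch below recomputes the program-conditioned summary from reported Q values. It is a minimal example under assumptions: I^2 is taken as max(0, (Q - df)/Q), with df = 30 - 1 for Q_total and df = 30 - 5 for Q_W (a choice that reproduces the reported values), and the chi-square tail uses a closed form that holds only for even degrees of freedom (K - 1 = 4 here). The function names are hypothetical.

```python
from math import exp, factorial


def i_squared(q, df):
    """I^2 = max(0, (Q - df) / Q)."""
    return max(0.0, (q - df) / q)


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form implemented for positive even df only")
    half = x / 2.0
    return exp(-half) * sum(half**i / factorial(i) for i in range(df // 2))


def report(q_total, q_within, n_cohorts, n_programs):
    """Summarise a Q decomposition the way the decomposition table does."""
    q_between = q_total - q_within
    return {
        "I2_total": i_squared(q_total, n_cohorts - 1),
        "I2_within": i_squared(q_within, n_cohorts - n_programs),
        "QB_fraction": q_between / q_total,
        "p_between": chi2_sf_even_df(q_between, n_programs - 1),
    }


# Q values taken from the decomposition table (IFN-γ and E2F Targets rows).
ifn_gamma = report(429.7, 169.6, n_cohorts=30, n_programs=5)
e2f = report(448.4, 416.6, n_cohorts=30, n_programs=5)
for name, row in [("IFN-gamma", ifn_gamma), ("E2F Targets", e2f)]:
    print(name, {key: f"{value:.3g}" for key, value in row.items()})
```

Because the inputs are the rounded Q values from the table, the recomputed E2F p-value agrees with the reported 2.2x10^-6 only approximately.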
References
- Venet D, et al. PLoS Comput Biol. 2011;7(10):e1002240. doi:10.1371/journal.pcbi.1002240
- Fan C, et al. N Engl J Med. 2006;355(6):560-569. doi:10.1056/NEJMoa052933
- Chibon F. Eur J Cancer. 2013;49(8):2000-2009. doi:10.1016/j.ejca.2013.02.021
- Wirapati P, et al. Breast Cancer Res. 2008;10(4):R65. doi:10.1186/bcr2124
- Subramanian A, et al. Proc Natl Acad Sci. 2005;102(43):15545-15550. doi:10.1073/pnas.0506580102
- Barbie DA, et al. Nature. 2009;462:108-112. doi:10.1038/nature08460
- Liberzon A, et al. Cell Syst. 2015;1(6):417-425. doi:10.1016/j.cels.2015.12.004
- Borenstein M, et al. Introduction to Meta-Analysis. Wiley; 2009. doi:10.1002/9780470743386
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: signature-durability-benchmark
description: Score human gene signatures against frozen real GEO cohorts to determine cross-cohort transcriptomic durability with self-verification and confounder rejection.
allowed-tools: Bash(uv *, python *, python3 *, ls *, test *, shasum *, tectonic *)
requires_python: "3.12.x"
package_manager: uv
repo_root: .
canonical_output_dir: outputs/canonical
---

# Signature Durability Benchmark

This skill scores published gene signatures against 22 frozen real GEO expression cohorts (4,730 samples, 3 microarray platforms) to determine whether each signature is durable, brittle, mixed, confounded, or insufficiently covered across independent cohorts. The full model is compared against 4 baselines with a success rule.

## Runtime Expectations

- Platform: CPU-only
- Python: 3.12.x
- Package manager: uv
- Offline after initial clone (all GEO data pre-frozen)

## Step 1: Install the Locked Environment

```bash
uv sync --frozen
```

## Step 2: Build Freeze (Validate Frozen Assets)

```bash
uv run --frozen --no-sync signature-durability-benchmark build-freeze --config config/benchmark_config.yaml --out data/freeze
```

Success condition: freeze_audit.json shows valid=true

## Step 3: Run the Canonical Benchmark

```bash
uv run --frozen --no-sync signature-durability-benchmark run --config config/benchmark_config.yaml --out outputs/canonical
```

Success condition: outputs/canonical/manifest.json exists

## Step 4: Verify the Run

```bash
uv run --frozen --no-sync signature-durability-benchmark verify --config config/benchmark_config.yaml --run-dir outputs/canonical
```

Success condition: verification status is passed

## Step 5: Confirm Required Artifacts

Required files in outputs/canonical/:

- manifest.json
- normalization_audit.json
- cohort_overlap_summary.csv
- per_cohort_effects.csv
- aggregate_durability_scores.csv
- matched_null_summary.csv
- leave_one_cohort_out.csv
- platform_holdout_summary.csv
- durability_certificate.json
- platform_transfer_certificate.json
- confounder_rejection_certificate.json
- coverage_certificate.json
- benchmark_protocol.json
- verification.json
- public_summary.md
- within_program_durability.csv
- forest_plot.png
- null_separation_plot.png
- stability_heatmap.png
- platform_transfer_panel.png

## Scope Rules

- Human bulk transcriptomic signatures only
- No live data fetching in scored path
- Frozen GEO cohorts from real public data
- Blind panel never influences thresholds
- Source leakage between signature sources and cohort sources is forbidden