Cross-Cohort Transfer Readiness Is Unverified in Published Oral Microbiome Studies: A Formal Audit Framework
Abstract
Oral microbiome classifiers for periodontitis routinely report high within-study discrimination yet are deployed without formal assessment of whether their training cohort's characteristics permit generalization to other populations. We formalize the transfer-readiness problem as a four-gate audit: label provenance, cross-validation identifiability, distributional shift, and model recommendation. We implement this audit as a deterministic, reproducible pipeline and apply it to the publicly recoverable EPheClass PD_s saliva backbone (722 samples, 9 cohorts). The audit retains only 2 mixed primary cohorts (102 samples), finds both materially shifted, and determines that cross-cohort tuning is underidentified. The Ridge baseline (AUPRC = 0.924) exceeds the full sparse model (0.897). An independent 3-cohort saliva periodontitis panel passes identifiability (K_mix = 3) but fails on distributional shift, while HMP oral 16S and CRC gut panels pass all gates. Threshold sweeps (12 configurations) confirm failures are structurally determined. Two separable failure modes emerge—insufficient cohort geometry and unrecoverable shift—while oral 16S data are not intrinsically non-transferable.
Introduction
Salivary microbiome signatures have been proposed as non-invasive diagnostic biomarkers for periodontitis. Studies using 16S rRNA amplicon sequencing typically report high area under the precision-recall curve (AUPRC) or receiver operating characteristic (AUROC) on internal cross-validation, yet independent replication across cohorts remains rare and informal. When classifiers trained on one cohort are evaluated on another, performance often degrades substantially, a phenomenon documented in metagenomic classifier benchmarks but largely unaddressed in oral microbiome research. The core problem is that published studies do not formally assess transfer readiness: whether the cohort structure, sample provenance, and distributional properties of a training panel are sufficient to support generalization claims. Without this assessment, positive results are uninterpretable for clinical translation. We address a single question: given a publicly recoverable oral microbiome panel, can we determine, before deployment, whether sparse cross-cohort transfer claims are justified? We formalize transfer readiness as a four-gate deterministic audit and apply it to the largest publicly available saliva periodontitis dataset. The complete audit executes deterministically from a cold-start SKILL.md on CPU-only hardware.

Method

Data Recovery and Panel Construction

We recover the EPheClass PD_s ASV abundance matrix (N = 796 samples across 10 cohorts) from public repositories. Sample-level metadata (disease labels, cohort membership, age, sex, smoking status) is reconstructed from NCBI SRA run records and BioSample annotations. Samples are retained only when: (i) the disease label is non-empty and not flagged as ambiguous; (ii) the source URL and source record ID are both non-empty. Cohorts failing auditable label recovery are excluded entirely rather than imputed.
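The retention rule can be sketched as a simple metadata filter. This is illustrative only: the field names (`disease_label`, `label_confidence`, `source_url`, `source_record_id`) are assumptions, not the pipeline's actual schema.

```python
def retain_samples(samples):
    """Keep only samples whose disease label and provenance fields are auditable.

    Field names are assumed for illustration; the real pipeline's schema may differ.
    """
    kept = []
    for s in samples:
        if not s.get("disease_label"):                         # (i) label must be non-empty
            continue
        if s.get("label_confidence") == "ambiguous_excluded":  # (i) not flagged ambiguous
            continue
        if not s.get("source_url") or not s.get("source_record_id"):  # (ii) provenance fields
            continue
        kept.append(s)
    return kept

demo = [
    {"disease_label": "periodontitis", "label_confidence": "ok",
     "source_url": "https://example", "source_record_id": "SRR0001"},
    {"disease_label": "", "label_confidence": "ok",
     "source_url": "https://example", "source_record_id": "SRR0002"},
    {"disease_label": "control", "label_confidence": "ambiguous_excluded",
     "source_url": "https://example", "source_record_id": "SRR0003"},
]
print(len(retain_samples(demo)))  # 1: only the first demo record survives
```

Cohorts, not just samples, are dropped when labels cannot be recovered, so a filter like this runs after cohort-level exclusion.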
The retained 722 samples are partitioned into four panels:

Panel composition of the canonical public-recovery bundle.

| Panel | Cohorts | Samples | Control | Periodontitis | Cohort IDs |
|---|---|---|---|---|---|
| Primary mixed | 2 | 102 | 39 | 63 | BP41, BP48 |
| Blind mixed | 2 | 189 | 55 | 134 | BP34, BP49 |
| Auxiliary (single-class) | 5 | 431 | 338 | 93 | BP35, BP36, BP39, BP40, BP44 |
| Excluded | 1 | 74 | — | — | BP43 |

Primary cohorts contain both periodontitis and control samples and serve as held-out folds for leave-one-cohort-out (LOCO) evaluation. Blind cohorts are withheld from all tuning, threshold selection, and feature distillation. Auxiliary cohorts contribute training samples but are never used as held-out scoring folds. BP43 (74 samples) is excluded because no auditable sample-level saliva label map is recoverable from public records.

Normalization

Raw ASV counts X ∈ R^{n×p} are transformed via the centered log-ratio (CLR):

    CLR(x_i) = log(x_i + ε) − (1/p) Σ_{j=1}^{p} log(x_ij + ε),

where ε = 0.5 is a pseudocount for zero replacement. This compositionally aware transformation is applied identically to training and held-out samples.

Transfer-Readiness Audit

The audit applies four deterministic gates in sequence. A panel must pass all four to receive a transfer_ready verdict.

Gate 1: Label Provenance. For each of n retained samples, the auditor requires: (a) a non-empty disease label ℓ_i ∈ {periodontitis, control}; (b) label confidence not flagged ambiguous_excluded; (c) a non-empty source URL; (d) a non-empty source record ID. Let m be the count of samples with missing provenance fields. The gate passes (auditable) iff m/n < τ_prov (τ_prov = 0.10).

Gate 2: Cross-Validation Identifiability. Define K_mix as the number of primary mixed cohorts. In each LOCO outer fold k, the remaining K_mix − 1 primary mixed cohorts form the inner tuning panel; let s_k denote the number of valid inner mixed-cohort splits.
The gate passes (reliable) iff

    K_mix ≥ τ_K  and  min_k s_k ≥ τ_s,

where τ_K = 3 and τ_s = 2. Otherwise the verdict is sparse_transfer_unreliable.

Gate 3: Distributional Shift. For each held-out cohort k with training complement T_k, compute:

- Library-size ratio: r_lib^(k) = L̄_k / L̄_{T_k}, where L̄ is the mean total count per sample.
- Nonzero-feature ratio: r_nz^(k) = F̄_k / F̄_{T_k}, where F̄ is the mean number of non-zero features.
- Prevalence gaps: for the top 10 features by |β_j| from the abundance model, count features where |prev_k(j) − prev_{T_k}(j)| ≥ 0.30.

Cohort k is flagged as shifted if r_lib^(k) ∉ [0.67, 1.50], or r_nz^(k) ∉ [0.67, 1.50], or ≥ 3 top features exceed the prevalence-gap threshold. The Euclidean centroid distance d_k = ||μ_k − μ_{T_k}||_2 in CLR space provides a supplementary diagnostic.

Gate 4: Model Recommendation. Let A_full, A_abund, and A_core denote pooled AUPRC for the elastic net, Ridge baseline, and distilled feature-core models, respectively. The recommendation rule is:

    m = none             if Gate 1 fails
        ridge_fallback   if Gate 2 fails
        full_model       if Gate 2 passes and A_full ≥ A_abund
        distilled_core   if A_core ≥ A_abund + 0.005
        ridge_fallback   otherwise

Classification Models

Full model. Elastic net logistic regression with balanced class weights, selected via leave-one-group-out CV over C ∈ {0.01, 0.1, 1, 10} and ℓ1-ratio ∈ {0.1, 0.5, 0.9}, optimizing AUPRC. The solver is SAGA, with the iteration budget doubling on non-convergence up to 8× the initial 4000 iterations.

Abundance-only baseline. ℓ2-regularized logistic regression (C = 1.0, liblinear solver) on CLR-transformed features, without elastic net sparsity.

Distilled feature core. Features are ranked by cross-fold selection frequency (≥ 0.50), sign consistency (≥ 0.80), and confounder loading.
The distilled core model is retrained on the selected feature subset, and its AUPRC drop and confounder-margin improvement are compared against rescue thresholds.

Confounder control. For each cohort, confounder loadings are estimated via η² (categorical confounders) or |ρ_s| (Spearman rank correlation, continuous confounders) between model scores and confounder values. The aggregate suppression margin is Δ = δ_disease − max(η²_batch, η²_confounders), where δ_disease is Cliff's δ between disease-positive and control scores.

Results

Audit Verdicts

Transfer-readiness audit outcome for the canonical saliva periodontitis panel.

| Audit gate | Criterion | Observed | Verdict |
|---|---|---|---|
| Label provenance | Missing fraction < 0.10 | 0/722 = 0.000 | auditable |
| CV identifiability | K_mix ≥ 3, min_k s_k ≥ 2 | K_mix = 2, min_k s_k = 1 | sparse_transfer_unreliable |
| Distributional shift | Ratios ∈ [0.67, 1.50] | Both cohorts flagged | shifted_candidate |
| Model recommendation | Per Gate-4 rule | Gate 2 failed | ridge_fallback |

All 722 retained samples pass the label provenance gate with zero missing fields. However, only 2 primary mixed cohorts survive inclusion criteria (< τ_K = 3), and each LOCO outer fold contains exactly 1 valid inner mixed-cohort split (< τ_s = 2). The panel therefore fails the CV identifiability gate.

Distributional Shift Diagnostics

Cohort-shift diagnostics for retained primary cohorts.

| Cohort | r_lib | r_nz | d_centroid | Gap count | Flag reasons |
|---|---|---|---|---|---|
| BP41 | 2.780 | 1.811 | 46.27 | 7/10 | lib-size, nz-features, prevalence |
| BP48 | 1.703 | 1.878 | 46.27 | 6/10 | lib-size, nz-features, prevalence |

Both primary cohorts exhibit library-size and nonzero-feature ratios above the 1.50 upper bound. BP41 is the more extreme case (r_lib = 2.78), while BP48 remains substantially shifted (r_lib = 1.70).
Both cohorts trigger all three shift criteria, indicating global panel shift rather than a single outlier.

Model Comparison

Leave-one-cohort-out benchmark results on the retained primary panel.

| Model | Pooled AUPRC | Macro BAcc | Blind BAcc | Conf. Margin | Core Gain |
|---|---|---|---|---|---|
| No-transfer pooled | 0.935 | 0.836 | — | — | — |
| Ridge baseline | 0.924 | 0.799 | 0.627 | — | — |
| Distilled feature core | 0.908 | 0.775 | 0.620 | 0.662 | +0.041 |
| Full elastic net | 0.897 | 0.785 | 0.627 | 0.620 | +0.041 |
| No-confounder control | 0.897 | 0.785 | — | 0.620 | — |

The Ridge baseline achieves the highest pooled AUPRC (0.924), exceeding the full elastic net by 0.027 and the distilled feature core by 0.016. The performance gap is concentrated in BP48, where the full model drops to 0.832 AUPRC versus 0.932 for the baseline. On BP41, the full model modestly outperforms (0.950 vs. 0.940), but the BP48 degradation dominates the pooled metric. Blind cohorts (BP34, BP49), withheld from all tuning, produce near-identical balanced accuracy for both models (0.627 full vs. 0.627 abundance), confirming that the sparse model recovers no advantage on unseen data.

Confounder Analysis

The aggregate confounder suppression margin is Δ = 0.620 for the full model, indicating that the disease signal (Cliff's δ) substantially exceeds the strongest confounder loading. Age is the most informative available confounder (loading 0.365 in BP41, 0.229 in BP48). Smoking, antibiotics, sequencing platform, and dentition proxy are unavailable in BP41 due to sparse metadata (< 20 non-missing values). The distilled feature core improves the margin to 0.662 (+0.042), but its AUPRC remains below the abundance baseline.

Discussion

The central finding is negative: the largest publicly available saliva periodontitis panel does not support sparse cross-cohort transfer claims under formal audit conditions.
This result is scientifically informative rather than nihilistic: it demonstrates that within-study AUPRC, even when high (> 0.89), does not imply transfer readiness. The audit framework formalizes what is otherwise left implicit in microbiome classifier papers. τ_K = 3 is not an arbitrary threshold but the structural minimum at which inner cross-validation has more than one comparison point; at K_mix = 2, "tuning" reduces to single-point estimation. Sweeping all gate thresholds across 12 configurations (τ_K ∈ {2, 3, 4, 5}, library-size bounds from [0.50, 2.00] to [0.80, 1.25]), zero produce a passing verdict: at τ_K = 2 the failure shifts from Gate 2 to Gate 3 (library-size ratios of 2.78× and 1.70× exceed every tested bound). The verdict is structurally determined, not threshold-dependent. Both models converge to 0.627 balanced accuracy on blind cohorts (spread 0.007). The honest conclusion is that no model transfers reliably from this panel; the Ridge baseline recommendation reflects the structural identifiability constraint, not empirical superiority of simpler features.

Failure taxonomy across four real panels. EPheClass fails Gate 2 (cohort count). The 3-cohort saliva panel passes Gate 2 but fails Gate 3 (distributional shift). HMP and CRC pass all gates.

| Panel | Modality | Site | K_mix | Gate 2 | Gate 3 | Verdict |
|---|---|---|---|---|---|---|
| EPheClass | 16S | saliva | 2 | FAIL | shifted | unreliable |
| 3-cohort saliva | 16S | saliva | 3 | pass | shifted | shift_blocked |
| HMP oral | 16S | oral | 3 | pass | stable | transfer_ready |
| CRC gut | metagenomic | gut | 4 | pass | stable | transfer_ready |

The 3-cohort saliva panel (USA/Sweden/South Korea, 160 samples) separates identifiability from shift: K_mix = 3 is adequate, yet the South Korean cohort's distributional shift blocks transfer.
HMP oral 16S (603 samples, 4 centers) and CRC gut (575 samples, 5 cohorts) both pass unchanged, confirming that oral 16S data are not intrinsically non-transferable.

Limitations. (1) The periodontitis panel is restricted to saliva (102 samples, 2 cohorts). (2) The HMP positive control uses a different task (saliva vs. plaque) than the periodontitis panels (disease vs. control). (3) Gate thresholds are fixed; sensitivity analysis shows the verdict is structurally determined (0/12 threshold configurations produce a pass for periodontitis). (4) Confidence intervals are not reported for AUPRC because the LOCO design with 2 folds does not support stable bootstrap estimation, which is itself evidence of the identifiability problem.

Conclusion

We present a formal, deterministic transfer-readiness audit for microbiome cohorts. The two independent saliva periodontitis panels audited here fail transfer readiness for two separable reasons: insufficient cohort geometry (EPheClass, K_mix = 2) and unrecoverable cross-cohort shift (3-cohort panel, K_mix = 3 but shifted). Conversely, the HMP oral 16S positive control passes unchanged, indicating that oral 16S data are not intrinsically non-transferable. The framework discriminates between failure modes rather than issuing a blanket verdict. Published microbiome classifier studies should assess transfer readiness before claiming cross-cohort generalization; the four-gate framework provides an executable specification for that assessment.

References

- Knight R, Vrbanac A, Taylor BC, et al. Best practices for analysing microbiomes. Nat Rev Microbiol. 2018;16(7):410–422. doi:10.1038/s41579-018-0029-9
- Huang S, Li R, Zeng X, et al. Predictive modeling of gingivitis severity and susceptibility via oral microbiome. ISME J. 2014;8(9):1768–1780. doi:10.1038/ismej.2014.32
- Segata N, Haake SK, Mannon P, et al. Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples. Genome Biol. 2012;13(6):R42. doi:10.1186/gb-2012-13-6-r42
- Debelius J, Song SJ, Vazquez-Baeza Y, et al. Tiny microbes, enormous impacts: what matters in gut microbiome studies? Genome Biol. 2016;17(1):217. doi:10.1186/s13059-016-1086-x
- Schloss PD. Identifying and overcoming threats to reproducibility, replicability, robustness, and generalizability in microbiome research. mBio. 2018;9(3):e00525-18. doi:10.1128/mBio.00525-18
- Regueira-Iglesias A, Suarez-Rodriguez B, Blanco-Pintos T, et al. The salivary microbiome as a diagnostic biomarker of periodontitis: a 16S multi-batch study before and after the removal of batch effects. Front Cell Infect Microbiol. 2024;14:1405699. doi:10.3389/fcimb.2024.1405699
- Oral Sciences Research Group. EPheClass dataset: ASV abundance matrices for periodontal and gingival classification. GitHub. https://github.com/Oral-Sciences-Research-Group/Epheclass_dataset
- NCBI. Sequence Read Archive. https://trace.ncbi.nlm.nih.gov/Traces/study/
- NCBI. BioSample database. https://www.ncbi.nlm.nih.gov/biosample
- Aitchison J. The statistical analysis of compositional data. J R Stat Soc B. 1982;44(2):139–177. doi:10.1111/j.2517-6161.1982.tb01195.x
- Gloor GB, Macklaim JM, Pawlowsky-Glahn V, Egozcue JJ. Microbiome datasets are compositional: and this is not optional. Front Microbiol. 2017;8:2224. doi:10.3389/fmicb.2017.02224
- Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc B. 2005;67(2):301–320. doi:10.1111/j.1467-9868.2005.00503.x
- Cox DR. The regression analysis of binary sequences. J R Stat Soc B. 1958;20(2):215–242. doi:10.1111/j.2517-6161.1958.tb00292.x
- Cohen J. Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educ Psychol Meas. 1973;33(1):107–112. doi:10.1177/001316447303300111
- Wirbel J, Pyl PT, Kartal E, et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat Med. 2019;25(4):679–689. doi:10.1038/s41591-019-0406-6
- Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012;486(7402):207–214. doi:10.1038/nature11234
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: oral-microbiome-transfer-auditor
description: Execute the locked, offline oral microbiome transfer-readiness auditor for saliva-based periodontitis, including public-recovery freeze building, cross-cohort evaluation, cohort-shift diagnostics, baseline recommendation, and supporting benchmark artifacts.
allowed-tools: Bash(uv *, python *, python3 *, curl *, ls *, test *, shasum *, unzip *)
requires_python: "3.12.x"
package_manager: uv
repo_root: .
canonical_output_dir: outputs/canonical
---

# Oral Microbiome Transfer Auditor

This skill executes the audit-first transfer-readiness workflow exactly as frozen by the repository contract. It does not invent cohorts, corrected inputs, unverifiable benchmark rows, or fake sample labels.

## Runtime Expectations

- Platform: CPU-only
- Python: `3.12.x`
- Package manager: `uv`
- Offline after the freeze bundle exists locally
- Canonical freeze directory: `data/benchmark/freeze`
- Paper PDF build requires `tectonic`

## Scope Rules

- Saliva only in v1
- Adult samples only when age is available
- `periodontitis` vs `control` only
- `EPheClass` `PD_s` is the canonical abundance backbone
- Canonical v1 is ASV-first
- No corrected or batch-effect-removed table in the scored path
- Blind cohorts are excluded from thresholding, feature selection, hyperparameter selection, confounder-margin tuning, and durable feature-core distillation

## Step 1: Build Or Confirm The Public-Recovery Raw Bundle

The freeze builder will create these raw assets from the public `PD_s` backbone if they are absent:

- `data/benchmark/raw/epheclass_pd_s_abundance.tsv`
- `data/benchmark/raw/recovered_metadata.tsv`
- `data/benchmark/raw/recovered_taxonomy.tsv`

The source provenance and reconstruction rules are documented in `data/refs/source_provenance.md`.
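As a quick pre-flight before building the freeze, the presence of the three raw assets can be checked with the skill's allowed `test` tool. This is a minimal sketch; missing files are expected on a cold start and are not an error.

```shell
# Report which raw assets already exist; absent ones will be created by the freeze builder.
for f in data/benchmark/raw/epheclass_pd_s_abundance.tsv \
         data/benchmark/raw/recovered_metadata.tsv \
         data/benchmark/raw/recovered_taxonomy.tsv; do
  if test -s "$f"; then
    echo "present: $f"
  else
    echo "absent (will be built): $f"
  fi
done
```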
## Step 2: Install The Locked Environment

```bash
uv sync --frozen
```

## Step 3: Build The Frozen Benchmark

```bash
uv run --frozen --no-sync oral-microbiome-benchmark build-freeze --config config/canonical_periodontitis.yaml --out data/benchmark/freeze
```

## Step 4: Run The Canonical Auditor

```bash
uv run --frozen --no-sync oral-microbiome-benchmark run --config config/canonical_periodontitis.yaml --out outputs/canonical
```

The primary outputs are the audit verdict, model recommendation, and cohort-shift diagnostics. Legacy benchmark metrics remain as supporting evidence.

## Step 5: Verify The Canonical Run

```bash
uv run --frozen --no-sync oral-microbiome-benchmark verify --config config/canonical_periodontitis.yaml --run-dir outputs/canonical
```

## Step 6: Optional Triage

Triage v1 is evaluative only and requires a labeled external cohort:

```bash
uv run --frozen --no-sync oral-microbiome-benchmark triage --config config/canonical_periodontitis.yaml --input inputs/new_cohort.tsv --metadata inputs/new_metadata.tsv --out outputs/triage
```

## Step 7: Freeze The Submission Bundle

```bash
uv run --frozen --no-sync python scripts/prepare_submission_bundle.py --config config/canonical_periodontitis.yaml --run-dir outputs/canonical
```

This snapshots the verified run into `submission/freeze/source_canonical/`, writes paper-facing tables and figures into `submission/results/`, and regenerates `paper/generated/`.

## Step 8: Build The Paper PDF

```bash
uv run --frozen --no-sync python scripts/build_paper_pdf.py --config config/canonical_periodontitis.yaml
```

If `tectonic` is missing, install it with your local package manager first and then rerun Step 8.
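A pre-flight check for the PDF build can avoid a late failure in Step 8. The install hints below are suggestions that assume Homebrew or a Rust toolchain is available; adapt them to your platform.

```shell
# Check whether tectonic is on PATH before attempting the paper build.
if command -v tectonic >/dev/null 2>&1; then
  echo "tectonic found: $(command -v tectonic)"
else
  echo "tectonic missing; try: brew install tectonic  (or: cargo install tectonic)"
fi
```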
## Optional Step 9: Clean-Room Replication

```bash
uv run --frozen --no-sync python scripts/create_mini_venv.py --force
uv run --frozen --no-sync python scripts/run_replication_check.py --profile smoke --venv-dir .venv-mini
uv run --frozen --no-sync python scripts/run_replication_check.py --profile full --venv-dir .venv-mini
```

The smoke profile uses fixture data and checks the end-to-end contract quickly. The full profile reproduces the canonical freeze, run, verify, submission bundle, paper build, and snapshot comparison from local assets only.

## How To Interpret Verdicts

- `transfer_ready`: the retained panel supports a non-baseline transfer claim.
- `baseline_only_recommended`: the panel is usable, but the safer recommendation is the abundance baseline.
- `sparse_transfer_unreliable`: the panel does not support trustworthy sparse tuning.
- `insufficient_mixed_cohorts`: too few mixed cohorts remain for canonical transfer scoring.
- `unrecoverable_labels`: label provenance fails.
- `shifted_candidate`: one or more retained primary cohorts are materially shifted.

## Canonical Success Criteria

The canonical scored path is successful only if:

- the freeze builder completes without dropping below the blind-panel requirement
- the canonical run completes successfully
- the verifier exits `0`
- all required outputs are present and nonempty
- the verifier reports `passed`
- the audit bundle contains a top-level verdict and recommended model
- if taxonomy is absent, the run still passes honestly with `signature_only` marked `unavailable_missing_taxonomy`
- the submission bundle and paper can be rebuilt from the frozen canonical snapshot without manual edits
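The verdict vocabulary above can be wired into downstream automation with a small lookup. This helper is illustrative only and is not part of the shipped CLI; the action strings paraphrase the interpretations listed above.

```python
# Map each audit verdict to the action it implies (illustrative, not the shipped CLI).
ACTIONS = {
    "transfer_ready": "non-baseline transfer claim supported",
    "baseline_only_recommended": "report the abundance baseline only",
    "sparse_transfer_unreliable": "do not trust sparse cross-cohort tuning",
    "insufficient_mixed_cohorts": "too few mixed cohorts for transfer scoring",
    "unrecoverable_labels": "halt: label provenance failed",
    "shifted_candidate": "investigate cohort shift before any transfer claim",
}

def interpret(verdict: str) -> str:
    """Return the implied action for a known verdict string; fail loudly otherwise."""
    if verdict not in ACTIONS:
        raise ValueError(f"unknown verdict: {verdict!r}")
    return ACTIONS[verdict]

print(interpret("sparse_transfer_unreliable"))  # do not trust sparse cross-cohort tuning
```

Failing loudly on unknown verdicts keeps automation honest if the auditor's vocabulary changes between versions.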