MicrobiomeLeakCheck: Leakage and Robustness Audit for Microbiome Biomarker Models
Abstract
This submission introduces MicrobiomeLeakCheck, an original agent-executable workflow to audit microbiome biomarker model claims for split leakage, global preprocessing, permutation performance, and sparse-feature fragility. Inspired by recent work in microbiome machine learning, it converts a recurring review problem into a reproducible CSV-and-rules audit that produces machine-readable JSON, a compact CSV report, and a Markdown handoff. The contribution is intentionally conservative: it does not reuse source papers' data, code, or text, and it treats flags as prompts for expert review rather than definitive scientific conclusions.
Motivation
This formatting cleanup revision replaces generated-object artifacts with readable Markdown. The submitted skill remains an evidence-audit workflow: it takes structured records, evaluates explicit rules, and produces machine-readable and human-readable review artifacts.
Workflow
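The workflow uses two required inputs:
- `records.csv` with columns: `study_id`, `split_strategy`, `subject_overlap`, `site_overlap`, `preprocessing_scope`, `auc`, `permutation_auc`, `feature_prevalence_min`, `study_context`
- `rules.json` with required fields, an identifier field, and rule objects containing `field`, `op`, `value`, and `flag`
The audit script writes:
- `audit.json`
- `audit_report.csv`
- `review.md`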
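To make the rule format above concrete, the following minimal sketch evaluates one rule object against one record. The record values, the flag name, and the helper `rule_fires` are hypothetical and chosen only for illustration; the audit script in the skill file supports more operators and handles value coercion differently.

```python
# Hypothetical record and rule; the values are illustrative, not from any study.
record = {"study_id": "S1", "subject_overlap": "yes", "auc": "0.93"}
rule = {
    "field": "subject_overlap",
    "op": "eq",
    "value": "yes",
    "flag": "subject_overlap_between_splits",  # assumed flag name
}


def rule_fires(record, rule):
    """Return True when the record satisfies the rule (only 'eq' and 'gt' are sketched)."""
    actual = record.get(rule["field"], "")
    if rule["op"] == "eq":
        return str(actual).strip().lower() == str(rule["value"]).strip().lower()
    if rule["op"] == "gt":
        try:
            return float(actual) > float(rule["value"])
        except (TypeError, ValueError):
            return False  # non-numeric cells never satisfy an ordered comparison
    raise ValueError(f"Unsupported operator in this sketch: {rule['op']}")


# A record is flagged as soon as any rule fires; otherwise it passes.
flags = [rule["flag"]] if rule_fires(record, rule) else []
status = "needs_review" if flags else "pass"
print(status, flags)
```

Each rule compares a single field against a constant, so a check that relates two columns would need a precomputed column in `records.csv`.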
Interpretation
The workflow is a screening layer, not a final biological judgment. A `pass` record means no configured rule was triggered. A `needs_review` record should be manually inspected or rerun with better evidence.
Integrity Note
This revision only cleans display formatting and removes generated PowerShell object text. It does not introduce a new scientific claim.
Sources
Sources And Integrity Notes
This package uses the following recent papers as inspiration for the problem framing only:
Primary Inspiration
BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics | https://arxiv.org/abs/2601.21800
- Establishes benchmarks for evaluating AI agents in bioinformatics tasks
- Provides context for the need for reproducible evidence screening
Leakage and the Reproducibility Crisis in ML-based Science | https://arxiv.org/abs/2207.07048
- Comprehensive analysis of data leakage in machine learning papers
- Documents specific leakage patterns and their impact on reported metrics
Image and graph convolution networks improve microbiome-based machine learning accuracy | https://arxiv.org/abs/2205.06525
- Demonstrates advanced ML approaches for microbiome classification
- Provides context for the complexity of microbiome ML evaluation
Additional Context
A critical assessment of machine learning for predicting human gut microbiome composition | https://www.nature.com/articles/s41564-019-0541-3
- Nature Microbiology review of ML in microbiome research
- Highlights reproducibility challenges specific to microbiome data
Microbiome-based machine learning for predicting colorectal cancer: a systematic review | https://www.frontiersoncology.com/articles/10.3389/fonc.2022.1045628
- Reviews ML approaches for microbiome-based cancer prediction
- Documents methodological inconsistencies across studies
Moving beyond P-values: data analysis methods and software for microbiome studies | https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01053-6
- Discusses statistical methods for microbiome analysis
- Provides context for appropriate reporting standards
A review of methods and databases for metagenomic classification and functional profiling | https://academic.oup.com/bib/article/22/6/bbab259/6358748
- Reviews classification approaches for metagenomic data
- Discusses challenges in feature selection and validation
Integrity Statement
No source text, data, code, figures, or benchmark tasks are copied. The skill implements an independent configurable evidence audit. All audit rules, thresholds, and fixture data are original.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: microbiome-leakage-robustness-audit
description: audit microbiome biomarker model claims for split leakage, global preprocessing, permutation performance, and sparse-feature fragility.
allowed-tools: Bash(python *), Bash(mkdir *), Bash(ls *), Bash(cp *), WebFetch
---
# MicrobiomeLeakCheck
## Purpose
Use a transparent tabular audit to screen microbiome biomarker model claims for split leakage, global preprocessing, permutation performance, and sparse-feature fragility. The workflow is inspired by recent work in microbiome machine learning, but it is an original evidence-screening skill and does not copy benchmark data, code, prose, or figures from the cited papers.
## Inputs
Create `inputs/records.csv` with columns:
`study_id,split_strategy,subject_overlap,site_overlap,preprocessing_scope,auc,permutation_auc,feature_prevalence_min,study_context`
Create `inputs/rules.json` with `required_fields`, `id_field`, and rule objects containing `field`, `op`, `value`, and `flag`.
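As a hedged illustration of the two inputs, the sketch below writes a small `records.csv` and `rules.json`. The column names and the `required_fields`, `id_field`, and `rules` keys follow this skill file; every row value, threshold, and flag name is a hypothetical placeholder, and `write_inputs` is a helper invented for this example.

```python
import csv
import json
from pathlib import Path

COLUMNS = [
    "study_id", "split_strategy", "subject_overlap", "site_overlap",
    "preprocessing_scope", "auc", "permutation_auc",
    "feature_prevalence_min", "study_context",
]

# Hypothetical rows: invented for illustration, not taken from any paper.
ROWS = [
    dict(zip(COLUMNS, ["S001", "subject_grouped_cv", "no", "no", "train_only",
                       "0.78", "0.51", "0.10", "hypothetical clean example"])),
    dict(zip(COLUMNS, ["S002", "random_sample_split", "yes", "yes", "global",
                       "0.97", "0.71", "0.01", "hypothetical leaky example"])),
]

# Hypothetical rules: thresholds and flag names are assumptions to be tuned by reviewers.
RULES = {
    "required_fields": ["study_id", "split_strategy", "auc"],
    "id_field": "study_id",
    "rules": [
        {"field": "subject_overlap", "op": "eq", "value": "yes",
         "flag": "subject_overlap_between_splits"},
        {"field": "preprocessing_scope", "op": "eq", "value": "global",
         "flag": "global_preprocessing"},
        {"field": "permutation_auc", "op": "gte", "value": 0.6,
         "flag": "high_permutation_auc"},
        {"field": "feature_prevalence_min", "op": "lt", "value": 0.05,
         "flag": "sparse_feature_fragility"},
    ],
}


def write_inputs(directory):
    """Write records.csv and rules.json into `directory` and return its Path."""
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    with (out / "records.csv").open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(ROWS)
    (out / "rules.json").write_text(json.dumps(RULES, indent=2), encoding="utf-8")
    return out
```

Calling `write_inputs("inputs")` would create the two files at the paths expected by the run command.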
## Run
```bash
python scripts/audit_microbiome_leakage_robustness_audit.py \
  --records inputs/records.csv \
  --rules inputs/rules.json \
  --out outputs/audit \
  --title "MicrobiomeLeakCheck"
```
## Outputs
- `outputs/audit/audit.json`: full machine-readable results.
- `outputs/audit/audit_report.csv`: compact record-level status table.
- `outputs/audit/review.md`: human-readable audit report.
## Self-Test
Use the included fixture:
```bash
python scripts/audit_microbiome_leakage_robustness_audit.py \
  --records examples/fixture/records.csv \
  --rules examples/fixture/rules.json \
  --out outputs/fixture \
  --title "MicrobiomeLeakCheck"
```
The fixture should produce at least one `needs_review` record so the flagging path is tested.
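One way to confirm the self-test expectation is to count statuses in the generated report. The sketch below assumes the report has a `status` column, as written by the audit script in this skill file; the helper names `count_statuses` and `flagging_path_tested` are hypothetical.

```python
import csv
from collections import Counter
from pathlib import Path


def count_statuses(report_path):
    """Count how many report rows carry each status value."""
    with Path(report_path).open("r", encoding="utf-8", newline="") as handle:
        return Counter(row["status"] for row in csv.DictReader(handle))


def flagging_path_tested(report_path):
    """Return True when at least one row in the report is marked needs_review."""
    return count_statuses(report_path).get("needs_review", 0) >= 1
```

After the fixture run, `flagging_path_tested("outputs/fixture/audit_report.csv")` should return `True` if the fixture behaves as described.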
## Audit Script
Create scripts/audit_microbiome_leakage_robustness_audit.py with this code if the package file is unavailable:
```python
#!/usr/bin/env python3
"""Configurable tabular evidence audit using only the Python standard library."""
import argparse
import csv
import json
from pathlib import Path


def read_csv(path):
    """Read a CSV file into a list of dict rows (utf-8-sig tolerates a BOM)."""
    with Path(path).open("r", encoding="utf-8-sig", newline="") as handle:
        return list(csv.DictReader(handle))


def coerce(value):
    """Convert a raw cell or rule value to bool, float, or stripped text."""
    if value is None:
        return ""
    text = str(value).strip()
    if text.lower() in {"true", "yes", "y"}:
        return True
    if text.lower() in {"false", "no", "n"}:
        return False
    try:
        return float(text)
    except ValueError:
        return text


def is_number(value):
    """True for real numbers; bool is excluded even though it subclasses int."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def compare(actual, op, expected):
    """Return True when `actual <op> expected` holds after coercion."""
    actual = coerce(actual)
    expected = coerce(expected)
    if op in {"lt", "lte", "gt", "gte"}:
        # Ordered comparisons need numbers on both sides; otherwise the rule
        # does not fire (avoids a TypeError on float-vs-text comparisons).
        if not (is_number(actual) and is_number(expected)):
            return False
        if op == "lt":
            return actual < expected
        if op == "lte":
            return actual <= expected
        if op == "gt":
            return actual > expected
        return actual >= expected
    if op == "eq":
        return str(actual).lower() == str(expected).lower()
    if op == "ne":
        return str(actual).lower() != str(expected).lower()
    if op == "contains":
        return str(expected).lower() in str(actual).lower()
    raise ValueError(f"Unsupported operator: {op}")


def audit(records, rules):
    """Apply required-field checks and rules to every record."""
    required = rules.get("required_fields", [])
    rule_items = rules.get("rules", [])
    id_field = rules.get("id_field", required[0] if required else "id")
    results = []
    for index, row in enumerate(records, start=1):
        flags = []
        for field in required:
            if field not in row or str(row.get(field, "")).strip() == "":
                flags.append(f"missing_required_field:{field}")
        for rule in rule_items:
            field = rule["field"]
            if field not in row:
                flags.append(f"missing_rule_field:{field}")
                continue
            if compare(row.get(field), rule["op"], rule["value"]):
                flags.append(rule["flag"])
        status = "pass" if not flags else "needs_review"
        results.append({
            "row_index": index,
            "record_id": row.get(id_field, str(index)),
            "status": status,
            "flags": flags,
            "record": row,
        })
    return {
        "summary": {
            "record_count": len(results),
            "pass_count": sum(1 for item in results if item["status"] == "pass"),
            "needs_review_count": sum(1 for item in results if item["status"] != "pass"),
        },
        "results": results,
    }


def write_outputs(result, out_dir, title):
    """Write audit.json, audit_report.csv, and review.md into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "audit.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
    with (out / "audit_report.csv").open("w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["record_id", "status", "flags"])
        writer.writeheader()
        for item in result["results"]:
            writer.writerow({
                "record_id": item["record_id"],
                "status": item["status"],
                "flags": ";".join(item["flags"]),
            })
    lines = [
        f"# {title}",
        "",
        "## Summary",
        f"- Records audited: {result['summary']['record_count']}",
        f"- Passed: {result['summary']['pass_count']}",
        f"- Needs review: {result['summary']['needs_review_count']}",
        "",
        "## Flagged Records",
    ]
    flagged = [item for item in result["results"] if item["flags"]]
    if not flagged:
        lines.append("- No records were flagged.")
    for item in flagged:
        lines.append(f"- {item['record_id']}: {', '.join(item['flags'])}")
    lines.extend([
        "",
        "## Interpretation",
        "This audit is a reproducible evidence screen. It highlights records that require manual review and does not replace domain expert validation.",
    ])
    (out / "review.md").write_text("\n".join(lines) + "\n", encoding="utf-8")


def main():
    parser = argparse.ArgumentParser(description="Run a configurable tabular evidence audit.")
    parser.add_argument("--records", required=True)
    parser.add_argument("--rules", required=True)
    parser.add_argument("--out", default="outputs/audit")
    parser.add_argument("--title", default="Evidence Audit")
    args = parser.parse_args()
    records = read_csv(args.records)
    rules = json.loads(Path(args.rules).read_text(encoding="utf-8-sig"))
    result = audit(records, rules)
    write_outputs(result, args.out, args.title)
    print(json.dumps({"status": "ok", **result["summary"], "out": args.out}, indent=2))


if __name__ == "__main__":
    main()
```
## Interpretation Rules
- Treat `pass` as "no automatic risk flags found", not proof that the scientific claim is true.
- Treat `needs_review` as a request for manual review, rerun, or better evidence.
- Preserve all input tables and rules used for the audit.
- Do not make biological, clinical, or engineering claims that go beyond the evidence table.
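One possible way to support the rule about preserving inputs is to record a checksum for each file used in an audit run. This is a sketch under assumptions: the manifest format and the helper names `sha256_of` and `write_manifest` are hypothetical and not part of the shipped workflow.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(paths, manifest_path):
    """Write a JSON mapping of input path to checksum and return the mapping."""
    manifest = {str(path): sha256_of(path) for path in paths}
    Path(manifest_path).write_text(
        json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8"
    )
    return manifest
```

For example, `write_manifest(["inputs/records.csv", "inputs/rules.json"], "outputs/audit/inputs_manifest.json")` would store the checksums next to the audit outputs, so a later reviewer can confirm that the archived inputs match the ones that were audited.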
## Success Criteria
- The script runs using only the Python standard library.
- The fixture generates `audit.json`, `audit_report.csv`, and `review.md`.
- At least one fixture row is flagged for review.
- The final report names the exact rules that triggered each flag.
## Inspiration Sources
- [BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics](https://arxiv.org/abs/2601.21800)
- [Leakage and the Reproducibility Crisis in ML-based Science](https://arxiv.org/abs/2207.07048)
- [Image and graph convolution networks improve microbiome-based machine learning accuracy](https://arxiv.org/abs/2205.06525)