MarkerLens: Evidence-Grounded Review of Single-Cell Cluster Annotations
Abstract
Recent preprints on single-cell reasoning emphasize that language-model outputs in biology need direct evidence grounding rather than free-form label generation. Inspired by scPilot, SC-Arena, and SciHorizon-GENE, this submission introduces MarkerLens, an original agent-executable workflow for auditing proposed single-cell cluster annotations against marker-gene evidence. The workflow takes a marker table, proposed annotations, and an explicit marker lexicon, then produces evidence_audit.json, cluster_report.csv, and review.md. It is intentionally conservative: it does not claim to solve cell-type annotation, but it gives an agent a reproducible way to identify supported, ambiguous, and unsupported labels before biological interpretation.
1. Motivation
Single-cell RNA-seq analysis increasingly uses LLMs and agentic workflows for annotation, captioning, and biological interpretation. Recent work has shown that useful systems must ground their reasoning in data, marker knowledge, ontologies, or external evidence rather than relying on brittle string matching or unconstrained model confidence. This creates a practical need for small, reproducible audit layers that can sit between proposed labels and downstream biological claims.
The central question this skill addresses is narrow: given cluster marker genes and proposed cell-type labels, can an agent check whether the labels are supported by explicit marker evidence and identify cases that need manual review?
2. Inspiration From Recent Preprints
The design is inspired by three recent preprints, but does not copy their datasets, tasks, code, or writing.
scPilot frames single-cell analysis as step-by-step omics-native reasoning, where an LLM must inspect data and revise with evidence. SC-Arena highlights the need for natural-language single-cell evaluation with knowledge-augmented, biologically interpretable judgments. SciHorizon-GENE emphasizes gene-centric reasoning failure modes, including hallucination, incomplete answers, and weak functional grounding.
MarkerLens combines these ideas into a smaller executable unit: an evidence audit for cell-type annotations. Instead of building a full benchmark, it asks whether each cluster label has marker support, whether other cell types have stronger support, and whether the result should be flagged for review.
3. Workflow
The skill expects three CSV files:
- markers.csv: cluster marker genes with optional scores.
- annotations.csv: proposed cell-type labels for each cluster.
- marker_lexicon.csv: an explicit marker knowledge table mapping cell types to genes.
A standard-library Python script normalizes gene symbols, groups markers by cluster, scores overlap with the proposed label, computes candidate cell-type support, and writes three outputs:
- evidence_audit.json: full machine-readable evidence and risk flags.
- cluster_report.csv: compact cluster-level summary.
- review.md: human-readable interpretation report.
The included fixture uses common immune-cell markers and intentionally includes one ambiguous mixed cluster. This verifies that the workflow can distinguish supported annotations from labels that need review without requiring access to a full single-cell object or external package.
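The overlap-scoring step at the heart of the workflow can be sketched in a few lines: each cluster marker contributes its strength times the lexicon weight to every cell type that lists it. The function and data below are an illustrative simplification, not the script's exact API:

```python
from collections import defaultdict

def candidate_scores(cluster_markers, lexicon):
    """cluster_markers: {gene: strength}; lexicon: {gene: {cell_type: weight}}."""
    scores = defaultdict(float)
    for gene, strength in cluster_markers.items():
        # Every cell type that lists this gene accrues strength * weight.
        for cell_type, weight in lexicon.get(gene, {}).items():
            scores[cell_type] += strength * weight
    return dict(scores)

# Toy cluster: strong T-cell markers plus a weak B-cell marker.
markers = {"CD3D": 2.0, "CD3E": 1.5, "MS4A1": 0.5}
lexicon = {"CD3D": {"t cell": 1.0}, "CD3E": {"t cell": 1.0}, "MS4A1": {"b cell": 1.0}}
print(candidate_scores(markers, lexicon))  # {'t cell': 3.5, 'b cell': 0.5}
```

The full script adds symbol normalization, per-gene deduplication, and the flag rules of Section 4 on top of this core sum.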
4. Evidence Rules
The workflow uses transparent heuristic rules:
- A proposed label should have at least two supporting marker overlaps.
- A label with zero marker overlap is flagged.
- If another cell type has stronger marker support, the proposed label is flagged.
- If another cell type has at least 75% of the proposed label's support, the cluster is flagged as ambiguous.
- If the proposed label is absent from the lexicon, the label is flagged.
These rules are not intended as universal biological truth. They are designed to force explicit evidence reporting and reduce silent acceptance of weak annotations.
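The rules above can be applied to pre-computed support scores; the helper below mirrors the flag logic in the skill's script, and the numeric inputs in the examples are invented for illustration:

```python
def flag_label(proposed_score, best_other_score, n_support, in_lexicon=True):
    """Apply the audit's heuristic flag rules to pre-computed scores."""
    flags = []
    if not in_lexicon:
        flags.append("proposed_label_not_in_marker_lexicon")
    if n_support < 2:
        flags.append("fewer_than_two_supporting_markers")
    if proposed_score == 0:
        flags.append("no_supporting_marker_overlap")
    if best_other_score > proposed_score:
        flags.append("alternative_cell_type_has_stronger_marker_support")
    elif best_other_score > 0 and proposed_score > 0 and best_other_score / proposed_score >= 0.75:
        flags.append("ambiguous_marker_support")
    return flags

# Clear support: proposed score 6.2 vs best alternative 1.0 -> no flags.
print(flag_label(6.2, 1.0, n_support=4))  # []
# Mixed cluster: alternative at 80% of the proposed label's support -> ambiguous.
print(flag_label(2.5, 2.0, n_support=2))  # ['ambiguous_marker_support']
```

Any non-empty flag list downgrades the cluster from "supported" to "needs_review".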
5. Reproducibility
The skill is executable with only Python's standard library. The fixture should mark five clusters as supported and one mixed-marker cluster as needing review. The same workflow can be reused with project-specific marker lexicons for PBMCs, tumor microenvironments, organ atlases, or developmental datasets.
Because the marker lexicon is an input, the workflow is auditable: users can inspect exactly which biological knowledge source was used. This helps avoid hidden reliance on an LLM's internal memory.
6. Limitations
Marker overlap is only one part of cell-type annotation. This workflow does not replace expert curation, ontology mapping, batch-effect review, doublet detection, ambient RNA checks, cell-state modeling, trajectory inference, or experimental validation. It also does not use the original benchmark data from the motivating papers. Its scope is deliberately narrow: evidence-grounded review of proposed cluster labels.
Future versions could add ontology normalization, tissue-specific lexicons, negative markers, reference atlas comparison, and optional LLM-generated explanations constrained by the JSON evidence.
7. Conclusion
MarkerLens packages a common single-cell quality-control step as an agent-ready skill. It is executable, reproducible, evidence-grounded, and conservative in its claims. By turning marker support and ambiguity into structured outputs, it helps agents and researchers review cell-type annotations before making biological interpretations.
References
- scPilot: Large Language Model Reasoning Toward Automated Single-Cell Analysis and Discovery. arXiv:2602.11609. https://arxiv.org/abs/2602.11609
- SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation. arXiv:2602.23199. https://arxiv.org/abs/2602.23199
- SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding. arXiv:2601.12805. https://arxiv.org/abs/2601.12805
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: single-cell-marker-evidence-audit
description: Audit single-cell cluster annotations against marker-gene evidence and produce evidence-grounded review artifacts.
allowed-tools: Bash(python *), Bash(mkdir *), Bash(ls *), Bash(cp *), WebFetch
---
# Single-Cell Marker Evidence Audit
## Purpose
Audit proposed single-cell RNA-seq cluster annotations against marker-gene evidence before accepting them as biological labels. The skill is inspired by recent work on omics-native LLM reasoning, knowledge-augmented single-cell evaluation, and gene-centric biological reasoning, but it implements an original lightweight evidence audit.
The workflow produces:
- `evidence_audit.json`: machine-readable audit with support, conflicts, and candidate cell types.
- `cluster_report.csv`: one-row-per-cluster evidence summary.
- `review.md`: readable handoff report for a human or downstream agent.
## Inputs
Create an `inputs/` directory with:
- `markers.csv`: required. Cluster marker table with columns `cluster`, `gene`, and optional `score`, `logfc`, `p_adj`.
- `annotations.csv`: required. Proposed labels with columns `cluster`, `proposed_cell_type`, and optional `note`.
- `marker_lexicon.csv`: required. Marker knowledge table with columns `cell_type`, `gene`, optional `weight`, and optional `source`.
- `metadata.md`: optional. Tissue, species, dataset source, preprocessing notes, and biological question.
Gene symbols are matched case-insensitively after trimming whitespace.
## Step 1: Create The Audit Script
Create `scripts/audit_marker_evidence.py` with this code if it is not already present:
```python
#!/usr/bin/env python3
import argparse
import csv
import json
from collections import defaultdict
from pathlib import Path


def read_csv(path):
    with Path(path).open("r", encoding="utf-8-sig", newline="") as handle:
        return list(csv.DictReader(handle))


def norm_gene(gene):
    return (gene or "").strip().upper()


def norm_label(label):
    return " ".join((label or "").strip().lower().replace("_", " ").split())


def to_float(value, default=1.0):
    try:
        if value is None or value == "":
            return default
        return float(value)
    except ValueError:
        return default


def marker_strength(row):
    for key in ["score", "avg_log2FC", "logfc", "log_fold_change"]:
        if key in row and row[key] not in (None, ""):
            return max(to_float(row[key]), 0.0)
    return 1.0


def load_lexicon(rows):
    by_type = defaultdict(dict)
    gene_to_types = defaultdict(dict)
    for row in rows:
        cell_type = norm_label(row.get("cell_type"))
        gene = norm_gene(row.get("gene"))
        if not cell_type or not gene:
            continue
        weight = max(to_float(row.get("weight"), 1.0), 0.1)
        source = row.get("source", "")
        by_type[cell_type][gene] = {"weight": weight, "source": source}
        gene_to_types[gene][cell_type] = weight
    return by_type, gene_to_types


def score_cluster(marker_rows, proposed, by_type, gene_to_types):
    markers = {}
    for row in marker_rows:
        gene = norm_gene(row.get("gene"))
        if gene:
            markers[gene] = max(markers.get(gene, 0.0), marker_strength(row))
    proposed_norm = norm_label(proposed)
    support = []
    conflicts = []
    candidate_scores = defaultdict(float)
    for gene, strength in markers.items():
        for cell_type, weight in gene_to_types.get(gene, {}).items():
            contribution = strength * weight
            candidate_scores[cell_type] += contribution
            if cell_type == proposed_norm:
                support.append({"gene": gene, "score": round(contribution, 4)})
            else:
                conflicts.append({"gene": gene, "cell_type": cell_type, "score": round(contribution, 4)})
    support_score = sum(item["score"] for item in support)
    best_candidates = sorted(candidate_scores.items(), key=lambda item: item[1], reverse=True)[:5]
    best_other = next((item for item in best_candidates if item[0] != proposed_norm), None)
    proposed_score = candidate_scores.get(proposed_norm, 0.0)
    best_other_score = best_other[1] if best_other else 0.0
    flags = []
    if proposed_norm not in by_type:
        flags.append("proposed_label_not_in_marker_lexicon")
    if len(support) < 2:
        flags.append("fewer_than_two_supporting_markers")
    if proposed_score == 0:
        flags.append("no_supporting_marker_overlap")
    if best_other_score > proposed_score:
        flags.append("alternative_cell_type_has_stronger_marker_support")
    elif best_other_score > 0 and proposed_score > 0 and best_other_score / proposed_score >= 0.75:
        flags.append("ambiguous_marker_support")
    status = "supported" if not flags else "needs_review"
    return {
        "proposed_cell_type": proposed,
        "support_score": round(support_score, 4),
        "supporting_markers": sorted(support, key=lambda item: item["score"], reverse=True),
        "conflicting_markers": sorted(conflicts, key=lambda item: item["score"], reverse=True)[:10],
        "candidate_cell_types": [{"cell_type": cell_type, "score": round(score, 4)} for cell_type, score in best_candidates],
        "flags": flags,
        "status": status,
    }


def audit(markers_path, annotations_path, lexicon_path):
    markers = read_csv(markers_path)
    annotations = read_csv(annotations_path)
    lexicon = read_csv(lexicon_path)
    by_type, gene_to_types = load_lexicon(lexicon)
    markers_by_cluster = defaultdict(list)
    for row in markers:
        cluster = str(row.get("cluster", "")).strip()
        if cluster:
            markers_by_cluster[cluster].append(row)
    results = {}
    for row in annotations:
        cluster = str(row.get("cluster", "")).strip()
        proposed = row.get("proposed_cell_type", "")
        if cluster:
            results[cluster] = score_cluster(markers_by_cluster.get(cluster, []), proposed, by_type, gene_to_types)
    return {
        "clusters": results,
        "summary": {
            "cluster_count": len(results),
            "supported_count": sum(1 for item in results.values() if item["status"] == "supported"),
            "needs_review_count": sum(1 for item in results.values() if item["status"] != "supported"),
            "lexicon_cell_type_count": len(by_type),
        },
    }


def write_report(result, out_dir):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "evidence_audit.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
    with (out / "cluster_report.csv").open("w", encoding="utf-8", newline="") as handle:
        fields = ["cluster", "proposed_cell_type", "status", "support_score", "top_candidate", "flags", "supporting_markers"]
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        for cluster, item in sorted(result["clusters"].items(), key=lambda pair: pair[0]):
            top = item["candidate_cell_types"][0]["cell_type"] if item["candidate_cell_types"] else ""
            writer.writerow({
                "cluster": cluster,
                "proposed_cell_type": item["proposed_cell_type"],
                "status": item["status"],
                "support_score": item["support_score"],
                "top_candidate": top,
                "flags": ";".join(item["flags"]),
                "supporting_markers": ";".join(marker["gene"] for marker in item["supporting_markers"]),
            })
    lines = [
        "# Single-Cell Marker Evidence Audit",
        "",
        "## Summary",
        f"- Clusters audited: {result['summary']['cluster_count']}",
        f"- Supported annotations: {result['summary']['supported_count']}",
        f"- Needs review: {result['summary']['needs_review_count']}",
        f"- Marker lexicon cell types: {result['summary']['lexicon_cell_type_count']}",
        "",
        "## Cluster Findings",
    ]
    for cluster, item in sorted(result["clusters"].items(), key=lambda pair: pair[0]):
        lines.extend([
            f"### Cluster {cluster}: {item['proposed_cell_type']}",
            f"- Status: {item['status']}",
            f"- Support score: {item['support_score']}",
            f"- Flags: {', '.join(item['flags']) if item['flags'] else 'none'}",
            f"- Top candidates: {', '.join(c['cell_type'] + '=' + str(c['score']) for c in item['candidate_cell_types'][:3])}",
            f"- Supporting markers: {', '.join(m['gene'] for m in item['supporting_markers'][:8]) or 'none'}",
            "",
        ])
    lines.extend([
        "## Interpretation",
        "This audit checks marker evidence consistency. It does not replace expert annotation, ontology mapping, batch-effect review, doublet detection, or experimental validation.",
    ])
    (out / "review.md").write_text("\n".join(lines) + "\n", encoding="utf-8")


def main():
    parser = argparse.ArgumentParser(description="Audit single-cell cluster annotations using marker evidence.")
    parser.add_argument("--markers", required=True)
    parser.add_argument("--annotations", required=True)
    parser.add_argument("--lexicon", required=True)
    parser.add_argument("--out", default="outputs/single_cell_marker_audit")
    args = parser.parse_args()
    result = audit(args.markers, args.annotations, args.lexicon)
    write_report(result, args.out)
    print(json.dumps({"status": "ok", **result["summary"], "out": args.out}, indent=2))


if __name__ == "__main__":
    main()
```
## Step 2: Run The Audit
```bash
python scripts/audit_marker_evidence.py \
--markers inputs/markers.csv \
--annotations inputs/annotations.csv \
--lexicon inputs/marker_lexicon.csv \
--out outputs/single_cell_marker_audit
```
## Step 3: Inspect Outputs
Open:
- `outputs/single_cell_marker_audit/evidence_audit.json`
- `outputs/single_cell_marker_audit/cluster_report.csv`
- `outputs/single_cell_marker_audit/review.md`
The final report must identify:
- Which proposed cluster labels are supported by marker evidence.
- Which labels need review.
- Which alternative cell types have stronger or similar support.
- Which marker conflicts or ambiguities explain the flags.
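A downstream agent can extract the flagged clusters from `evidence_audit.json` with a few lines. The inline `audit` dict below is an illustrative stand-in for the real file contents:

```python
import json

def flagged_clusters(audit):
    """Return (cluster, flags) pairs for every cluster not marked 'supported'."""
    return [(cluster, item["flags"])
            for cluster, item in sorted(audit["clusters"].items())
            if item["status"] != "supported"]

# Illustrative stand-in; in practice load it with:
#   audit = json.loads(Path("outputs/single_cell_marker_audit/evidence_audit.json").read_text())
audit = {"clusters": {
    "1": {"status": "supported", "flags": []},
    "5": {"status": "needs_review", "flags": ["ambiguous_marker_support"]},
}}
print(flagged_clusters(audit))  # [('5', ['ambiguous_marker_support'])]
```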
## Self-Test Fixture
If no dataset is available, create a small immune-cell fixture:
```bash
mkdir -p inputs outputs
cat > inputs/marker_lexicon.csv <<'CSV'
cell_type,gene,weight,source
t cell,CD3D,1.0,fixture
t cell,CD3E,1.0,fixture
t cell,TRAC,1.0,fixture
t cell,IL7R,0.8,fixture
b cell,MS4A1,1.0,fixture
b cell,CD79A,1.0,fixture
b cell,CD74,0.8,fixture
nk cell,NKG7,1.0,fixture
nk cell,GNLY,1.0,fixture
nk cell,PRF1,1.0,fixture
monocyte,LST1,1.0,fixture
monocyte,S100A8,0.9,fixture
monocyte,S100A9,0.9,fixture
monocyte,MS4A7,0.8,fixture
platelet,PPBP,1.0,fixture
platelet,PF4,1.0,fixture
platelet,GP9,0.8,fixture
CSV
cat > inputs/markers.csv <<'CSV'
cluster,gene,score
0,LST1,2.2
0,S100A8,1.8
0,S100A9,1.7
0,MS4A7,1.1
1,CD3D,2.4
1,CD3E,2.0
1,TRAC,1.8
1,IL7R,1.2
2,MS4A1,2.5
2,CD79A,2.0
2,CD74,1.0
3,NKG7,2.3
3,GNLY,2.0
3,PRF1,1.4
4,PPBP,2.1
4,PF4,1.8
4,GP9,1.1
5,CD3D,1.2
5,MS4A1,1.4
5,CD79A,1.3
CSV
cat > inputs/annotations.csv <<'CSV'
cluster,proposed_cell_type,note
0,monocyte,fixture
1,t cell,fixture
2,b cell,fixture
3,nk cell,fixture
4,platelet,fixture
5,t cell,intentionally ambiguous mixed markers
CSV
python scripts/audit_marker_evidence.py \
--markers inputs/markers.csv \
--annotations inputs/annotations.csv \
--lexicon inputs/marker_lexicon.csv \
--out outputs/single_cell_marker_audit
```
The fixture should mark clusters 0-4 as supported and cluster 5 as needing review.
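That expectation can be checked mechanically by reading `cluster_report.csv` and comparing statuses. The two-row CSV string below is a trimmed illustration of the report format, not the full fixture output:

```python
import csv
import io

def statuses(report_csv_text):
    """Map cluster id -> status from cluster_report.csv content."""
    return {row["cluster"]: row["status"]
            for row in csv.DictReader(io.StringIO(report_csv_text))}

# In practice, read outputs/single_cell_marker_audit/cluster_report.csv; trimmed sample:
sample = "cluster,proposed_cell_type,status\n0,monocyte,supported\n5,t cell,needs_review\n"
print(statuses(sample))  # {'0': 'supported', '5': 'needs_review'}
```

On the real fixture, clusters "0" through "4" should map to "supported" and cluster "5" to "needs_review".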
## Success Criteria
The skill succeeds when:
- The audit script runs using only the Python standard library.
- All proposed cluster annotations are evaluated against an explicit marker lexicon.
- Outputs include JSON, CSV, and Markdown reports.
- Ambiguous or unsupported labels are flagged rather than silently accepted.
- The report states that marker overlap is evidence for review, not final biological truth.
## Research Integrity Notes
This skill does not copy benchmark data, code, or text from the inspiration papers. It cites them as motivation and implements an independent marker-evidence audit.
## Inspiration Sources
- scPilot: https://arxiv.org/abs/2602.11609
- SC-Arena: https://arxiv.org/abs/2602.23199
- SciHorizon-GENE: https://arxiv.org/abs/2601.12805