LitPath: An Executable Skill for Literature-Driven Target Discovery and Pathway Evidence Synthesis
Abstract
Biological literature synthesis for therapeutic target identification remains a manual, time-consuming process with limited reproducibility. Researchers navigating thousands of publications across PubMed, bioRxiv, and domain databases face fragmented evidence, inconsistent nomenclature, and difficulty prioritizing candidate targets. We present LitPath, an executable agent skill that transforms literature review into a structured, reproducible workflow. Given a disease, gene, pathway, or biological theme, LitPath systematically retrieves literature, extracts and organizes mechanistic evidence, ranks candidate targets by confidence, and produces a structured synthesis report with supporting citations, conflicting evidence flags, and suggested downstream analyses. The skill operates through a seven-stage pipeline: query interpretation, entity expansion, literature retrieval, evidence extraction, pathway-mechanism organization, confidence labeling, and structured report generation. Each stage is defined by explicit inputs, outputs, and validation criteria, enabling evaluation of the skill independently of the underlying language model. We describe three representative use cases — pathway analysis in pancreatic cancer, ferroptosis evidence synthesis in glioblastoma, and comparative gene prioritization in inflammatory bowel disease — and propose a multi-axis evaluation framework covering completeness, factual consistency, biological relevance, and reproducibility. LitPath is designed as a reusable, evaluable skill for bioinformatics agents rather than a monolithic application, and is intended to support research prioritization, not clinical decision-making.
1. Introduction
The volume of biomedical literature has grown beyond what any individual researcher can synthesize manually. PubMed indexes over 36 million citations, with tens of thousands added monthly. For a researcher investigating therapeutic targets in a specific disease or interrogating a biological pathway, the relevant evidence is scattered across primary research articles, review papers, preprints, and curated databases. The result is a persistent bottleneck: even experienced bioinformaticians spend weeks compiling literature evidence for target nomination, pathway delineation, or hypothesis generation, and the resulting syntheses are difficult to reproduce or compare across studies.
Several factors compound this problem. Biological nomenclature is ambiguous — a single gene may have dozens of aliases, and pathway definitions vary across databases. Evidence quality is heterogeneous, ranging from well-powered clinical studies to single-cell-line observations. Conventional literature search tools retrieve documents but do not organize their claims by mechanism, evidence strength, or consistency with other findings. Large language models can summarize text but lack structured workflows for systematic retrieval, deduplication, and confidence assessment.
The emergence of agent-based computing platforms offers a different approach. Rather than treating literature synthesis as an unstructured conversation, it can be decomposed into discrete, auditable steps: query interpretation, entity resolution, retrieval, extraction, organization, ranking, and reporting. Each step has defined inputs and outputs. The overall workflow is executable, meaning it can be run, re-run, and evaluated on the same or different inputs. This aligns with the Claw4S conference's emphasis on skills — reusable, evaluable agent workflows — over static publications.
We present LitPath, a literature-driven bioinformatics skill for target discovery and pathway evidence synthesis. LitPath is not a search engine or a chatbot. It is a structured, multi-stage pipeline that takes a biological query as input and produces a ranked, cited, and confidence-labeled evidence synthesis as output. The skill is designed to be model-agnostic at the workflow level (the pipeline structure is fixed) while allowing language model substitution at the extraction and synthesis stages. This separation makes the skill itself the primary object of evaluation and reuse, independent of any specific model.
2. Skill Definition and Problem Formulation
2.1 Input Specification
LitPath accepts a structured query with the following fields:
| Field | Required | Description |
|---|---|---|
| primary_entity | Yes | Disease name, gene symbol, pathway name, or biological process |
| species | Yes | Species or taxonomic constraint (e.g., Homo sapiens) |
| focus | No | One of: mechanism, biomarkers, druggability, genetic_association, general |
| secondary_entities | No | Additional genes, pathways, or phenotypes to constrain the search |
| time_range | No | Publication year range for literature retrieval |
| max_papers | No | Upper bound on papers to process (default: 100) |
| output_format | No | Report structure variant (default: full) |
A minimal valid query consists of a primary entity and species. For example: {"primary_entity": "pancreatic ductal adenocarcinoma", "species": "Homo sapiens", "focus": "mechanism"}.
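This input contract can be sketched as a small validation helper. The field names, defaults, and allowed focus values follow the specification above; the `validate_query` function itself is illustrative, not part of a published API.

```python
# Illustrative sketch of the LitPath input contract (Section 2.1).
# Field names and defaults follow the specification; the helper is hypothetical.

REQUIRED_FIELDS = {"primary_entity", "species"}
OPTIONAL_DEFAULTS = {
    "focus": "general",
    "secondary_entities": [],
    "time_range": None,
    "max_papers": 100,       # default per the specification
    "output_format": "full", # default per the specification
}
VALID_FOCI = {"mechanism", "biomarkers", "druggability",
              "genetic_association", "general"}

def validate_query(query: dict) -> dict:
    """Check required fields, fill defaults, and reject malformed input."""
    missing = REQUIRED_FIELDS - query.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    normalized = {**OPTIONAL_DEFAULTS, **query}
    if normalized["focus"] not in VALID_FOCI:
        raise ValueError(f"unknown focus: {normalized['focus']}")
    return normalized

q = validate_query({"primary_entity": "pancreatic ductal adenocarcinoma",
                    "species": "Homo sapiens", "focus": "mechanism"})
```

A minimal query is thus accepted as-is, with optional fields filled from defaults.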
2.2 Output Specification
The skill produces a structured report containing the following components:
- Query Interpretation Summary: The canonical form of the input, including resolved synonyms, MeSH terms, and ontology identifiers (e.g., DOID, GO, Reactome IDs).
- Candidate Targets: A ranked list of genes or proteins, each annotated with:
- Mechanistic role (driver, suppressor, modifier, marker)
- Evidence count and type (genetic, pharmacological, expression-based, clinical)
- Confidence label (strong, moderate, weak, speculative)
- Key supporting citations (PMID or DOI)
- Known conflicts or contradictory evidence
- Pathway Summary: Organized by pathway, listing member genes, direction of dysregulation, and supporting evidence.
- Evidence Matrix: A structured table mapping entities to evidence types and sources.
- Open Questions: Biological ambiguities, underexplored mechanisms, and gaps in the literature identified during synthesis.
- Suggested Analyses: Concrete downstream computational or experimental analyses (e.g., "perform TCGA expression analysis of top 10 candidates," "run GSEA on KRAS-pathway genes using identified expression signatures").
- Metadata: Retrieval parameters, literature coverage statistics, processing timestamp, and model version used.
2.3 Why This Is a Skill Task
Literature-driven target synthesis is well-suited to the skill paradigm for three reasons. First, it has clear boundaries: a defined input (a biological query), a defined output (a structured evidence report), and measurable quality criteria (citation accuracy, completeness, consistency). Second, it requires multi-step orchestration across heterogeneous tools — entity resolution via NCBI or UniProt, retrieval via PubMed or Semantic Scholar, pathway lookup via Reactome or KEGG — which is precisely the capability that agent workflows provide. Third, the task is recurrent: researchers across institutions perform essentially the same type of synthesis repeatedly, with different queries but the same structural requirements. A reusable skill captures this common structure.
This is not a task well-served by a single prompt to a language model. The entity expansion step requires database lookups. The retrieval step requires API calls with pagination. The confidence labeling step requires cross-referencing claims across papers. Each of these is a distinct operation with its own failure modes and validation needs. Decomposing the task into a pipeline makes each stage testable and improvable independently.
3. Method: Agent Workflow
LitPath executes through seven sequential stages. Each stage has a defined contract: specified inputs, outputs, and validation checks. We describe each stage below.
Stage 1: Query Interpretation
The agent parses the input query into a canonical form. This involves:
- Normalizing the primary entity against biomedical ontologies (Disease Ontology for diseases, HGNC for genes, Gene Ontology for processes, Reactome for pathways).
- Resolving synonyms, historical names, and common misspellings (e.g., "PDAC" → "pancreatic ductal adenocarcinoma" → DOID:8647).
- Mapping the species to its NCBI taxonomy ID.
- Translating the optional focus field into retrieval filter parameters.
Output: A canonicalized query object with standardized identifiers, synonym lists, and MeSH term mappings.
Validation: The agent verifies that at least one ontology identifier resolves for the primary entity. If resolution fails, the skill returns a diagnostic error rather than proceeding with an ambiguous query.
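A toy sketch of this canonicalization step, where small in-memory lookup tables stand in for the real ontology services (Disease Ontology, NCBI Taxonomy). The table entries shown are illustrative; the PDAC-to-DOID:8647 mapping follows the example above.

```python
# Toy sketch of Stage 1 canonicalization. The lookup tables stand in for
# real ontology services (Disease Ontology, NCBI Taxonomy); only a few
# illustrative entries are shown.
DISEASE_SYNONYMS = {
    "pdac": ("pancreatic ductal adenocarcinoma", "DOID:8647"),
    "pancreatic ductal adenocarcinoma":
        ("pancreatic ductal adenocarcinoma", "DOID:8647"),
}
TAXONOMY = {"homo sapiens": 9606, "mus musculus": 10090}

def canonicalize(primary_entity: str, species: str) -> dict:
    key = primary_entity.strip().lower()
    if key not in DISEASE_SYNONYMS:
        # Per the Stage 1 contract: fail with a diagnostic error rather
        # than proceed with an ambiguous query.
        raise LookupError(f"could not resolve entity: {primary_entity!r}")
    name, doid = DISEASE_SYNONYMS[key]
    return {"canonical_name": name,
            "ontology_id": doid,
            "taxon_id": TAXONOMY[species.strip().lower()]}

canon = canonicalize("PDAC", "Homo sapiens")
```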
Stage 2: Entity Expansion
Using the canonicalized query, the agent expands the search space by:
- Retrieving known aliases and related terms from NCBI Gene and UniProt.
- For pathway queries, fetching member genes and pathway hierarchy from Reactome, KEGG, or WikiPathways.
- For disease queries, identifying associated genes from DisGeNET, Open Targets, or the GWAS Catalog.
- Generating a comprehensive search term list that combines primary entities, expanded synonyms, and focus-specific modifiers.
Output: An expanded entity set with provenance tracking (which source contributed which synonym or related gene).
Validation: Expansion is bounded. The agent limits the expanded entity set to a configurable maximum (default: 200 entities) and logs truncation events.
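The bounding and provenance-tracking logic can be sketched as follows. The source names and the 200-entity cap follow the text above; the data shapes are illustrative.

```python
# Sketch of Stage 2 bounding and provenance tracking. The 200-entity cap
# follows the text; the (entity, source) pair representation is illustrative.
MAX_ENTITIES = 200

def expand_entities(candidates, max_entities=MAX_ENTITIES):
    """candidates: iterable of (entity, source) pairs, in retrieval order.
    Returns (entity -> set of contributing sources, truncation flag)."""
    provenance = {}
    truncated = False
    for entity, source in candidates:
        if entity not in provenance and len(provenance) >= max_entities:
            truncated = True  # a truncation event to be logged
            continue
        provenance.setdefault(entity, set()).add(source)
    return provenance, truncated

prov, trunc = expand_entities([("KRAS", "NCBI Gene"),
                               ("KRAS", "Open Targets"),
                               ("TP53", "DisGeNET")])
```

Sources that contribute the same synonym accumulate under one entry, preserving provenance without inflating the entity count.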
Stage 3: Literature Retrieval
The agent queries multiple literature sources using the expanded search terms:
- PubMed: Boolean queries using expanded MeSH terms and entity synonyms. Retrieval via the NCBI E-utilities API.
- Semantic Scholar: Semantic search using the primary entity and focus terms. Provides citation context and influential citation counts.
- bioRxiv / medRxiv: Preprint retrieval for recent, unpublished findings, flagged separately from peer-reviewed evidence.
- Cochrane Library: For clinical evidence, if the focus includes therapeutic interventions.
Retrieval is stratified: the agent first fetches review articles and meta-analyses (which provide structured overviews), then primary research articles, then preprints. Deduplication is performed across sources using DOI matching and title similarity.
Output: A ranked corpus of papers with metadata (PMID, DOI, title, authors, year, journal, citation count, article type).
Validation: The agent checks retrieval completeness by comparing the number of returned papers against expected yields from prior queries on similar entities. It flags potential coverage gaps.
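The cross-source deduplication step described above can be sketched as exact DOI matching followed by a title-similarity fallback. The 0.9 similarity threshold is an illustrative choice, not a specified parameter.

```python
from difflib import SequenceMatcher

# Sketch of the Stage 3 deduplication step: DOI matching first, then a
# title-similarity fallback. The 0.9 threshold is an illustrative choice.
def dedupe(papers, title_threshold=0.9):
    """papers: list of dicts with optional 'doi' and required 'title'."""
    kept, seen_dois = [], set()
    for p in papers:
        doi = (p.get("doi") or "").lower()
        if doi and doi in seen_dois:
            continue  # exact DOI duplicate
        title = p["title"].lower()
        if any(SequenceMatcher(None, title, k["title"].lower()).ratio()
               >= title_threshold for k in kept):
            continue  # near-identical title across sources
        if doi:
            seen_dois.add(doi)
        kept.append(p)
    return kept

corpus = dedupe([
    {"doi": "10.1000/x1", "title": "KRAS signaling in PDAC"},
    {"doi": "10.1000/X1", "title": "KRAS signalling in PDAC"},  # same DOI
    {"doi": None, "title": "Ferroptosis regulators in glioma"},
])
```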
Stage 4: Evidence Extraction
For each retrieved paper, the agent extracts:
- Primary claims: The main findings relevant to the query entities, extracted from the abstract and (when available) full text.
- Entity mentions: Occurrences of genes, proteins, pathways, and diseases, mapped to standardized identifiers.
- Evidence type: Categorized as genetic (e.g., GWAS, knockout), pharmacological (e.g., inhibitor studies), expression-based (e.g., RNA-seq, microarray), clinical (e.g., trial outcomes), or computational (e.g., network analysis, in silico prediction).
- Direction of effect: Upregulation, downregulation, activation, inhibition, or association without directional evidence.
- Study context: Model system (cell line, animal model, patient cohort), sample size, and study design quality indicators.
Extraction is performed using a language model prompted with entity-aware extraction templates. The agent processes papers in batches to manage context windows, using abstract-only processing for low-ranked papers and full-text processing for high-ranked papers.
Output: A structured evidence record per paper, with extracted claims, entity mappings, and metadata.
Validation: The agent performs self-consistency checks: extracted claims must reference entities present in the expanded entity set. Claims that reference unmapped entities are flagged for manual review rather than discarded.
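This self-consistency check can be sketched as a simple partition of extracted claims: claims referencing entities outside the expanded set are flagged for review rather than discarded. The claim record fields are illustrative.

```python
# Sketch of the Stage 4 self-consistency check. Claims referencing entities
# outside the expanded set are flagged for manual review, not discarded.
# The claim record fields ('text', 'entities') are illustrative.
def check_claims(claims, expanded_entities):
    """Returns (accepted, flagged) partition of claim records."""
    accepted, flagged = [], []
    for claim in claims:
        unmapped = set(claim["entities"]) - set(expanded_entities)
        if unmapped:
            flagged.append({**claim, "unmapped": sorted(unmapped)})
        else:
            accepted.append(claim)
    return accepted, flagged

ok, review = check_claims(
    [{"text": "GPX4 loss sensitizes cells to ferroptosis",
      "entities": ["GPX4"]},
     {"text": "FOO1 modulates iron uptake",  # hypothetical unmapped entity
      "entities": ["FOO1"]}],
    expanded_entities={"GPX4", "SLC7A11", "ACSL4"})
```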
Stage 5: Pathway and Mechanism Organization
The agent aggregates extracted evidence into a pathway-centric structure:
- Groups evidence by pathway (using Reactome, KEGG, or GO hierarchy) and by mechanism (e.g., "DNA damage repair," "immune evasion," "metabolic reprogramming").
- For each pathway, constructs an evidence network mapping genes to their reported roles and the papers supporting each role.
- Identifies convergence points: genes or mechanisms supported by multiple independent lines of evidence.
- Identifies conflicts: cases where the same gene or pathway has contradictory reported roles across studies, with explicit documentation of the conflicting claims and their sources.
Output: A pathway-mechanism map linking genes, evidence, and literature sources, with conflict annotations.
Validation: The agent verifies that the pathway map covers at least 80% of the candidate targets identified during extraction. Unmapped targets are reported as pathway-independent candidates.
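The 80% coverage check can be sketched as follows; uncovered targets become the pathway-independent candidates mentioned above. The pathway map shape is illustrative.

```python
# Sketch of the Stage 5 coverage check: the pathway map must cover at least
# 80% of extracted candidate targets; the remainder are reported as
# pathway-independent candidates. The data shapes are illustrative.
def check_pathway_coverage(candidates, pathway_map, min_coverage=0.8):
    """pathway_map: dict of pathway name -> set of member genes."""
    mapped = set().union(*pathway_map.values()) if pathway_map else set()
    orphans = [g for g in candidates if g not in mapped]
    coverage = (len(candidates) - len(orphans)) / len(candidates) \
        if candidates else 1.0
    return coverage >= min_coverage, orphans

ok, orphans = check_pathway_coverage(
    ["KRAS", "TP53", "SMAD4", "CDKN2A", "CXCR4"],
    {"RAS signaling": {"KRAS"},
     "Cell cycle": {"TP53", "CDKN2A"},
     "TGF-beta": {"SMAD4"}})
```

Here four of five candidates map to a pathway (coverage 0.8), so the check passes and CXCR4 is reported as pathway-independent.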
Stage 6: Evidence Ranking and Confidence Labeling
Each candidate target and mechanistic claim is assigned a confidence label based on a composite score derived from:
- Evidence volume: Number of independent papers supporting the claim (weighted by citation count and journal impact factor as a proxy for peer-review rigor).
- Evidence diversity: Number of distinct evidence types (genetic, pharmacological, expression-based, clinical). A claim supported by genetic and pharmacological evidence is rated higher than one supported by expression data alone.
- Consistency: Agreement across studies. Conflicting evidence reduces the confidence label.
- Recency: More recent findings are noted, though older foundational studies retain weight.
The labeling scheme is:
| Label | Criteria |
|---|---|
| Strong | ≥5 independent papers, ≥2 evidence types, no unresolved conflicts, includes at least one clinical or in vivo study |
| Moderate | ≥3 independent papers, ≥1 evidence type, conflicts documented but not disqualifying |
| Weak | 1–2 papers, single evidence type, or preprint-only evidence |
| Speculative | Indirect evidence, computational predictions without experimental validation, or single-case reports |
Output: A ranked candidate list with confidence labels and supporting evidence summaries.
Validation: The agent checks that every confidence label is justified by the underlying evidence counts and types. Labels without sufficient support are downgraded.
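The labeling criteria table can be transcribed into a decision function. The thresholds follow the table; the rule ordering (in particular, demoting claims without experimental validation to speculative before the moderate check) is one plausible precedence, not a specified one.

```python
# Sketch of the Stage 6 labeling rules, transcribing the criteria table.
# Thresholds follow the text; the precedence of the rules is an assumption.
def confidence_label(n_papers, n_evidence_types, unresolved_conflicts,
                     has_clinical_or_in_vivo, preprint_only,
                     experimental_validation):
    if (n_papers >= 5 and n_evidence_types >= 2
            and not unresolved_conflicts and has_clinical_or_in_vivo):
        return "strong"
    if not experimental_validation:
        # computational predictions without experimental validation
        return "speculative"
    if n_papers >= 3 and n_evidence_types >= 1:
        return "moderate"
    if n_papers >= 1 and (n_papers <= 2 or preprint_only
                          or n_evidence_types == 1):
        return "weak"
    return "speculative"

label = confidence_label(n_papers=6, n_evidence_types=3,
                         unresolved_conflicts=False,
                         has_clinical_or_in_vivo=True,
                         preprint_only=False,
                         experimental_validation=True)
```

Encoding the rules this way makes the Stage 6 validation check ("labels without sufficient support are downgraded") a pure function of the evidence record.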
Stage 7: Structured Report Generation
The agent assembles the outputs of all preceding stages into a coherent report following the output specification defined in Section 2.2. The report is generated in Markdown with structured sections, tables, and inline citations. Each citation includes the PMID or DOI for verification.
The report includes a methods section documenting the exact query parameters, entity expansion results, retrieval statistics (number of papers retrieved per source, after deduplication), and any deviations from the standard pipeline.
Output: The final structured evidence synthesis report.
Validation: The agent performs a final consistency check: all citations in the report must resolve to papers in the retrieved corpus. The report structure must conform to the output specification schema.
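The final citation check can be sketched as extracting citation identifiers from the report and verifying each against the retrieved corpus. The `PMID:<digits>` inline format is an illustrative convention, not a specified one.

```python
import re

# Sketch of the Stage 7 consistency check: every citation in the report
# must resolve to a paper in the retrieved corpus. The PMID:<digits>
# inline-citation format is an illustrative convention.
def check_citations(report_text, corpus_pmids):
    cited = set(re.findall(r"PMID:(\d+)", report_text))
    dangling = sorted(cited - set(corpus_pmids))
    return len(dangling) == 0, dangling

ok, dangling = check_citations(
    "KRAS drives PDAC progression (PMID:12345; PMID:67890).",
    corpus_pmids={"12345", "67890"})
```

Note that this only catches citations absent from the corpus; misattribution of a real paper (Section 6.1) still requires cross-referencing and spot-checking.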
4. Example Use Cases
4.1 Pathway Analysis in Pancreatic Ductal Adenocarcinoma
Input: {"primary_entity": "pancreatic ductal adenocarcinoma", "species": "Homo sapiens", "focus": "mechanism"}
Expected output: The agent resolves PDAC to DOID:8647, expands to include synonyms and MeSH terms, retrieves literature spanning KRAS signaling, TP53 loss, SMAD4 inactivation, stromal interactions, immune evasion, and metabolic reprogramming. It identifies KRAS, TP53, CDKN2A, and SMAD4 as strong-confidence driver targets based on extensive genetic evidence across TCGA, ICGC, and individual cohort studies. It identifies emerging targets such as CXCR4, hedgehog pathway components, and autophagy regulators at moderate confidence. It documents the conflicting evidence around stromal depletion strategies. The suggested analyses section recommends TCGA-based expression profiling of moderate-confidence targets and single-cell RNA-seq meta-analysis of tumor microenvironment subtypes.
4.2 Ferroptosis Evidence Synthesis in Glioblastoma
Input: {"primary_entity": "ferroptosis", "species": "Homo sapiens", "focus": "mechanism", "secondary_entities": ["glioblastoma"]}
Expected output: The agent retrieves ferroptosis literature linked to glioblastoma, identifies GPX4, SLC7A11, ACSL4, and FSP1 as core ferroptosis regulators with strong evidence. It maps the interplay between ferroptosis and glioblastoma-specific mechanisms including EGFR amplification, IDH1 mutation status, and TMZ resistance. It identifies gaps: limited in vivo evidence for ferroptosis inducers in orthotopic GBM models, and contradictory reports on the role of p53 in modulating ferroptosis sensitivity in glioma cells. The open questions section highlights the unresolved relationship between ferroptosis and immune activation in the GBM microenvironment. Suggested analyses include correlation of ferroptosis gene signatures with TCGA GBM patient survival and drug sensitivity screening of ferroptosis inducers against GBM cell line panels.
4.3 Comparative Gene Prioritization in Inflammatory Bowel Disease
Input: {"primary_entity": "inflammatory bowel disease", "species": "Homo sapiens", "focus": "genetic_association", "secondary_entities": ["NOD2", "IL23R", "ATG16L1", "TNF", "IL10"]}
Expected output: The agent retrieves genetic association studies, GWAS meta-analyses, and functional validation papers for each candidate gene. It produces a comparative evidence matrix ranking the five genes by association strength, replication consistency, functional evidence, and druggability. NOD2 emerges as the strongest genetic association with extensive replication but limited druggability. TNF shows strong clinical evidence (anti-TNF therapy) but weaker genetic association effect sizes. IL23R shows emerging druggability via ustekinumab-related pathway targeting. The report flags the polygenic architecture challenge and recommends polygenic risk score construction incorporating all five loci, followed by pathway enrichment analysis against the full IBD GWAS locus set.
5. Evaluation Plan
We propose a multi-axis evaluation framework. Because LitPath is a skill rather than a deployed product, evaluation focuses on the workflow's output quality across a standardized benchmark set.
5.1 Benchmark Construction
We propose constructing a benchmark of 20 biological queries spanning diverse diseases, pathways, and entity types. Each query will have a reference synthesis prepared by domain experts, including a gold-standard candidate target list and evidence mapping.
5.2 Evaluation Dimensions
| Dimension | Metric | Assessment Method |
|---|---|---|
| Completeness | Recall of key papers and candidate targets against reference synthesis | Automated: compare retrieved PMIDs and candidate gene lists against gold standard |
| Factual consistency | Accuracy of extracted claims against source papers | Automated + human: sample extracted claims and verify against full text |
| Citation validity | Fraction of citations that correctly reference existing papers and accurately represent their findings | Automated: DOI/PMID resolution + human spot-check |
| Biological relevance | Utility of ranked targets and pathway summaries for hypothesis generation | Human: domain experts rate relevance on a 1–5 scale |
| Report structure quality | Adherence to output specification, clarity of presentation, absence of formatting errors | Automated: schema validation + human readability assessment |
| Reproducibility | Consistency of outputs across repeated runs with identical inputs | Automated: run each query 3 times, measure Jaccard similarity of candidate sets and citation sets |
| Confidence calibration | Agreement between assigned confidence labels and independent expert ratings | Human: experts assign blind confidence labels, compare to agent labels |
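The reproducibility metric in the table above (pairwise Jaccard similarity of candidate sets across three repeated runs) can be computed as follows; the candidate sets shown are illustrative.

```python
from itertools import combinations

# Sketch of the reproducibility metric: mean pairwise Jaccard similarity
# of candidate (or citation) sets across repeated runs with identical
# inputs. The example sets are illustrative.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def mean_pairwise_jaccard(runs):
    pairs = list(combinations(runs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

score = mean_pairwise_jaccard([
    {"KRAS", "TP53", "SMAD4", "CDKN2A"},
    {"KRAS", "TP53", "SMAD4", "CXCR4"},
    {"KRAS", "TP53", "SMAD4", "CDKN2A"},
])
```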
5.3 Human Evaluation Protocol
For each benchmark query, two domain experts (bioinformatics researchers with relevant disease/pathway expertise) will independently evaluate the agent's output. They will assess:
- Whether any important papers or targets were missed.
- Whether extracted claims accurately represent the source literature.
- Whether confidence labels reflect the true evidence strength.
- Whether the suggested analyses are feasible and relevant.
Inter-rater agreement will be measured using Cohen's kappa. Disagreements will be resolved by a third expert.
5.4 Automated Checks
The following checks can be automated and run on every execution:
- All cited PMIDs resolve to valid PubMed entries.
- No duplicate citations in the report.
- All candidate genes map to valid HGNC symbols.
- Pathway annotations correspond to valid Reactome or KEGG pathway IDs.
- Confidence labels are consistent with underlying evidence counts (per the criteria in Section 3, Stage 6).
6. Limitations and Risks
6.1 Hallucinated Citations
Language models may generate plausible-sounding but nonexistent citations. LitPath mitigates this by requiring all citations to be validated against the retrieved corpus (Stage 7 validation). However, this does not prevent misattribution — a real PMID may be cited for a claim it does not support. The confidence labeling stage partially addresses this by cross-referencing claims across multiple papers, but expert spot-checking remains essential.
6.2 Incomplete Literature Coverage
No single retrieval strategy captures all relevant literature. Paywalled full texts limit extraction depth. Non-English-language publications are excluded by default. Preprint coverage introduces timeliness but also unreliability. The agent reports its retrieval statistics transparently, but users must interpret results as partial rather than exhaustive.
6.3 Nomenclature Ambiguity
Gene symbols are overloaded across species and contexts. The entity expansion stage (Stage 2) resolves synonyms against NCBI and HGNC, but less common aliases or recently renamed genes may be missed. Users should verify the canonicalized query interpretation summary before relying on downstream results.
6.4 Causality vs. Association
Literature frequently reports statistical associations without establishing causation. The agent labels evidence types (genetic, pharmacological, expression-based) to help users assess the strength of mechanistic inference, but it cannot independently determine causality. Reports should be interpreted as evidence maps, not causal models.
6.5 Bias Toward Well-Studied Entities
Genes and pathways with extensive literature will be overrepresented in the output, regardless of their actual biological importance. Rarely studied but functionally critical genes may receive weak or speculative labels simply due to limited publication volume. The open questions section is designed to flag such gaps, but users should be aware of this ascertainment bias.
6.6 Not for Clinical Decision-Making
LitPath is designed to support research prioritization and hypothesis generation. It is not a clinical decision support tool. The evidence synthesis it produces should not be used to guide patient treatment, drug selection, or diagnostic decisions without independent expert review and clinical validation.
7. Conclusion
Literature-driven target discovery and pathway evidence synthesis is a well-defined, recurrent task in bioinformatics that is poorly served by both manual review and unstructured language model interactions. LitPath demonstrates that this task can be decomposed into an executable, multi-stage agent skill with clear input-output contracts, built-in validation, and a concrete evaluation framework. The skill paradigm offers three advantages over alternative approaches. First, the pipeline structure is explicit and auditable: every claim in the output can be traced to a specific stage and source. Second, the skill is reusable: the same workflow applies to any disease, gene, or pathway query without modification. Third, the skill is evaluable: each stage and the overall output can be assessed against defined quality criteria, enabling systematic comparison across models and iterative improvement.
The proposed evaluation framework — combining automated validation checks with expert human assessment across multiple dimensions — provides a realistic path toward measuring skill quality. We anticipate that skills like LitPath will become standard components of bioinformatics agent platforms, enabling researchers to delegate the mechanical aspects of literature synthesis while retaining control over interpretation and decision-making.
Future work includes extending the skill to support interactive refinement (where the user can challenge specific claims and request targeted re-analysis), integrating with downstream analysis skills (e.g., connecting the candidate target list directly to expression analysis or docking workflows), and developing a community benchmark for comparative evaluation of literature synthesis skills.