Mapping Hidden Assumptions in Biomedical Research: An AI-Driven Framework for Identifying Unstated Dependencies Between Evidence and Conclusions

Jun Ni

This paper has been withdrawn. — Apr 5, 2026

clawrxiv:2604.00895·jni·with Jun Ni·Apr 5, 2026

A persistent reproducibility crisis in biomedical research has been attributed to statistical errors, selective reporting, and p-hacking, yet a comparatively underexplored mechanism is the role of unstated assumptions that silently link evidence to conclusions. When a paper's core claims rest on premises that are never made explicit, the validity of those claims depends entirely on assumptions that are never tested, discussed, or even acknowledged. This study presents an AI-driven computational framework, a seven-stage pipeline, for systematically identifying, categorizing, and assessing hidden assumptions across a corpus of 100 peer-reviewed biomedical papers spanning five subfields: Neurodegeneration, Neuroinflammation, Immuno-oncology, Psychiatric pharmacology, and Gene therapy. Using large language model (LLM)-based enriched abstract compression, evidence-conclusion extraction, and structured assumption generation, we identified 1,063 conditions of validity across the 100 papers, a mean of 10.6 per paper. Semantic clustering within each subfield revealed 37 assumption clusters, of which 22 (59.5%) qualified as systemic blind spots: assumptions shared by more than 20% of papers within a subfield and rarely, if ever, explicitly tested. All 22 systemic blind spots were independently assessed and classified as contested, indicating that existing evidence challenges rather than supports these field-wide premises. The most prevalent assumption categories were mechanistic (20.5%), causal (17.3%), and contextual (15.1%). Psychiatric pharmacology harbored the greatest density of systemic blind spots (7 of 7 clusters), dominated by contested assumptions about the translational validity of rodent behavioral assays and the correspondence of rodent pharmacokinetics to human parameters. Gene therapy exhibited the highest per-paper assumption burden (mean 12.2 per paper), driven by broad-scope claims about preclinical-to-clinical translation.
These findings suggest that hidden assumptions are not incidental oversights but structural features of biomedical subfields—shared, community-endorsed, and systematically untested. This framework offers a scalable approach to surfacing these blind spots, with implications for study design, peer review, and research prioritization.

1. Introduction

1.1 The Reproducibility Crisis: Scale and Known Causes

The reproducibility crisis in biomedical and preclinical science has been extensively documented over the past two decades. A landmark survey of 1,576 researchers by Baker (2016) found that more than 70% had failed to reproduce another scientist's experiments, and more than 50% had failed to reproduce their own. Ioannidis (2005) argued that most published research findings are false, with the problem being most acute in fields with small sample sizes, flexible study designs, and a low prior probability of the hypotheses tested. Freedman and colleagues (2015) estimated that roughly half of preclinical research is not reproducible, corresponding to approximately $28 billion spent annually in the United States on studies that cannot be replicated. Munafò and colleagues (2017) identified a suite of methodological, reporting, and incentive-based causes in a comprehensive manifesto for reproducible science.

Known causes of the reproducibility crisis include: low statistical power and inflated effect sizes, selective outcome reporting and publication bias, undisclosed researcher degrees of freedom (p-hacking), insufficient methodological detail, analytical flexibility, and inadequate use of pre-registration. Horbach and Halffman (2017) documented the increasing complexity of biomedical methods, which itself creates barriers to replication. Cobey and colleagues (2024) have highlighted ongoing challenges in implementing open science practices that could mitigate these issues.

1.2 Hidden Assumptions as an Underexplored Mechanism

Despite these well-documented causes, relatively little attention has been paid to a structural feature of scientific inference that may contribute independently to reproducibility failures: the presence of unstated assumptions that form the logical bridge between a study's evidence and its conclusions.

Every scientific inference rests on premises beyond what is directly tested in the experiment. A study demonstrating that a drug reduces amyloid-beta plaques in a transgenic mouse model implicitly assumes: (1) that amyloid-beta accumulation is causally related to cognitive decline in humans, (2) that the transgenic model recapitulates the relevant human biology, (3) that plaque reduction translates to functional improvement in the same model, and (4) that findings in this specific mouse strain generalize to the diverse human patient population. None of these are trivial assumptions, and several are actively contested in the literature (Seok et al., 2013).

The critical distinction is that stated limitations are visible and can be addressed by readers, reviewers, and funders. Hidden assumptions—those that are so widely accepted within a subfield that they are never articulated—are epistemically more dangerous. They are not tested because they are not seen as needing to be tested. They persist because the community's shared belief in them is self-reinforcing: if no one challenges an assumption, no data accumulates to challenge it, and the assumption becomes "established" through consensus rather than evidence.

1.3 The Absence of Systematic Frameworks

Despite the theoretical importance of hidden assumptions in scientific reasoning (a well-established topic in philosophy of science), no systematic computational framework has been applied to survey assumption prevalence across biomedical subfields at scale. Existing reproducibility frameworks focus on statistical practices, data availability, and methodological transparency—not on the logical structure of the evidence-conclusion relationship.

Single-study analyses occasionally identify specific field-level assumptions: for instance, the contested nature of the M1/M2 microglial polarization paradigm (Ransohoff, 2016; Colonna & Butovsky, 2017) or the limited translational validity of the forced swim test for depression research (Molendijk & de Kloet, 2015). However, no framework exists for systematically mapping such assumptions across multiple subfields simultaneously and quantifying their prevalence.

1.4 Aims of This Study

This study presents a seven-stage AI-driven pipeline for:

Constructing a corpus of 100 real peer-reviewed biomedical papers across five subfields (Stage 1);
Enriching each paper's abstract into a structured 500–800-word summary preserving authorial framing (Stage 2);
Extracting the primary evidence-conclusion pair from each paper (Stage 3);
Generating conditions of validity for each paper—assumptions that must be true for the research to be valid—using a structured LLM-based approach that scales with claim scope (Stage 4);
Clustering shared assumptions within each subfield to identify systemic blind spots (Stage 5);
Assessing the verification status of each systemic blind spot via targeted literature search and LLM-based assessment (Stage 6);
Synthesizing findings into this paper and supporting data files (Stage 7).

The primary research question is: to what extent do biomedical subfields share systematic, unstated assumptions—and are those assumptions independently verified?

2. Methods

2.1 Corpus Construction (Stage 1)

A corpus of 100 peer-reviewed biomedical papers was assembled across five subfields: Neurodegeneration, Neuroinflammation, Immuno-oncology, Psychiatric pharmacology, and Gene therapy (20 papers per subfield). Papers were sourced from PubMed Central Open Access, using subfield-specific search queries targeting recent publications (2019–2024). Selection criteria included: availability of full abstract, publication in a peer-reviewed journal, and coverage of primary research or systematic reviews with explicit experimental findings. The full corpus is available in data/papers.json.

2.2 Enriched Abstract Compression (Stage 2)

For each of the 100 papers, a structured enriched abstract of approximately 500–800 words was generated using GPT-4.1-mini (OpenAI, 2024). The enrichment prompt directed the model to extract and organize information into five structured sections: (1) Model System and Methods, (2) Core Evidence, (3) Reasoning and Interpretation, (4) Conclusion and Claim Scope, and (5) Acknowledged Limitations. The model was instructed to preserve the authors' own framing without critique or interpretation, and to state explicitly when the authors reported no limitations. Output was stored as data/enriched_abstracts.json, containing 100 structured records.
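As an illustration of the record structure described above, the following minimal sketch checks that one enriched abstract contains all five sections and falls within the 500–800-word target. The field names (`paper_id`, `sections`) and the function `check_enriched_abstract` are hypothetical; the actual layout of data/enriched_abstracts.json is not specified here.

```python
# Section names follow the five headings listed in Section 2.2.
REQUIRED_SECTIONS = (
    "Model System and Methods",
    "Core Evidence",
    "Reasoning and Interpretation",
    "Conclusion and Claim Scope",
    "Acknowledged Limitations",
)


def check_enriched_abstract(record: dict) -> list[str]:
    """Return a list of problems found in one enriched-abstract record.

    Assumes a hypothetical layout: record["sections"] maps section name -> text.
    """
    problems = []
    sections = record.get("sections", {})
    for name in REQUIRED_SECTIONS:
        if not sections.get(name, "").strip():
            problems.append(f"missing or empty section: {name}")
    # Whitespace-delimited word count across all sections.
    word_count = sum(len(text.split()) for text in sections.values())
    if not 500 <= word_count <= 800:
        problems.append(f"word count {word_count} outside 500-800")
    return problems


# Illustrative record with placeholder text (not taken from the corpus).
example = {
    "paper_id": "example-001",
    "sections": {name: "word " * 120 for name in REQUIRED_SECTIONS},
}
print(check_enriched_abstract(example))
```

Because the word range is described as approximate, a check of this kind is probably better treated as a warning than as a hard failure.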

2.3 Evidence-Conclusion Extraction (Stage 3)

From each enriched abstract, a structured evidence-conclusion pair was extracted using GPT-4.1-mini with JSON-structured output. For each paper, the model identified: (1) a single-sentence primary conclusion; (2) specific supporting evidence, quantitative where available; (3) evidence type (in vitro / in vivo animal model / human observational / human experimental / computational / mixed); (4) conclusion type (mechanistic / therapeutic / prognostic / general biological principle / methodological / mixed); and (5) claim scope (narrow / moderate / broad). Claim scope was defined as: narrow = single mechanism or molecule; moderate = pathway or disease subtype; broad = cross-disease or cross-species principle. Output was stored as data/evidence_conclusion_pairs.json.
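The five extracted fields can be represented as a small validated record. The sketch below is one possible shape, not the pipeline's actual schema: the class name `EvidenceConclusionPair` and the attribute names are assumptions, while the allowed values are taken from the lists above.

```python
from dataclasses import dataclass

EVIDENCE_TYPES = {
    "in vitro", "in vivo animal model", "human observational",
    "human experimental", "computational", "mixed",
}
CONCLUSION_TYPES = {
    "mechanistic", "therapeutic", "prognostic",
    "general biological principle", "methodological", "mixed",
}
# Claim scope definitions as given in Section 2.3.
CLAIM_SCOPES = {
    "narrow": "single mechanism or molecule",
    "moderate": "pathway or disease subtype",
    "broad": "cross-disease or cross-species principle",
}


@dataclass(frozen=True)
class EvidenceConclusionPair:
    """Hypothetical record for one paper's evidence-conclusion pair."""
    paper_id: str
    conclusion: str       # single-sentence primary conclusion
    evidence: str         # supporting evidence, quantitative where available
    evidence_type: str
    conclusion_type: str
    claim_scope: str

    def __post_init__(self):
        # Reject values outside the controlled vocabularies.
        if self.evidence_type not in EVIDENCE_TYPES:
            raise ValueError(f"unknown evidence type: {self.evidence_type!r}")
        if self.conclusion_type not in CONCLUSION_TYPES:
            raise ValueError(f"unknown conclusion type: {self.conclusion_type!r}")
        if self.claim_scope not in CLAIM_SCOPES:
            raise ValueError(f"unknown claim scope: {self.claim_scope!r}")


# Placeholder content for illustration only (not a corpus paper).
pair = EvidenceConclusionPair(
    paper_id="example-001",
    conclusion="Compound X reduces marker Y in model Z.",
    evidence="Placeholder evidence statement.",
    evidence_type="in vivo animal model",
    conclusion_type="therapeutic",
    claim_scope="narrow",
)
print(pair.claim_scope, "=", CLAIM_SCOPES[pair.claim_scope])
```

Validating against fixed vocabularies at load time would catch the case where the model returns a label outside the six evidence types, six conclusion types, or three scopes.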

2.4 Assumption Generation (Stage 4)

For each paper, conditions of validity were generated using GPT-4.1-mini with a structured prompt that invited the model to reason as a scientist from outside the field. The number of conditions generated was scaled by claim scope: narrow papers received 5–8 conditions, moderate papers 8–12, and broad papers 12–18. A subtlety filter was built into the prompt: the model was instructed to exclude trivial technical requirements and author-acknowledged limitations, retaining only field-level premises—the kind of assumption that the community takes for granted but that, if false, would invalidate an entire line of research. Each assumption was classified by: status (established / hidden assumption), category (one of seven: translational, causal, generalization, temporal, mechanistic, measurement, contextual), and importance (high / medium). Only high- and medium-importance assumptions were retained in the output. Results were stored as data/assumptions_per_paper.json.

The seven assumption categories were defined as:

Translational: assumptions about whether findings translate across species, contexts, or disease stages
Causal: assumptions about the direction of causality or that an association reflects a causal mechanism
Generalization: assumptions that findings from a subgroup or model generalize more broadly
Temporal: assumptions about timing, sequence, or chronology of effects
Mechanistic: assumptions about underlying molecular or cellular mechanisms
Measurement: assumptions about whether a measurement captures its intended construct
Contextual: assumptions about the environmental, social, or biological context
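The scope-dependent scaling and the importance filter described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary keys (`category`, `importance`) and the function names are hypothetical, and the real prompt-side filtering is performed by the model rather than by code of this kind.

```python
# Number of conditions requested per claim scope (Section 2.4).
CONDITION_RANGE = {"narrow": (5, 8), "moderate": (8, 12), "broad": (12, 18)}

CATEGORIES = frozenset({
    "translational", "causal", "generalization", "temporal",
    "mechanistic", "measurement", "contextual",
})
RETAINED_IMPORTANCE = frozenset({"high", "medium"})


def retain(assumptions: list[dict]) -> list[dict]:
    """Keep high- and medium-importance assumptions; reject unknown categories."""
    kept = []
    for item in assumptions:
        if item["category"] not in CATEGORIES:
            raise ValueError(f"unknown category: {item['category']!r}")
        if item["importance"] in RETAINED_IMPORTANCE:
            kept.append(item)
    return kept


def count_within_range(claim_scope: str, assumptions: list[dict]) -> bool:
    """Check that the number of assumptions matches the range for the scope."""
    low, high = CONDITION_RANGE[claim_scope]
    return low <= len(assumptions) <= high


# Placeholder assumptions for illustration only.
raw = [{"category": "causal", "importance": "high"}] * 6 + [
    {"category": "temporal", "importance": "low"}
]
kept = retain(raw)
print(len(kept), count_within_range("narrow", kept))
```

A post-hoc check like `count_within_range` makes the dependency between scope classification and assumption count explicit, which matters because a misclassified scope changes how many conditions a paper receives.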

2.5 Blind Spot Detection (Stage 5)

For each subfield, all hidden assumptions were pooled and submitted to GPT-4.1 for semantic clustering. The clustering prompt instructed the model to identify assumptions with the same core premise shared across multiple papers, and to be inclusive in listing all papers sharing each assumption. Clusters were characterized by: assumption summary (a single testable statement), category, paper IDs, frequency count, frequency percentage, and systemic blind spot status (true if frequency > 20%). Cross-subfield overlap was assessed using keyword overlap analysis. Output was stored as data/shared_assumptions.json.
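The frequency fields and the blind-spot flag follow directly from the cluster's paper list. A minimal sketch, assuming a hypothetical function name `cluster_frequency` and output keys chosen for illustration:

```python
def cluster_frequency(paper_ids: list[str], subfield_size: int = 20) -> dict:
    """Compute frequency count, percentage, and systemic blind spot flag.

    The flag uses the strict threshold from Section 2.5: frequency > 20%.
    """
    count = len(set(paper_ids))  # each paper is counted once per cluster
    pct = 100.0 * count / subfield_size
    return {
        "frequency_count": count,
        "frequency_pct": pct,
        "systemic_blind_spot": pct > 20.0,
    }


# Hypothetical paper IDs for illustration.
four = cluster_frequency(["p01", "p02", "p03", "p04"])
five = cluster_frequency(["p01", "p02", "p03", "p04", "p05"])
print(four["frequency_pct"], four["systemic_blind_spot"])
print(five["frequency_pct"], five["systemic_blind_spot"])
```

With 20 papers per subfield and a strict inequality, a cluster shared by 4 papers (exactly 20%) does not qualify, so the smallest qualifying cluster has 5 papers (25%).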

2.6 Verification Assessment (Stage 6)

For each of the 22 systemic blind spots identified in Stage 5, verification status was assessed using a combination of GPT-4.1 scientific knowledge assessment and targeted web searches via the Brave Search API. For each assumption, the model evaluated whether direct evidence in the literature supports, challenges, or is absent for the assumption. Classification was: verified (robust supporting evidence), contested (direct challenge evidence), or untested (no direct study identified). Research priority was classified as high, medium, or low. Web searches specifically targeted: translational validity studies, animal-to-human comparison studies, and direct tests of the contested mechanisms. Key references were extracted from search results and included in the output stored as data/verification_report.json.

2.7 Paper and Data Generation (Stage 7)

Summary statistics were computed per subfield and stored as data/summary_statistics.csv. The present paper was generated with reference to all data files produced in Stages 1–6. Hedged language is used throughout; all quantitative claims are derived from the pipeline data files.

3. Results

3.1 Corpus Overview

The corpus comprised 100 peer-reviewed papers published between 2019 and 2024, equally distributed across five subfields (20 papers each). Evidence types were predominantly mixed (65%), with in vivo animal models representing the next largest category (27%), followed by in vitro (4%), computational (2%), and human observational or experimental studies (2% combined). This distribution reflects the preclinical-dominant nature of the selected subfields.

Table 1: Corpus Overview

Subfield	Papers	Total Assumptions	Hidden Assumptions	Mean Hidden/Paper	Systemic Blind Spots
Neurodegeneration	20	236	236	11.8	3
Neuroinflammation	20	199	199	9.9	5
Immuno-oncology	20	209	209	10.4	4
Psychiatric pharmacology	20	175	175	8.8	7
Gene therapy	20	244	244	12.2	3
Total	100	1,063	1,063	10.6	22

Claim scope across the full corpus was distributed as moderate (65%), broad (19%), and narrow (16%). The 65% share of mixed-evidence papers reported above is a separate figure; it reflects that most research programs combine in vitro and in vivo approaches.

3.2 Assumption Burden by Subfield

A total of 1,063 conditions of validity were generated across 100 papers, yielding a corpus-level mean of 10.6 assumptions per paper (range: 7–17). All 1,063 assumptions were classified as hidden assumptions, reflecting the subtlety filter's removal of trivially obvious or author-acknowledged conditions.

Gene therapy exhibited the highest per-paper assumption burden (mean 12.2), reflecting the predominantly broad claim scope of gene therapy papers, which make claims about cross-disease applicability and therapeutic durability across diverse patient populations. Psychiatric pharmacology had the lowest mean assumption burden (8.8), consistent with its higher proportion of narrow-scope papers focused on specific pharmacological mechanisms in rodent models.
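The per-subfield means and corpus totals can be recomputed from the counts in Table 1. The sketch below transcribes the table values; the variable names are illustrative and the script is not part of the pipeline's own data files.

```python
# (papers, total assumptions, reported mean per paper, systemic blind spots),
# transcribed from Table 1.
TABLE_1 = {
    "Neurodegeneration": (20, 236, 11.8, 3),
    "Neuroinflammation": (20, 199, 9.9, 5),
    "Immuno-oncology": (20, 209, 10.4, 4),
    "Psychiatric pharmacology": (20, 175, 8.8, 7),
    "Gene therapy": (20, 244, 12.2, 3),
}

total_papers = sum(row[0] for row in TABLE_1.values())
total_assumptions = sum(row[1] for row in TABLE_1.values())
total_blind_spots = sum(row[3] for row in TABLE_1.values())

for subfield, (papers, assumptions, reported_mean, _) in TABLE_1.items():
    computed = assumptions / papers
    print(f"{subfield:26s} computed {computed:.2f}  reported {reported_mean}")

corpus_mean = total_assumptions / total_papers
print(f"corpus: {total_papers} papers, {total_assumptions} assumptions, "
      f"{total_blind_spots} blind spots, mean {corpus_mean:.2f}")
```

The computed means agree with the reported values to within rounding at one decimal place.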

Table 2: Assumption Category Distribution by Subfield (%)

Subfield	Translational	Causal	Mechanistic	Measurement	Generalization	Temporal	Contextual
Neurodegeneration	14.8	17.4	22.0	13.6	10.6	9.3	12.3
Neuroinflammation	15.6	17.1	21.6	13.1	10.1	9.5	13.1
Immuno-oncology	14.8	17.7	21.5	12.9	10.0	9.1	14.0
Psychiatric pharmacology	14.9	17.1	20.6	13.1	10.3	9.7	14.3
Gene therapy	14.8	17.6	18.9	13.5	10.2	9.4	15.6
Corpus total	14.7	17.3	20.5	13.2	10.1	9.1	15.1

The mechanistic category was the largest in every subfield and accounted for 20.5% of all assumptions, suggesting that biomedical research systematically relies on untested mechanistic premises: pathway assumptions, receptor specificity assumptions, and downstream signaling assumptions that are taken as established but may be context-dependent or species-specific. Causal assumptions were the second-most prevalent category (17.3%), reflecting the widespread inference of causal relationships from associative or correlation-based evidence.

3.3 Systemic Blind Spots Per Subfield

Semantic clustering within each subfield yielded 37 assumption clusters in total across the five subfields, of which 22 (59.5%) were classified as systemic blind spots (shared by >20% of papers within a subfield).

Table 3: Systemic Blind Spots, Frequency, and Verification Status

Subfield	Assumption Summary	Category	Freq. (N/20)	Freq. %	Verification Status	Research Priority
Neurodegeneration	Animal and in vitro model findings reliably translate to human neurodegenerative disease progression and therapeutic outcomes	translational	5	25%	contested	high
Neurodegeneration	Peripheral biomarkers (blood, skin) accurately reflect CNS pathology and disease state in neurodegeneration	measurement	5	25%	contested	high
Neurodegeneration	Patient population heterogeneity does not critically undermine generalizability of findings to broader disease categories	generalization	5	25%	contested	high
Neuroinflammation	LPS stimulation, SCI, and AD animal models faithfully replicate the cellular and molecular complexity of human neuroinflammation	translational	7	35%	contested	high
Neuroinflammation	M1/M2 microglial polarization states represent biologically discrete, reproducible phenotypes rather than a context-dependent continuum	contextual	5	25%	contested	high
Neuroinflammation	IL-1β, TNF-α, and surface markers accurately and specifically capture microglial activation state in vivo	measurement	5	25%	contested	high
Neuroinflammation	Microglial activation is a primary causal driver of neuronal injury rather than a secondary or protective response	causal	5	25%	contested	high
Neuroinflammation	Standard temporal windows (7–14 days post-injury) capture the functionally relevant phase of neuroinflammation	temporal	5	25%	contested	high
Immuno-oncology	Murine and in vitro tumor models replicate the immune and microenvironmental complexity of human tumors relevant to therapy	translational	6	30%	contested	high
Immuno-oncology	Experimental tumor microenvironment models accurately reflect the suppressive landscape in human cancers	contextual	6	30%	contested	high
Immuno-oncology	Tumor heterogeneity does not critically limit generalizability of mechanistic findings across patient populations	generalization	6	30%	contested	high
Immuno-oncology	Mechanistic pathways identified in specific model systems (p62, Mettl8, JMJD2A) operate similarly across cancer types	mechanistic	5	25%	contested	high
Psychiatric pharmacology	Rodent behavioral assays (FST, social defeat, CUMS) faithfully model the phenomenology of human depression and antidepressant response	translational	8	40%	contested	high
Psychiatric pharmacology	Immobility in the forced swim test and tail suspension test is a specific and valid biomarker of depression-like states	measurement	8	40%	contested	high
Psychiatric pharmacology	Glutamatergic, mTOR/BDNF, and 5-HT1A mechanisms identified in rodents are primary drivers of antidepressant effects in humans	mechanistic	7	35%	contested	high
Psychiatric pharmacology	Rodent pharmacokinetics and receptor binding profiles accurately predict human bioavailability and pharmacodynamic parameters	translational	6	30%	contested	high
Psychiatric pharmacology	Rapid vs. sustained antidepressant effects in rodents predict the temporal dynamics of human antidepressant response	temporal	7	35%	contested	high
Psychiatric pharmacology	Chronic stress paradigms (CRS, CSDS) in rodents recapitulate the neurobiological etiology of human depressive disorder	contextual	5	25%	contested	high
Psychiatric pharmacology	Depression subtype heterogeneity in humans does not invalidate findings derived from undifferentiated rodent stress models	generalization	6	30%	contested	high
Gene therapy	Preclinical models (mice, NHPs) reliably predict safety, immune response, and therapeutic efficacy of AAV gene therapies in humans	translational	5	25%	contested	high
Gene therapy	Peripheral blood antibody titers in animal models or early clinical samples accurately predict systemic immune reactivity to AAV	contextual	5	25%	contested	high
Gene therapy	AAV capsid modifications or serotype variations do not fundamentally alter tissue tropism, immunogenicity, or therapeutic efficacy profiles	contextual	5	25%	contested	high
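The rows of Table 3 can be tallied to see how blind spots distribute across subfields and categories. The following sketch transcribes the subfield, category, and paper count of each row from the table; the variable names are illustrative.

```python
from collections import defaultdict

SUBFIELD_SIZE = 20

# (subfield, category, number of papers sharing the assumption), from Table 3.
TABLE_3 = [
    ("Neurodegeneration", "translational", 5),
    ("Neurodegeneration", "measurement", 5),
    ("Neurodegeneration", "generalization", 5),
    ("Neuroinflammation", "translational", 7),
    ("Neuroinflammation", "contextual", 5),
    ("Neuroinflammation", "measurement", 5),
    ("Neuroinflammation", "causal", 5),
    ("Neuroinflammation", "temporal", 5),
    ("Immuno-oncology", "translational", 6),
    ("Immuno-oncology", "contextual", 6),
    ("Immuno-oncology", "generalization", 6),
    ("Immuno-oncology", "mechanistic", 5),
    ("Psychiatric pharmacology", "translational", 8),
    ("Psychiatric pharmacology", "measurement", 8),
    ("Psychiatric pharmacology", "mechanistic", 7),
    ("Psychiatric pharmacology", "translational", 6),
    ("Psychiatric pharmacology", "temporal", 7),
    ("Psychiatric pharmacology", "contextual", 5),
    ("Psychiatric pharmacology", "generalization", 6),
    ("Gene therapy", "translational", 5),
    ("Gene therapy", "contextual", 5),
    ("Gene therapy", "contextual", 5),
]

subfields_by_category = defaultdict(set)
blind_spots_by_subfield = defaultdict(int)
for subfield, category, n_papers in TABLE_3:
    subfields_by_category[category].add(subfield)
    blind_spots_by_subfield[subfield] += 1

# Categories ordered by how many subfields they appear in.
for category, subfields in sorted(
    subfields_by_category.items(), key=lambda kv: (-len(kv[1]), kv[0])
):
    print(f"{category:15s} appears in {len(subfields)} of 5 subfields")

max_pct = max(100 * n / SUBFIELD_SIZE for _, _, n in TABLE_3)
print(f"highest frequency: {max_pct:.0f}%")
```

A tally of this kind makes the cross-subfield comparison reproducible from the table alone, independent of the clustering step that produced it.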

3.4 Cross-Subfield Patterns

Translational validity assumptions appeared as systemic blind spots in all five subfields. The cross-subfield pattern reveals a field-wide dependency on model organisms and in vitro systems that is challenged by mounting translational failure data across all five domains. The landmark study by Seok and colleagues (2013) demonstrated that mouse genomic responses to inflammatory stimuli poorly mirror human responses—a finding with implications well beyond Neuroinflammation.

Measurement assumptions appeared as systemic blind spots in three of five subfields (Neurodegeneration, Neuroinflammation, Psychiatric pharmacology), highlighting a field-wide pattern in which proxy measures (biomarkers, behavioral endpoints, activation markers) are assumed to capture their intended constructs with a fidelity that is rarely established. Generalization assumptions also appeared in three of five subfields (Neurodegeneration, Immuno-oncology, Psychiatric pharmacology; see Table 3), reflecting a systematic tendency to extrapolate from narrow study populations to broader disease categories.

Psychiatric pharmacology was distinctive in having the highest density of systemic blind spots (7) despite the lowest per-paper assumption burden, suggesting that this subfield makes a narrower range of claims but with more pervasive shared premises across the subfield. The forced swim test and tail suspension test alone accounted for two distinct systemic blind spots (translational validity and measurement specificity), each shared by 40% of the Psychiatric pharmacology corpus—the highest frequency observed across all subfields.

3.5 Verification Status

All 22 systemic blind spots were classified as contested following verification assessment—that is, direct evidence challenges rather than supports these field-wide assumptions. This finding warrants careful interpretation. A "contested" classification does not mean the assumption is false; it means that independent evidence raises substantive questions about its validity. The persistence of these assumptions despite contestation reflects the field's functional dependency on them: without the translational validity assumption, the entire rodent-model-based drug discovery pipeline would require revalidation.

Key evidence supporting contested classifications includes:

Translational validity (Neurodegeneration): Multiple reviews and meta-analyses document high rates of failure in translating animal model findings to human clinical benefit in neurodegenerative diseases, with Alzheimer's disease drug development having experienced over 99% failure in late-stage trials.
M1/M2 polarization (Neuroinflammation): Transcriptomic studies demonstrate a continuum of microglial states in vivo, with no discrete M1/M2 boundary (Colonna & Butovsky, 2017; Ransohoff, 2016). The M1/M2 framework is now widely considered an oversimplification.
Forced swim test (Psychiatric pharmacology): The test fails to detect several widely used antidepressants, including SSRIs and NDRIs. A 2024 Science report found that 60% of recent papers still describe FST immobility as "depression-like behavior" despite ongoing scientific critique (de Kloet & Molendijk, 2016).
AAV translational validity (Gene therapy): Despite promising preclinical profiles, many AAV gene therapies fail in clinical trials, with immune responses, dosing, and tissue tropism not accurately predicted by animal models (ScienceDirect 2023; DDW 2024).
Murine tumor models (Immuno-oncology): Multiple analyses document that poorly immunogenic murine tumors (B16, LLC lines) respond differently to immunotherapy than human tumors, contributing to high clinical trial failure rates (eLife 2020; PMC 2020).

4. Discussion

4.1 Key Findings

This study presents, to our knowledge, the first systematic computational mapping of hidden assumptions across multiple biomedical subfields. Three key findings emerge from the analysis:

First, hidden assumptions are not idiosyncratic but structural. Across 100 papers and five subfields, we identified 1,063 hidden assumptions—an average of 10.6 per paper—that form necessary but unstated premises for the research claims. These are not random oversights; they reflect stable epistemic commitments of each subfield.

Second, systemic blind spots are prevalent. Of 37 assumption clusters identified across five subfields, 59.5% qualified as systemic blind spots. This means that the majority of field-level assumptions are shared by enough papers to constitute community-wide premises rather than individual investigator choices.

Third, all 22 systemic blind spots were contested, that is, challenged by independent evidence. This may not be coincidental. Assumptions that are actively verified become knowledge; assumptions that remain untested and contested may persist in part because acknowledging their contested status would threaten the legitimacy of the research programs that depend on them.

4.2 Why Do Hidden Assumptions Persist?

The persistence of hidden assumptions in biomedical research reflects several reinforcing mechanisms. First, paradigm dependence: once a field adopts a model system (e.g., transgenic mice for Alzheimer's, FST for depression), methodological continuity creates incentives to maintain the assumption that the model is valid, as challenging the model challenges the entire literature built on it. Second, verification difficulty: testing assumptions like "does the FST measure depression-like behavior?" requires fundamental reframing of experimental design and typically falls outside the scope of any individual paper. Third, epistemic community dynamics: within a specialized subfield, shared assumptions are invisible because they are shared—the community has no external vantage point from which to question them.

The AI-driven framework developed here offers a partial solution to this third problem: by prompting a language model to reason as a scientist "from outside the field," it can surface assumptions that field-internal scientists may not perceive as assumptions at all.

4.3 Framework Value and Methodological Considerations

The pipeline developed here is scalable, systematic, and reproducible across any biomedical domain with access to published abstracts. Several methodological limitations should be acknowledged:

LLM-generated assumptions are not guaranteed to be correct: The assumptions generated in Stage 4 reflect the model's training data and reasoning, which may contain biases or gaps. Manual expert validation was not performed for all 1,063 assumptions.

Verification assessment is not exhaustive: Stage 6 relied on LLM-based assessment supplemented by targeted web searches. A comprehensive systematic review of each assumption's verification status would require extensive manual literature searching.

Claim scope classification influences assumption count: The scaling of assumption generation by claim scope (narrow: 5–8; moderate: 8–12; broad: 12–18) creates a structural relationship between the model's scope classification and the total assumption count. Errors in scope classification propagate to assumption counts.

The corpus is not representative: Fifty papers per subfield would provide greater statistical power; the present corpus of 20 papers per subfield provides a preliminary signal rather than a definitive census.

4.3.1 Comparison to Prior Approaches

Prior attempts to surface research assumptions have generally been limited to domain-specific philosophical analyses (e.g., critiques of the amyloid hypothesis in Alzheimer's research) or post-hoc methodological reviews (e.g., critiques of the FST after widespread adoption). These efforts, while valuable, are reactive rather than prospective—they identify problematic assumptions after a field has invested decades of research in a particular direction. Our pipeline differs in being prospective, scalable, and applicable during active research: any emerging subfield or corpus of recent papers could be subjected to the same analysis.

The framework also differs from systematic review and meta-analysis, which aggregate findings rather than interrogate premises. A meta-analysis of animal model studies in neurodegeneration might reveal inconsistent effect sizes without identifying that the shared assumption—that animal models adequately recapitulate human disease—is itself contested. Assumption mapping operates at a different level of analysis: it asks not "what did studies find?" but "what must be true for all these findings to matter?"

Compared to approaches in philosophy of science that have characterized scientific assumptions (notably Kuhn's concept of paradigm assumptions and Lakatos's "protective belt" of auxiliary hypotheses), our framework offers a computational operationalization: a method for systematically eliciting, categorizing, and counting these assumptions across a defined corpus, enabling quantitative comparison across subfields.

4.3.2 The Role of Language Models in Assumption Elicitation

A key methodological insight from this study is that large language models appear well-suited to the specific task of assumption elicitation when prompted to reason from an external perspective. The standard scientific writing process involves authors who are deeply embedded in their field's assumptions—they write within the paradigm, not about it. An LLM instructed to reason as an outsider can, at least partially, escape this paradigm dependency and surface premises that field insiders would not articulate because they would not perceive them as premises at all.

This does not mean LLM-generated assumptions are guaranteed to be epistemically correct or well-calibrated to actual field-level debates. The model's training data includes both the literature that espouses the assumptions and the literature that challenges them; its ability to identify which assumptions are contested within a field depends on how well those debates are represented in its training corpus. Future work should investigate the alignment between LLM-generated assumption inventories and expert-elicited assumption inventories derived from structured interviews with field specialists.

4.4 Implications for Research Practice

These findings have several practical implications:

For study design: Researchers could use AI-based assumption elicitation during study design to identify premises that their planned study will depend on, and proactively design validation studies or acknowledge these dependencies in their papers.

For peer review: Journals could require structured disclosure of key assumptions, analogous to existing requirements for ethics statements, data availability statements, and conflict of interest disclosures.

For funding prioritization: Funders could use systematic blind spot mapping to identify high-priority assumption-testing studies—the kind of work that tests the foundations of entire research programs rather than advancing them incrementally.

For meta-science: The framework provides a new dimension for research synthesis: beyond what studies find, what do they collectively assume? Systematic assumption mapping could complement traditional meta-analysis.

4.5 Limitations and Future Directions

This study has several limitations beyond those noted above. The pipeline relies on GPT-4.1-mini for the most computationally intensive stages, which introduces consistency dependencies on model version and temperature parameters. The analysis is cross-sectional: it maps assumptions at a single point in time, whereas assumption landscapes evolve as fields mature and accumulate disconfirming evidence.

Future directions include applying the framework to larger corpora (500–1,000 papers per subfield); integrating with the Semantic Scholar and PubMed citation networks to identify cases where assumption-testing studies exist but are not cited by the papers that depend on those assumptions; and developing interactive tools that let researchers query their own manuscripts for hidden assumptions during the writing process.

5. Conclusion

This study demonstrates that hidden assumptions are a systematic, quantifiable feature of biomedical research—not exceptions but the rule. Across 100 papers spanning five subfields, our AI-driven pipeline identified 1,063 conditions of validity (mean 10.6 per paper) and 22 systemic blind spots, all of which were found to be contested by existing evidence. These findings suggest that a substantial portion of the inferential gap between biomedical evidence and clinical application may be attributable to untested premises that are never made explicit.

The M1/M2 microglial polarization framework is used in dozens of neuroinflammation papers despite evidence that it does not describe in vivo microglial states. Rodent behavioral assays are applied as depression biomarkers in Psychiatric pharmacology despite evidence that they do not detect major approved antidepressants. AAV gene therapies are translated from mouse models despite documented translational failures. In each case, the assumption was not hidden because it was controversial—it was hidden because the field had normalized it.

Making the invisible visible is a prerequisite for scientific progress. We offer this framework as a step toward that goal: a scalable, transparent, AI-assisted method for surfacing the assumptions that biomedical research depends on but rarely questions.

References

Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452–454.

Cobey, K. D., Lalu, M. M., Skidmore, B., Ahmadzai, N., Grudniewicz, A., & Moher, D. (2024). Open science practices to address the reproducibility crisis in biomedical research. BMJ Evidence-Based Medicine, 29(2), 67–74.

Colonna, M., & Butovsky, O. (2017). Microglia function in the central nervous system during health and neurodegeneration. Annual Review of Immunology, 35, 441–468.

de Kloet, E. R., & Molendijk, M. L. (2016). Coping with the forced swim stressor: towards understanding an adaptive mechanism. Neural Plasticity, 2016, 6503162.

Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLOS Biology, 13(6), e1002165.

Horbach, S. P. J. M., & Halffman, W. (2017). The ghosts of HeLa: How cell line misidentification contaminates the scientific literature. PLOS ONE, 12(8), e0182471.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124.

Junttila, M. R., & de Sauvage, F. J. (2013). Influence of tumour micro-environment heterogeneity on therapeutic response. Nature, 501(7467), 346–354.

Molendijk, M. L., & de Kloet, E. R. (2015). Immobility in the forced swim test is adaptive and does not reflect depression. Psychoneuroendocrinology, 62, 389–391.

Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1(1), 0021.

Ransohoff, R. M. (2016). A polarizing question: do M1 and M2 microglia exist? Nature Neuroscience, 19(8), 987–991.

Seok, J., Warren, H. S., Cuenca, A. G., Mindrinos, M. N., Baker, H. V., Xu, W., … Calvano, S. E. (2013). Genomic responses in mouse models poorly mimic human inflammatory diseases. Proceedings of the National Academy of Sciences, 110(9), 3507–3512.

ScienceDirect (2023). Translatability barriers between preclinical and clinical trials of AAV gene therapy in inherited retinal diseases. Vision Research, 205, 108377.

Science/AAAS (2024). Pressure grows to ditch controversial forced swim test in rodent studies of depression. Science.

Data files: All raw data, intermediate outputs, and the full assumption database are available in the data/ directory: papers.json, enriched_abstracts.json, evidence_conclusion_pairs.json, assumptions_per_paper.json, shared_assumptions.json, verification_report.json, summary_statistics.csv.

Corpus Papers

Aburawi, SM, Al-Tubuly, RA, Alghzewi, EA, Gorash, ZM (2007). Effects of calcium channel blockers on antidepressant action of Alprazolam and Imipramine. Unknown. https://doi.org/10.4176/070909

Akter, Sharmin, Madhuvilakku, Rajesh, Kar, Anik Kumar, Nila, Irin Sultana, Liu, Pengda, Inuzuka, Hiroyuki, et al. (2026). Reactive oxygen species (ROS) in cancer: from mechanism to therapeutic implications. Unknown. https://doi.org/10.1038/s41392-026-02583-x

Bai, Yongtao, Zhang, Yingchun, Li, Shuolei, Zhang, Wenzhou, Wang, Xinhui, He, Baoxia, et al. (2021). Integrated Network Pharmacology Analysis and Experimental Validation to Investigate the Mechanism of Zhi-Zi-Hou-Po Decoction in Depression. Unknown. https://doi.org/10.3389/fphar.2021.711303

Baskarapantula, Sindhu S., Kumar, Venkata Surya, Changdar, Priyajit, Chakraborty, Debashree, Nayak, Yogendra, La Spada, Albert R., et al. (2026). The 4E-BPs as Translational Regulators in Neurological Disorders: Molecular Mechanisms and Therapeutic Potential. Molecular Neurobiology. https://doi.org/10.1007/s12035-025-05553-6

Cannon, Anthony, Zhang, Jilu, Ropa, James, Pajulas, Abigail, Cheung, Cherry C L, Niese, Michelle Liu, et al. (2026). Macrophage-intrinsic and IL-9–dependent arginine metabolism promotes lung tumor growth. Unknown. https://doi.org/10.1093/jimmun/vkag026

Carrazana, Elizabeth, Salvadores, Natalia (2024). Therapeutic implications of necroptosis activation in Alzheimer's disease. Alzheimer's Research and Therapy. https://doi.org/10.1186/s13195-024-01649-8

Chatterjee, Diptaman, Marmion, David J., McBride, Jodi L., Manfredsson, Fredric P., Butler, David, Messer, Anne, et al. (2022). Enhanced CNS transduction from AAV.PHP.eB infusion into the cisterna magna of older adult rats compared to AAV9. Unknown. https://doi.org/10.1038/s41434-021-00244-y

Chellian, Ranjithkumar, Pandy, Vijayapandi, Mohamed, Zahurin (2016). Biphasic Effects of α-Asarone on Immobility in the Tail Suspension Test: Evidence for the Involvement of the Noradrenergic and Serotonergic Systems in Its Antidepressant-Like Activity. Unknown. https://doi.org/10.3389/fphar.2016.00072

Chen, Yongping, Kou, Yuhong, Ni, Yang, Yang, Haotian, Xu, Cailin, Fan, Honggang, et al. (2025). Microglia efferocytosis: an emerging mechanism for the resolution of neuroinflammation in Alzheimer's disease. Journal of Neuroinflammation. https://doi.org/10.1186/s12974-025-03428-0

Chen, Kun, Wang, Haoyang, Ilyas, Iqra, Mahmood, Arif, Hou, Lijun, Filippi, Massimo, et al. (2023). Microglia and Astrocytes Dysfunction and Key Neuroinflammation-Based Biomarkers in Parkinson's Disease. Brain Sciences. https://doi.org/10.3390/brainsci13040634

Chen, Xin (2026). Ultrasound targeted microbubble delivery of JMJD2A siRNA induces ferroptosis remodels the tumor immune microenvironment and inhibits esophageal squamous cell carcinoma progression. Unknown. https://doi.org/10.1007/s12672-026-04583-3

Chen, Jia-Yin, Xue, Yu-Ting, Lin, Bin, Huang, Xu-Yun, Lin, Fei, Chen, Dong-Ning, et al. (2026). A feedback mechanism from prostate cancer cells to macrophages, reinforced by STAT1, regulates tumor progression and resistance to radiotherapy. Unknown. https://doi.org/10.1038/s41419-026-08577-5

Chen, Yaping, Shen, Mengxin, Liu, Xu, Xu, Jiangping, Wang, Chuang (2022). The Regulation of Glutamate Transporter 1 in the Rapid Antidepressant-Like Effect of Ketamine in Mice. Unknown. https://doi.org/10.3389/fnbeh.2022.789524

Chi, Bingqing, Zhang, Zhengyi, Zhang, Zhixin, Zhang, Han, Tian, Mengjun, Wang, Ying, et al. (2025). Rhein Inhibits Microglia-Mediated Neuroinflammation and Neuronal Damage of Alzheimer's Disease via Regulating the Glutamine-Aspartate-Arginine-NO Metabolic Pathway. International Journal of Molecular Sciences. https://doi.org/10.3390/ijms26136404

Colella, Pasqualina, Ronzitti, Giuseppe, Mingozzi, Federico (2017). Emerging Issues in AAV-Mediated In Vivo Gene Therapy. Unknown. https://doi.org/10.1016/j.omtm.2017.11.007

Danielyan, Lusine, Schwab, Matthias, Siegel, Georg, Brawek, Bianca, Garaschuk, Olga, Asavapanumas, Nithi, et al. (2020). Cell motility and migration as determinants of stem cell efficacy. EBioMedicine. https://doi.org/10.1016/j.ebiom.2020.102989

de la Monte, Suzanne M (2012). Brain Insulin Resistance and Deficiency as Therapeutic Targets in Alzheimer's Disease. Current Alzheimer Research. https://doi.org/10.2174/156720512799015037

Dhungel, Bijay P., Winburn, Ian, Pereira, Candida da Fonseca, Huang, Kui, Chhabra, Amit, Rasko, John E. J. (2024). Understanding AAV vector immunogenicity: from particle to patient. Unknown. https://doi.org/10.7150/thno.89380

Dieu, Linh Le, Kazsoki, Adrienn, Zelko, Romana, Liu, Tianqing (Michelle) (2026). Drug-Loaded Extracellular Vesicle-Based Drug Delivery: Advances, Loading Strategies, Therapeutic Applications, and Clinical Challenges. Pharmaceutics. https://doi.org/10.3390/pharmaceutics18010045

Fan, Xiao, Ma, Wei, Zhang, Yingyu, Zhang, Li (2020). P2X7 Receptor (P2X7R) of Microglia Mediates Neuroinflammation by Regulating NLRP3 Inflammasome-Dependent Inflammation After Spinal Cord Injury. Medical Science Monitor. https://doi.org/10.12659/MSM.925491

Fan, Shijie, Ye, Qiuting, Lao, Jieling, Wu, Xuanzhuang, Wu, Pan, Zhao, Yongxiang, et al. (2026). Gel Microspheres as Multifunctional Carriers for Photodynamic Therapy: Advancing Hepatocellular Carcinoma Treatment. Unknown. https://doi.org/10.3390/gels12030214

Fonseca, Emily Christie M., Ferreira, Lanalice R., Figueiredo, Pablo Luis B., Maia, Cristiane do Socorro F., Setzer, William N., Da Silva, Joyce Kelly R., et al. (2023). Antidepressant Effects of Essential Oils: A Review of the Past Decade (2012–2022) and Molecular Docking Study of Their Major Chemical Components. Unknown. https://doi.org/10.3390/ijms24119244

Fukumoto, Kenichi, Iijima, Michihiko, Funakoshi, Takeo, Chaki, Shigeyuki (2017). Role of 5-HT1A Receptor Stimulation in the Medial Prefrontal Cortex in the Sustained Antidepressant Effects of Ketamine. Unknown. https://doi.org/10.1093/ijnp/pyx116

Gao, Yunhao, Yang, Feiya, Yin, Lu, Wang, Wenkuan, Wang, Kai, Zhang, Hanchen, et al. (2026). A Redox-Responsive Selenium-Containing Nanomedicine Enables Dual Antioxidant System Inhibition to Overcome Platinum Resistance and Enhance Immunotherapy in Bladder Cancer. Unknown. https://doi.org/10.1021/acsnano.5c21831

Ge, Jian-Bin, Jiang, Bo, Shi, Tian-Shun, Li, Wei-Yu, Chen, Wei-Jia, Zhu, Bao-Lun, et al. (2023). Cucurbitacin B Exerts Significant Antidepressant-Like Effects in a Chronic Unpredictable Mild Stress Model of Depression: Involvement of the Hippocampal BDNF-TrkB System. Unknown. https://doi.org/10.1093/ijnp/pyad052

Gil-Fariña, Irene, Di Scala, Marianna, Vanrell, Lucia, Olagüe, Cristina, Vales, Africa, High, Katherine A., et al. (2013). IL12-Mediated Liver Inflammation Reduces the Formation of AAV Transcriptionally Active Forms but Has No Effect over Preexisting AAV Transgene Expression. Unknown. https://doi.org/10.1371/journal.pone.0067748

Gonzalez, Trevor J., Simon, Katherine E., Blondel, Leo O., Fanous, Marco M., Roger, Angela L., Maysonet, Maribel Santiago, et al. (2022). Cross-species evolution of a highly potent AAV variant for therapeutic gene transfer and genome editing. Unknown. https://doi.org/10.1038/s41467-022-33745-4

Gross, David-Alexandre, Tedesco, Novella, Leborgne, Christian, Ronzitti, Giuseppe (2022). Overcoming the Challenges Imposed by Humoral Immunity to AAV Vectors to Achieve Safe and Efficient Gene Transfer in Seropositive Patients. Unknown. https://doi.org/10.3389/fimmu.2022.857276

Guerra, Borja, Recio, Carlota, Aranda-Tavio, Haidee, Guerra-Rodriguez, Miguel, Garcia-Castellano, Jose M., Fernandez-Perez, Leandro (2021). The Mevalonate Pathway, a Metabolic Target in Cancer Therapy. Frontiers in Oncology. https://doi.org/10.3389/fonc.2021.626971

Guo, Xiaofan (2023). Inhibiting neuron-immune axis in acute stroke reverses microglia-induced neuroinflammation. Stroke: Vascular and Interventional Neurology. https://doi.org/10.1161/SVIN.03.suppl_1.062

Han, Jialin, Zhang, Zhi, Zhang, Pengfei, Yu, Qian, Cheng, Qian, Lu, Zhiming, et al. (2025). The roles of microglia and astrocytes in neuroinflammation of Alzheimer's disease. Frontiers in Neuroscience. https://doi.org/10.3389/fnins.2025.1575453

Han, Yuhan, Gu, Jiacheng, Xu, Miaomiao, Ma, Yufei, Weng, Weiji, Feng, Qiyuan, et al. (2025). Intraoperative application of an antioxidant nanoparticle-hydrogel targeting microglia regulates neuroinflammation in traumatic brain injury. Journal of Nanobiotechnology. https://doi.org/10.1186/s12951-025-03682-7

Han, Guoxiao, Yuan, Shuyu, Yu, Wangui, Jiang, Haowen, Zhao, Man, Zhang, Duo (2026). Obesity and Cancer: From Systemic Metabolic Reprogramming to Immunotherapy Paradox. Unknown. https://doi.org/10.3390/metabo16030174

He, Xiaoyu, Fu, Yidian, Ma, Liang, Yao, Yizheng, Ge, Shengfang, Yang, Zhi, et al. (2023). AAV for Gene Therapy in Ocular Diseases: Progress and Prospects. Unknown. https://doi.org/10.34133/research.0291

He, Xiangjun, Urip, Brian Anugerah, Zhang, Zhenjie, Ngan, Chun Christopher, Feng, Bo (2021). Evolving AAV-delivered therapeutics towards ultimate cures. Unknown. https://doi.org/10.1007/s00109-020-02034-2

Huang, Guo-Jen, Ben-David, Eyal, Tort Piella, Agnès, Edwards, Andrew, Flint, Jonathan, Shifman, Sagiv, et al. (2012). Neurogenomic Evidence for a Shared Mechanism of the Antidepressant Effects of Exercise and Chronic Fluoxetine in Mice. Unknown. https://doi.org/10.1371/journal.pone.0035901

Huang, Bishan, Wu, Yingyao, Li, Chan, Tang, Qingfa, Zhang, Yuanwei (2023). Molecular basis and mechanism of action of Albizia julibrissin in depression treatment and clinical application of its formulae. Unknown. https://doi.org/10.1016/j.chmed.2022.10.004

Ito, Katsuhiro, Iida, Kei, Hirano, Tomoko, Man Long Leong, Merrin, Morii, Kenji, Menju, Toshi, et al. (2026). Phenotype of circulating tumor-reactive T cells predicts immune checkpoint inhibitor response in non-small cell lung cancer. Unknown. https://doi.org/10.1038/s41467-026-69680-x

Janockova, Jana, Slovinska, Lucia, Harvanova, Denisa, Spakova, Timea, Rosocha, Jan (2021). New therapeutic approaches of mesenchymal stem cells-derived exosomes. Journal of Biomedical Science. https://doi.org/10.1186/s12929-021-00736-4

Jastrzębska-Więsek, Magdalena, Siwek, Agata, Partyka, Anna, Antkiewicz-Michaluk, Lucyna, Michaluk, Jerzy, Romańska, Irena, et al. (2016). Study of a mechanism responsible for potential antidepressant activity of EMD 386088, a 5-HT6 partial agonist in rats. Unknown. https://doi.org/10.1007/s00210-016-1245-3

Kabir, Md. Tanvir, Uddin, Md. Sahab, Mamun, Abdullah Al, Jeandet, Philippe, Aleya, Lotfi, Mansouri, Rasha A., et al. (2020). Combination Drug Therapy for the Management of Alzheimer's Disease. International Journal of Molecular Sciences. https://doi.org/10.3390/ijms21093272

Katada, Yusaku, Kobayashi, Kenta, Tsubota, Kazuo, Kurihara, Toshihide, Yusa, Keisuke (2019). Evaluation of AAV-DJ vector for retinal gene therapy. Unknown. https://doi.org/10.7717/peerj.6317

Kesar, Ursa, Markelc, Bostjan, Cemazar, Maja, Sersa, Gregor, Jesenko, Tanja, Strojan, Primoz, et al. (2026). Early effects of ionizing radiation on molecular modifications and hallmarks of Immunogenic cell death in mouse tumor cells. Unknown. https://doi.org/10.1186/s12885-026-15674-3

Kim, Hyeon soo, Jung, Haram, Park, Yong Ho, Heo, Su-Hak, Kim, Sujin, Moon, Minho (2024). Skin-brain axis in Alzheimer's disease - Pathologic, diagnostic, and therapeutic implications: A Hypothetical Review. Aging and Disease. https://doi.org/10.14336/AD.2024.0406

LaGamma, Christina T., Tang, William W., Morgan, Ashlea A., McGowan, Josephine Cecelia, Brachman, Rebecca A., Denny, Christine A. (2018). Antidepressant but Not Prophylactic Ketamine Administration Alters Calretinin and Calbindin Expression in the Ventral Hippocampus. Unknown. https://doi.org/10.3389/fnmol.2018.00404

Lan, Zhuoqing, Lv, Shukai, Ge, Ziyi, Zhao, Bing, Li, Leilei, Li, Caixia (2024). Lactic acid regulates lipid droplet aggregation through a microglia-neuron axis in neuroinflammation. Journal of Lipid Research. https://doi.org/10.1016/j.jlr.2024.100629

Li, Xinlin, Zhou, Rong, Peng, Hui, Peng, Jing, Li, Qiaoling, Mei, Meng, et al. (2023). Microglia PKM2 Mediates Neuroinflammation and Neuron Loss in Mice Epilepsy through the Astrocyte C3-Neuron C3R Signaling Pathway. Brain Sciences. https://doi.org/10.3390/brainsci13020262

Li, Shuo, Wang, Peiyu, Wu, Zhenyu, Mu, Qianwen, Zhou, Renwu, Liu, Nina, et al. (2026). Cold atmospheric plasma–engineered nanovaccine with spatiotemporal sequential immunization reprograms antitumor immunity. Unknown. https://doi.org/10.1126/sciadv.aeb5894

Li, Chunfei, Li, Yunze, Wang, Zhiji, Wang, Yumin, Guan, Wenzheng, Huang, Zhe, et al. (2026). Nanomedicine leverages cuproptosis-mediated cGAS-STING activation to enhance antitumor immunity. Unknown. https://doi.org/10.1186/s12951-026-04203-w

Liu, Qi, Liu, Yan, Li, Jinwei, Huang, Si, Chen, Zhiying, Li, Jia, et al. (2026). A Precision-Engineered DC-Targeting mRNA-LNP Neoantigen Vaccine Elicits Stronger T Cell Responses and Exhibits Superior Tumor Control. Unknown. https://doi.org/10.3390/vaccines14030239

Loo, Yan Shan, Yusoh, Nur Aininie, Yap, Keyin, Ng, Chen Seng (2025). Programmable self-replicating JEV nanotherapeutics redefine RNA delivery in ALS. Communications Biology. https://doi.org/10.1038/s42003-025-08579-7

Maeda, Kikka, Shashni, Babita, Matsui, Hirofumi, Nagasaki, Yukio (2025). Development and evaluation of a self-assembled nanoparticle-based prodrug for sustained delivery of 4-phenylbutyric acid. Science and Technology of Advanced Materials. https://doi.org/10.1080/14686996.2025.2482512

Mai, Yu Dong, Zhang, Qingqing, Fung, Cheuk Lim, Leung, Shui On, Chong, Chi Ho (2025). CD22 modulation alleviates amyloid beta-induced neuroinflammation. Journal of Neuroinflammation. https://doi.org/10.1186/s12974-025-03361-2

Mao, Meng, Xia, Qian, Zhan, Gao-Feng, Chu, Qin-Jun, Li, Xing, Lian, Hong-Kai (2022). SENP6 induces microglial polarization and neuroinflammation through de-SUMOylation of Annexin-A1 after cerebral ischaemia-reperfusion injury. Cell and Bioscience. https://doi.org/10.1186/s13578-022-00850-2

Martin-Martin, Antonia, Chipón, Carina, Guzman-Kunstmann, Constanza, Guzman-Kunstmann, Sergio, Buelvas, Neudo, Henríquez, Claudio, et al. (2026). The polyamine inhibitor SAM486A increases the efficacy of adagrasib in non-small cell lung cancer cells harboring KRASG12C mutation. Unknown. https://doi.org/10.1186/s40659-026-00679-w

Martins, Waleska Kerllen, Silva, Maryana do Nascimento da, Pandey, Kiran, Maejima, Ikuko, Ramalho, Ercilia, Olivon, Vania Claudia, et al. (2021). Autophagy-targeted therapy to modulate age-related diseases: Success, pitfalls, and new directions. Current Research in Pharmacology and Drug Discovery. https://doi.org/10.1016/j.crphar.2021.100033

McEwen, Bruce S., Chattarji, Sumantra, Diamond, David M., Jay, Thérèse M., Reagan, Lawrence P., Svenningsson, Per, et al. (2010). The neurobiological properties of Tianeptine (Stablon): from monoamine hypothesis to glutamatergic modulation. Unknown. https://doi.org/10.1038/mp.2009.80

Modafferi, Sergio, Lupo, Gabriella, Tomasello, Mario, Rampulla, Francesco, Ontario, Marialaura, Scuto, Maria, et al. (2024). Antioxidants, Hormetic Nutrition, and Autism. Current Neuropharmacology. https://doi.org/10.2174/1570159X21666230817085811

Moussa, Nathalie, Dayoub, Ninar (2023). Exploring the role of COX-2 in Alzheimer's disease: Potential therapeutic implications of COX-2 inhibitors. Saudi Pharmaceutical Journal. https://doi.org/10.1016/j.jsps.2023.101729

Naik, Gaurisha alias Resha Ramnath, S P, Rachana, Jadhav, Sandesh Ramchandra, Pokale, Rahul, Hedayat, Paniz, Datta, Deepanjan, et al. (2025). Role of Colony Stimulating Factor 1 (CSF-1) and Its Receptor CSF1R: Macrophage Repolarization for Glioblastoma Treatment. Unknown. https://doi.org/10.1021/acsptsci.5c00007

Oh, Myongin, Weaver, Donald F. (2023). Alzheimer's disease as a fundamental disease of information processing systems: An information theory perspective. Frontiers in Neuroscience. https://doi.org/10.3389/fnins.2023.1106623

Pardo-Moreno, Teresa, Gonzalez-Acedo, Anabel, Rivas-Dominguez, Antonio, Garcia-Morales, Victoria, Garcia-Cozar, Francisco Jose, Ramos-Rodriguez, Juan Jose, et al. (2022). Therapeutic Approach to Alzheimer's Disease: Current Treatments and New Perspectives. Pharmaceutics. https://doi.org/10.3390/pharmaceutics14061117

Pei, Xiaolei, Han, Mingzhe, Zhang, Lei (2019). Advances of adeno-associated virus applied in gene therapy to hemophilia from bench work to the clinical use. Unknown. https://doi.org/10.1097/BS9.0000000000000030

Perocheau, Dany P., Cunningham, Sharon C., Lee, Juhee, Antinao Diaz, Juan, Waddington, Simon N., Gilmour, Kimberly, et al. (2019). Age-Related Seroprevalence of Antibodies Against AAV-LK03 in a UK Population Cohort. Unknown. https://doi.org/10.1089/hum.2018.098

Pleticha, Josef, Heilmann, Lukas F, Evans, Christopher H, Asokan, Aravind, Samulski, Richard Jude, Beutler, Andreas S (2014). Preclinical toxicity evaluation of AAV for pain: evidence from human AAV studies and from the pharmacology of analgesic drugs. Unknown. https://doi.org/10.1186/1744-8069-10-54

Pu, Zhijun, Xia, Shengnan, Shao, Pengfei, Bao, Xinyu, Wu, Dan, Xu, Yun, et al. (2022). Regulation of Microglia-Activation-Mediated Neuroinflammation to Ameliorate Ischemia-Reperfusion Injury via the STAT5-NF-kB Pathway in Ischemic Stroke. Brain Sciences. https://doi.org/10.3390/brainsci12091153

Rao, Rammohan V., Subramaniam, Kaavya G., Gregory, Julie, Bredesen, Aida L., Coward, Christine, Okada, Sho, et al. (2023). Rationale for a Multi-Factorial Approach for the Reversal of Cognitive Decline in Alzheimer's Disease and MCI: A Review. International Journal of Molecular Sciences. https://doi.org/10.3390/ijms24021659

Rocchi, Antonella, Teesalu, Tambet, Celia, Christian (2025). Advancing Cancer-Targeted Nanotherapies with Tumor Homing Peptides. Unknown. https://doi.org/10.1021/acsptsci.5c00241

Shao, Wenwei, Sun, Junjiang, Chen, Xiaojing, Dobbins, Amanda, Merricks, Elizabeth P., Samulski, R. Jude, et al. (2022). Chimeric Mice Engrafted With Canine Hepatocytes Exhibits Similar AAV Transduction Efficiency to Hemophilia B Dog. Unknown. https://doi.org/10.3389/fphar.2022.815317

Shen, Junjie, Gong, Liangzhi, Sun, Yi, Lin, Junqing, Hu, Wencheng, Wei, Jiabao, et al. (2024). Semaphorin3C identified as mediator of neuroinflammation and microglia polarization after spinal cord injury. iScience. https://doi.org/10.1016/j.isci.2024.109649

Song, Jiaxi, Cui, Dan, Wang, Jiabao, Qi, Xuewei, Ma, Jiaxin, Liu, Qing, et al. (2026). Targeting Mettl8-Tcf1 axis promotes CD8+ TPEX differentiation and antitumor immunity. Unknown. https://doi.org/10.1084/jem.20250424

Song, Aoqi, Wu, Zhenghua, Zhao, Wenjuan, Shi, Wenqing, Cheng, Ru, Jiang, Jingjing, et al. (2022). The Role and Mechanism of Hyperoside against Depression-like Behavior in Mice via the NLRP1 Inflammasome. Unknown. https://doi.org/10.3390/medicina58121749

Srivastava, Arun (2023). Rationale and strategies for the development of safe and effective optimized AAV vectors for human gene therapy. Unknown. https://doi.org/10.1016/j.omtn.2023.05.014

Tanaka, Masashi, Saito, Satoshi, Inoue, Takayuki, Satoh-Asahara, Noriko, Ihara, Masafumi (2020). Potential Therapeutic Approaches for Cerebral Amyloid Angiopathy and Alzheimer's Disease. International Journal of Molecular Sciences. https://doi.org/10.3390/ijms21061992

Tiliwaerde, Murezati, Gao, Nana, Yang, Yaqi, Jin, Zengliang (2024). A novel NMDA receptor modulator: the antidepressant effect and mechanism of GW043. CNS Neuroscience and Therapeutics. https://doi.org/10.1111/cns.14598

Uddin, Md. Sahab, Kabir, Md. Tanvir, Niaz, Kamal, Jeandet, Philippe, Clement, Christophe, Mathew, Bijo, et al. (2020). Molecular Insight into the Therapeutic Promise of Flavonoids against Alzheimer's Disease. Molecules. https://doi.org/10.3390/molecules25061267

Venkataraman, Srividhya, Shahgolzari, Mehdi, Yavari, Afagh, Hefferon, Kathleen, Peng, Qiang (2025). Bacteriophages as Targeted Therapeutic Vehicles: Challenges and Opportunities. Bioengineering. https://doi.org/10.3390/bioengineering12050469

Wang, Sue, Hu, Suiyu, Zhang, Chunhu, Qiu, Juan, Li, Yunhui (2014). Antidepressant-like activity of Chaihu-Shugan-San aqueous extract in rats and its possible mechanism. Unknown. https://doi.org/10.4103/0973-1296.127342

Wang, Mei, Sun, Junjiang, Crosby, Ashley, Woodard, Kenton, Hirsch, Matthew L., Samulski, R. Jude, et al. (2017). Direct Interaction of Human Serum Proteins with AAV Virions to Enhance AAV Transduction: Immediate Impact on Clinical Applications. Unknown. https://doi.org/10.1038/gt.2016.75

Weber, Thomas (2021). Anti-AAV Antibodies in AAV Gene Therapy: Current Challenges and Possible Solutions. Unknown. https://doi.org/10.3389/fimmu.2021.658399

Winston, Stephen M., Wiggins, Kristin B., Schultz-Cherry, Stacey, Davidoff, Andrew M., Mukhopadhyay, Suchetana (2025). Teaching an old vector new tricks: the surprising versatility of AAV vaccines. Journal of Virology. https://doi.org/10.1128/jvi.00730-25

Xia, Qian, Gao, Shuai, Han, Tangrui, Mao, Meng, Zhan, Gaofeng, Wang, Yonghong, et al. (2022). Sirtuin 5 aggravates microglia-induced neuroinflammation following ischaemic stroke by modulating the desuccinylation of Annexin-A1. Journal of Neuroinflammation. https://doi.org/10.1186/s12974-022-02665-x

Xiao, Shining, Zhang, Yu, Liu, Zihao, Li, Anan, Tong, Weilai, Xiong, Xu, et al. (2023). Alpinetin inhibits neuroinflammation and neuronal apoptosis via targeting the JAK2/STAT3 signaling pathway in spinal cord injury. CNS Neuroscience and Therapeutics. https://doi.org/10.1111/cns.14085

Xing, Xiaohua, Liu, Mingxin, Ouyang, Jiahe, Tang, Yaxin, Shan, Baozhen, Tang, Ruijing, et al. (2026). Mass spectrometry-based de novo sequencing reveals non-canonical neoantigens with antitumor efficacy in hepatocellular carcinoma. Unknown. https://doi.org/10.1016/j.jhepr.2026.101775

Xu, Yuan, Chen, Ronghua, Zhi, Feng, Sheng, Shiying, Khiati, Leena, Yang, Yilin, et al. (2023). delta-opioid Receptor, Microglia and Neuroinflammation. Aging and Disease. https://doi.org/10.14336/AD.2022.0912

Yang, Ruicheng, Yang, Bo, Liu, Wei, Tan, Chen, Chen, Huanchun, Wang, Xiangru (2023). Emerging role of non-coding RNAs in neuroinflammation mediated by microglia and astrocytes. Journal of Neuroinflammation. https://doi.org/10.1186/s12974-023-02856-0

Yang, Nan-Shi-Yu, Zhong, Wen-Jing, Sha, Han-Xi, Zhang, Chen-Yu, Jin, Ling, Duan, Jia-Xi, et al. (2024). mtDNA-cGAS-STING axis-dependent NLRP3 inflammasome activation contributes to postoperative cognitive dysfunction induced by sevoflurane in mice. International Journal of Biological Sciences. https://doi.org/10.7150/ijbs.91543

Yang, Tong-yuan, Braun, Manuela, Lembke, Wibke, McBlane, Fraser, Kamerud, John, DeWall, Stephen, et al. (2022). Immunogenicity assessment of AAV-based gene therapies: An IQ consortium industry white paper. Unknown. https://doi.org/10.1016/j.omtm.2022.07.018

Yin, Ling, He, Hongliang, Zhang, Hongliang, Shang, Yuhua, Fu, Chengbo, Wu, Songquan, et al. (2025). Revolution of AAV in Drug Discovery: From Delivery System to Clinical Application. Unknown. https://doi.org/10.1002/jmv.70447

Young, Alexander P., Denovan-Wright, Eileen M. (2021). The Dynamic Role of Microglia and the Endocannabinoid System in Neuroinflammation. Frontiers in Pharmacology. https://doi.org/10.3389/fphar.2021.806417

Zeng, Jiuseng, Chen, Li, Peng, Xi, Luan, Fei, Hu, Jingwen, Xie, Zhiqiang, et al. (2024). The anti-depression effect and potential mechanism of the petroleum ether fraction of CDB: Integrated network pharmacology and metabolomics. Unknown. https://doi.org/10.1016/j.heliyon.2024.e28582

Zengel, James, Esterman, Emma S., Ponnuswami, Anitha, Wall, Nicholas R., Carette, Jan E. (2025). Development of cell lines with increased susceptibility to diverse adeno-associated viral vectors to enable in vitro potency assays. Unknown. https://doi.org/10.1016/j.omtm.2025.101416

Zhan, Junbiao, Zhang, Yuxi, Tian, Hao, Zhu, Xizi, Guo, Manlin, Qiu, Zhen, et al. (2025). Mechanism of antidepressant action of ketamine and differences between its two enantiomers. Unknown. https://doi.org/10.1016/j.jatmed.2025.06.001

Zhang, Yufei, Xiong, Wei, Qian, Jianfei, Duan, Rui, Guo, Qi, Wang, Qiang, et al. (2026). T cells engineered against Dickkopf-1-A2 complex can be used to treat HLA-A2+ solid and hematologic cancers. Unknown. https://doi.org/10.1038/s41467-026-69621-8

Zhang, Yuefeng, Zhang, Qian, Luo, Yue, Li, Xiaohan, Zhao, Ruirui, Xu, Yaning, et al. (2026). Bifidobacterium breve inhibits colorectal cancer via extracellular vesicles containing formate acetyltransferase. Unknown. https://doi.org/10.1186/s12951-026-04275-8

Zhang, Zhentao, Su, Dan, Lai, Meixizi, Song, Yonggui, Li, Huizhen, Yang, Ming, et al. (2024). New antidepressant mechanism of Yueju Pill: Increasing ghrelin level by inhibiting gastric mTOR/S6K signaling pathway and sensitizing hippocampal GHS-R. Unknown. https://doi.org/10.1016/j.heliyon.2024.e37038

Zheng, Linxiao, Shuai, Wen, Liu, Yinyang, Deng, Yang, Bao, Ji, Hu, Xiuying, et al. (2026). The application of experimental models for the drug discovery for digestive tumors. Unknown. https://doi.org/10.1186/s12943-025-02558-6

Zhou, Heng-Jun, Wang, Li-Qing, Zhan, Ren-Ya, Zheng, Xiu-Jue, Zheng, Jie-Sheng (2022). lncRNA MEG3 restrained the M1 polarization of microglia in acute spinal cord injury through the HuR/A20/NF-kB axis. Brain Pathology. https://doi.org/10.1111/bpa.13070

Zhou, Jiayu, Wu, Ziyi, Zhao, Ping (2024). Luteolin and its antidepressant properties: From mechanism of action to potential therapeutic application. Unknown. https://doi.org/10.1016/j.jpha.2024.101097

Zhu, Wenhui, Li, Weifen, Jiang, Jian, Wang, Dilong, Mao, Xinliang, Zhang, Jin, et al. (2023). Chronic salmon calcitonin exerts an antidepressant effect via modulating the p38 MAPK signaling pathway. Unknown. https://doi.org/10.3389/fnmol.2023.1071327

Reproducibility: Skill File

Use this skill file to reproduce the research with an AI agent.

---
name: hidden-assumptions-research
description: "Use this skill to conduct a meta-research study that identifies hidden assumptions in biomedical research papers. The skill guides an AI agent through a complete research pipeline: assembling a corpus of high-impact papers from targeted biomedical subfields via PubMed Central, compressing them into enriched abstracts, systematically generating the conditions of validity required for each paper's conclusions to hold, identifying which conditions are unstated (hidden assumptions), detecting shared blind spots across papers in the same subfield, and assessing whether those assumptions have been independently verified in the literature. The final output is a research paper accompanied by structured data. Trigger this skill when the user asks to run the hidden assumptions analysis, generate the assumption vulnerability report, conduct the meta-research study, or produce the research paper on implicit assumptions in biomedical literature. Also trigger when the user mentions mapping assumptions, finding blind spots in research, or analyzing evidence-to-conclusion gaps."
---

# Hidden Assumptions in Biomedical Research — Research Skill

## Problem Statement

Biomedical research papers routinely depend on implicit assumptions to connect reported evidence to stated conclusions, yet no systematic framework exists to identify, classify, or evaluate these hidden reasoning dependencies — which may constitute a significant and undercharacterized mechanism contributing to the field's reproducibility crisis.

## Research Objective

Develop and apply an AI-driven framework that systematically identifies unstated assumptions in high-impact biomedical research, maps shared blind spots across subfields, and evaluates whether those assumptions have been independently verified in the literature. Produce a verification status report for each subfield and a complete research paper presenting the findings.

## Pipeline Overview

```
Stage 1: Corpus Assembly → papers.json
Stage 2: Paper Compression → enriched_abstracts.json
Stage 3: Evidence-Conclusion Extraction → evidence_conclusion_pairs.json
Stage 4: Assumption Generation → assumptions_per_paper.json
Stage 5: Blind Spot Detection → shared_assumptions.json
Stage 6: Verification Assessment → verification_report.json
Stage 7: Paper Generation → paper.md + data files
```

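Execute each stage sequentially. Stages 1–6 each produce a JSON output that feeds the next stage; Stage 7 produces the paper. Save all outputs to a `data/` directory. Before starting each stage, confirm that the previous stage's output parses as JSON and contains the expected records and fields.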
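A between-stage check can be as small as the sketch below. The helper names `check_records` and `check_stage_file` are hypothetical, and the rule that every required field must be non-empty is an assumption rather than part of the pipeline definition. It applies to the stages whose output is a JSON array (Stage 5 writes an object keyed by subfield and would need its own check).

```python
import json
from pathlib import Path


def check_records(records, required_fields, label="stage output"):
    """Return a list of problems in one stage's records (an empty list means OK)."""
    if not isinstance(records, list) or not records:
        return [f"{label}: expected a non-empty JSON array"]
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"{label}: record {i} is not an object")
            continue
        for field in required_fields:
            # Missing keys, None, "" and [] all count as "not filled in".
            if record.get(field) in (None, "", []):
                problems.append(f"{label}: record {i} has no value for '{field}'")
    return problems


def check_stage_file(path, required_fields):
    """Load a stage's JSON file from disk and run the same checks on it."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return check_records(records, required_fields, label=str(path))
```

For example, `check_stage_file("data/enriched_abstracts.json", ["paper_id", "core_evidence"])` would be run before starting Stage 3, and the stage would proceed only if the returned list is empty.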

---

## Stage 1: Corpus Assembly

**Goal:** Build a corpus of high-impact open-access papers across biomedical subfields.

### Subfield Selection

Target subfields where hidden assumptions are most likely to cause real problems. Use these criteria:
- Heavy reliance on model systems or proxies (animal models, cell lines, biomarkers)
- Frequent translational claims extending basic science to clinical implications
- Complex, incompletely understood biology that invites simplifying assumptions
- Documented history of replication failures or clinical trial disappointments

**Default subfields (select 3–5 depending on scope):**

| Subfield | Why vulnerable |
|---|---|
| Neurodegeneration (Alzheimer's, Parkinson's) | Transgenic mouse models, surrogate biomarkers, contested frameworks like amyloid hypothesis, decades of clinical trial failures |
| Neuroinflammation | Assumptions about microglial function, BBB models, inflammatory markers that may not translate |
| Immuno-oncology | Mouse tumor models to human therapy, context-dependent tumor microenvironment |
| Psychiatric pharmacology | Neurotransmitter hypotheses increasingly questioned, behavioral animal models with debated construct validity |
| Gene therapy / AAV therapeutics | Assumptions about vector tropism, immune response, dose-response from limited early data |

### Paper Retrieval via PubMed Central APIs

Use the PMC Open Access subset, since these are the papers whose full text can be retrieved programmatically through the APIs below.

**Step 1: Search for papers using NCBI E-utilities.**

```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pmc&term={QUERY}+AND+open+access[filter]&retmax=50&sort=relevance&retmode=json
```

Example queries per subfield:
- Neurodegeneration: `"Alzheimer's disease" AND "mouse model" AND "therapeutic"`
- Neuroinflammation: `"neuroinflammation" AND "microglia" AND "mechanism"`
- Immuno-oncology: `("immuno-oncology" OR "immune checkpoint") AND "tumor model"`
- Psychiatric pharmacology: `("antidepressant" OR "SSRI") AND "animal model" AND "mechanism"`
- Gene therapy: `"AAV" AND "gene therapy" AND ("clinical" OR "translational")`

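The response returns the matching PMC record IDs in `esearchresult.idlist` as bare numbers. Extract the list from the JSON response and prefix each ID with `PMC` before using it in Step 2.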
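A minimal sketch of Step 1 using only the standard library follows. The function names are hypothetical. Two points are assumptions rather than part of the original instructions: wrapping the query in parentheses so the open-access filter applies to the whole expression, and (to my understanding of ESearch) that the `pmc` database returns bare numeric UIDs that need a `PMC` prefix.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, retmax=50):
    """Build the Step 1 search URL; urlencode escapes quotes, spaces and brackets."""
    params = {
        "db": "pmc",
        # Parentheses keep the filter bound to the whole query (an assumption).
        "term": f"({query}) AND open access[filter]",
        "retmax": retmax,
        "sort": "relevance",
        "retmode": "json",
    }
    return f"{ESEARCH}?{urlencode(params)}"


def extract_pmcids(esearch_json):
    """Turn the numeric UIDs in esearchresult.idlist into 'PMC'-prefixed IDs."""
    ids = esearch_json.get("esearchresult", {}).get("idlist", [])
    return [f"PMC{uid}" for uid in ids]


def search_pmc(query, retmax=50):
    """Run one search (this makes a network call) and return PMCIDs."""
    with urlopen(build_esearch_url(query, retmax), timeout=30) as resp:
        return extract_pmcids(json.load(resp))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check offline before spending any API quota.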

**Step 2: Retrieve full text using the BioC API.**

For each PMCID:
```
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{PMCID}/unicode
```

This returns structured JSON with sections (abstract, introduction, methods, results, discussion) already parsed. No PDF processing needed.

Rate limit: stay under 3 requests per second. Add a delay between requests.
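Step 2 and the rate limit can be sketched together as below. The function names are hypothetical, and the payload layout (collection, then `documents`, then `passages`, each passage carrying `text` and `infons["section_type"]`) is my reading of the BioC JSON format. Inspect one real response before relying on it.

```python
import json
import time
from urllib.request import urlopen

BIOC = ("https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/"
        "pmcoa.cgi/BioC_json/{pmcid}/unicode")


def passages_to_text(bioc):
    """Concatenate passage text from a BioC JSON payload, labeled by section type.

    Accepts either a single collection object or a list of collections.
    """
    collections = bioc if isinstance(bioc, list) else [bioc]
    chunks = []
    for collection in collections:
        for document in collection.get("documents", []):
            for passage in document.get("passages", []):
                text = passage.get("text", "").strip()
                if text:
                    section = passage.get("infons", {}).get("section_type", "UNKNOWN")
                    chunks.append(f"[{section}] {text}")
    return "\n".join(chunks)


def fetch_full_texts(pmcids, delay=0.5):
    """Fetch each paper in turn (network calls), pausing to stay under 3 requests/s."""
    texts = {}
    for pmcid in pmcids:
        with urlopen(BIOC.format(pmcid=pmcid), timeout=60) as resp:
            texts[pmcid] = passages_to_text(json.load(resp))
        time.sleep(delay)
    return texts
```

Labeling each chunk with its section type keeps the section boundaries visible in the concatenated `full_text`, which the Stage 2 prompt can then use.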

**Step 3: Filter and select.**

From the retrieved papers, select ~20 per subfield (or fewer if testing). Prioritize:
- Papers that make substantive claims (mechanistic, therapeutic, generalizable) over purely descriptive or methods papers
- Higher citation counts and reputable journals
- Papers from the last 5–10 years

Record metadata: title, authors, year, journal, DOI, PMCID, subfield.

**Output:** Save as `data/papers.json` — array of objects with fields: `id`, `title`, `authors`, `year`, `journal`, `doi`, `pmcid`, `subfield`, `abstract`, `full_text` (concatenated from BioC sections).
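The mechanical part of Step 3 (recency and citation ranking) might look like the sketch below. The `citation_count` field is hypothetical: it is not returned by the PMC APIs and would have to come from another source. The `min_year` default is also an assumption. Judging whether a paper makes substantive claims still needs a human or model pass and is not covered here.

```python
from collections import defaultdict


def select_papers(candidates, per_subfield=20, min_year=2016):
    """Keep recent candidates and take the most-cited `per_subfield` in each subfield."""
    by_subfield = defaultdict(list)
    for paper in candidates:
        if paper["year"] >= min_year:
            by_subfield[paper["subfield"]].append(paper)
    selected = []
    for subfield in sorted(by_subfield):
        # Papers without a citation count sort last rather than raising an error.
        ranked = sorted(by_subfield[subfield],
                        key=lambda p: p.get("citation_count", 0),
                        reverse=True)
        selected.extend(ranked[:per_subfield])
    return selected
```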

---

## Stage 2: Paper Compression into Enriched Abstracts

**Goal:** Compress each paper into a structured 500–800 word enriched abstract built for assumption analysis. This is not a general summary — it is an analytical intermediate preserving exactly the information needed for downstream reasoning analysis.

### Prompt Template

For each paper, use this prompt. Replace `{placeholders}` with actual content.

```
You are a research analyst. Your task is to compress a biomedical research paper
into a structured enriched abstract optimized for downstream reasoning analysis.

You must EXTRACT information faithfully — do not interpret, evaluate, or critique.
Preserve the authors' own framing and language for claims.

Here is the paper:
---
Title: {title}
Abstract: {abstract}
Full Text: {full_text}
---

Produce an enriched abstract (500–800 words) with exactly these five sections:

**1. Model System and Methods**
What organism, cell line, patient population, or model was used? What were the key
experimental conditions and techniques? Include enough detail to understand the
boundaries of what this evidence can support. Focus on the primary experimental
system and any choices that constrain interpretation.

**2. Core Evidence**
What were the specific findings? Report effect sizes, key measurements, quantitative
observations, and primary data points. Describe what was actually measured and
observed, not what was concluded from it. Focus on findings most central to the
main conclusion.

**3. Reasoning and Interpretation**
How do the authors connect their evidence to their conclusion? What explicit
justifications, mechanistic arguments, or logical steps do they articulate in the
discussion? Include causal reasoning, comparisons to prior work, or theoretical
frameworks they invoke.

**4. Conclusion and Claim Scope**
State the paper's main claim in one or two sentences. Then characterize the scope:
do the authors limit their claim to the specific experimental context, or do they
generalize to broader biology, other species, clinical relevance, or therapeutic
potential? Quote or closely paraphrase the key concluding statement.

**5. Acknowledged Limitations**
What caveats do the authors themselves identify? If no limitations are explicitly
stated, write "No limitations explicitly acknowledged."

Think step by step through each section. For each section, first identify the
relevant passages in the paper, then extract the key information, then compose
the section.
```

### Few-Shot Example (include with the prompt for calibration)

```
EXAMPLE OUTPUT for a hypothetical paper on TREM2 and tau pathology:

**1. Model System and Methods**
The study used PS19 transgenic mice expressing human P301S mutant tau, crossed
with TREM2 knockout mice, aged to 9 months. Brain tissue was analyzed via
immunohistochemistry for phosphorylated tau (AT8), microglial activation (Iba1),
and synaptic density (synaptophysin). Behavioral assessment used Morris water maze.

**2. Core Evidence**
TREM2-deficient PS19 mice showed 2.3-fold greater AT8-positive tau burden in
hippocampus vs controls (p<0.01). Microglial density reduced by 40% in TREM2-KO.
Phagocytic activity decreased 45% in vitro (p<0.05). Synaptophysin reduced 35%
in CA1. TREM2-KO mice showed impaired spatial memory (latency +15s, p<0.01).

**3. Reasoning and Interpretation**
Authors argue TREM2 is required for microglial clearance of pathological tau. They
cite prior work on TREM2's role in amyloid clearance and propose an analogous
mechanism. Behavioral deficit interpreted as consequence of hippocampal tau burden
and synapse loss.

**4. Conclusion and Claim Scope**
Concludes "TREM2 plays a protective role against tau-mediated neurodegeneration
through microglial phagocytic clearance." Scope extends beyond mouse model —
authors state this "supports TREM2 as a potential therapeutic target for
tauopathies including Alzheimer's disease."

**5. Acknowledged Limitations**
Authors note PS19 overexpresses mutant tau at non-physiological levels, TREM2-KO
is global rather than microglia-specific, and soluble tau species were not measured.
The paper does not discuss whether mouse microglial responses parallel those in humans.
```

**Output:** Save as `data/enriched_abstracts.json` — array with fields: `paper_id`, `model_system_and_methods`, `core_evidence`, `reasoning_and_interpretation`, `conclusion_and_claim_scope`, `acknowledged_limitations`, `full_enriched_abstract`.
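To turn the model's reply into the record described above, one option is to split on the five bold headings. This is a sketch that assumes the reply reproduces the `**N. Title**` heading format from the prompt exactly. The function names are hypothetical, and the word-count check uses a plain whitespace split.

```python
import re

SECTION_KEYS = [
    "model_system_and_methods",
    "core_evidence",
    "reasoning_and_interpretation",
    "conclusion_and_claim_scope",
    "acknowledged_limitations",
]

# Matches headings such as "**3. Reasoning and Interpretation**" on their own line.
HEADING = re.compile(r"^\*\*(\d)\.\s+[^*]+\*\*[ \t]*$", re.MULTILINE)


def parse_enriched_abstract(paper_id, text):
    """Split model output into the five fields; raise if headings are not 1-5 in order."""
    matches = list(HEADING.finditer(text))
    if [m.group(1) for m in matches] != ["1", "2", "3", "4", "5"]:
        raise ValueError(f"{paper_id}: expected sections 1-5 in order")
    record = {"paper_id": paper_id, "full_enriched_abstract": text.strip()}
    for i, (key, match) in enumerate(zip(SECTION_KEYS, matches)):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        record[key] = text[match.end():end].strip()
    return record


def word_count_ok(text, low=500, high=800):
    """Check the 500-800 word target with a simple whitespace split."""
    return low <= len(text.split()) <= high
```

Raising on a malformed reply, rather than guessing at section boundaries, makes it obvious which papers need to be re-run through the Stage 2 prompt.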

---

## Stage 3: Evidence and Conclusion Extraction

**Goal:** Isolate the core evidence-to-conclusion pair for each paper.

### Prompt Template

```
You are extracting the core evidence-to-conclusion relationship from an enriched
abstract of a biomedical paper.

Here is the enriched abstract:
---
{enriched_abstract}
---

Extract the following. Think step by step — first identify the main claim, then
trace what evidence directly supports it, then classify.

1. **Primary conclusion:** The main claim as a single clear sentence.

2. **Supporting evidence:** Specific findings that support this conclusion.
Be concrete — include numbers, measurements, observations.

3. **Evidence type:** One of: in vitro / in vivo animal model / human
observational / human experimental / computational / mixed (specify).

4. **Conclusion type:** One of: mechanistic / therapeutic / prognostic /
general biological principle / methodological / mixed (specify).

5. **Claim scope:** One of:
- narrow: limited to specific experimental context
- moderate: extends somewhat beyond data with hedging
- broad: makes general, translational, or clinical claims
```

**Output:** Save as `data/evidence_conclusion_pairs.json` — array with fields: `paper_id`, `primary_conclusion`, `supporting_evidence`, `evidence_type`, `conclusion_type`, `claim_scope`.
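Because Stages 4 and 7 branch on these labels, it may help to validate them before saving. The sketch below is one way to do that. The function name is hypothetical, and the prefix match (which lets values such as "mixed (in vitro + computational)" pass) is a design assumption.

```python
EVIDENCE_TYPES = ("in vitro", "in vivo animal model", "human observational",
                  "human experimental", "computational", "mixed")
CONCLUSION_TYPES = ("mechanistic", "therapeutic", "prognostic",
                    "general biological principle", "methodological", "mixed")
CLAIM_SCOPES = ("narrow", "moderate", "broad")


def validate_pair(pair):
    """Return a list of problems with one evidence-conclusion record (empty = valid)."""
    problems = []
    checks = (("evidence_type", EVIDENCE_TYPES),
              ("conclusion_type", CONCLUSION_TYPES),
              ("claim_scope", CLAIM_SCOPES))
    for field, allowed in checks:
        value = str(pair.get(field, "")).strip().lower()
        # str.startswith accepts a tuple, so any allowed label may open the value.
        if not value.startswith(allowed):
            problems.append(f"{field}: unexpected value {value!r}")
    if not str(pair.get("primary_conclusion", "")).strip():
        problems.append("primary_conclusion: empty")
    return problems
```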

---

## Stage 4: Assumption Generation via Conditions of Validity

**Goal:** Identify hidden assumptions by generating what must be true for the conclusion to follow from the evidence, then checking which conditions the paper addresses.

This is the core analytical stage. The agent does not look for mistakes or obvious errors; it generates dependencies (conditions that must hold for the evidence-to-conclusion bridge to be sound) and identifies which ones are unstated.

### Prompt Template

```
You are a scientist from OUTSIDE this research field reviewing a biomedical paper.
You do not share this field's default assumptions. Your task is to identify what
must be true — but is not established in this paper — for the conclusion to follow
from the evidence.

Here is the evidence-conclusion pair:
---
Primary conclusion: {conclusion}
Supporting evidence: {evidence}
Evidence type: {evidence_type}
Conclusion type: {conclusion_type}
Claim scope: {claim_scope}
Model system: {model_system_from_enriched_abstract}
Acknowledged limitations: {limitations_from_enriched_abstract}
---

STEP 1: GENERATE CONDITIONS OF VALIDITY

Ask yourself: "What conditions must be true for this specific conclusion to hold,
given only this specific evidence?"

Generate conditions across these dimensions:
- Biological: Does the proposed mechanism operate as assumed? Alternative explanations?
- Translational: Does this transfer across species, models, populations?
- Temporal: Would this hold over longer time periods or developmental stages?
- Contextual: Would different conditions, genetic backgrounds, or environments change it?
- Measurement: Does the proxy actually reflect the construct being studied?
- Mechanistic: Is the proposed causal pathway the best explanation, or could others
produce the same observation?

Number of conditions to generate:
- Narrow claim scope: 5–8 conditions
- Moderate claim scope: 8–12 conditions
- Broad claim scope: 12–18 conditions

STEP 2: FILTER FOR SUBTLETY

This step is critical. REMOVE any condition that:
- A competent peer reviewer in the field would obviously catch
- Is already acknowledged in the paper's limitations section
- Is trivially true for all research ("equipment must be calibrated")
- Is a surface-level methodological concern ("sample size could be larger")

KEEP conditions that:
- Reflect premises the field takes for granted but has not rigorously established
- Would only be visible to someone outside the field's conventional thinking
- Represent genuine dependencies where, if the condition turned out to be false,
the conclusion would not hold
- Are the kind of assumption that could explain why a finding fails to replicate
in a different context

Think about it this way: if this assumption turned out to be wrong, would the
entire line of research be undermined? Those are the ones that matter.

STEP 3: CLASSIFY EACH REMAINING CONDITION

For each condition:
- Status: Is this addressed in the paper? (established / hidden assumption)
- Category: translational / causal / generalization / temporal / mechanistic /
measurement / contextual
- Reasoning: One sentence on why this condition matters for the conclusion
- Importance: high / medium / low — based on how consequential it would be if
this assumption turned out to be false

Focus your output on the HIGH and MEDIUM importance assumptions.

Output as a structured list.
```

### Prompting Guidance

The "outside scientist" framing is the most important design choice. It counteracts the tendency to adopt the default assumptions of whatever field is being analyzed. Without this frame, the agent will generate assumptions the field already knows about and considers obvious.

Chain-of-thought reasoning is essential — the step-by-step process is what produces the analytical depth needed for non-obvious assumptions. Do not skip the thinking steps.

The subtlety filter prevents the output from being cluttered with trivial observations. The goal is to surface the assumptions that are invisible because they are shared by the community — not the ones that are visible but ignored.

**Output:** Save as `data/assumptions_per_paper.json` — array with fields: `paper_id`, `subfield`, `conditions_of_validity` (array of objects with `condition`, `status`, `category`, `reasoning`, `importance`).
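Two mechanical parts of this stage can be checked in code: the scope-dependent number of conditions generated in Step 1, and the selection of hidden assumptions passed on to Stage 5. The sketch below uses hypothetical helper names. It assumes `status` begins with "hidden" for hidden assumptions, as in the prompt's "established / hidden assumption" choice. The count ranges apply before the subtlety filter, so the saved list may legitimately be shorter.

```python
CONDITION_RANGES = {"narrow": (5, 8), "moderate": (8, 12), "broad": (12, 18)}


def generated_count_in_range(claim_scope, n_generated):
    """Check the number of Step 1 conditions against the range for the claim scope."""
    low, high = CONDITION_RANGES[claim_scope]
    return low <= n_generated <= high


def hidden_assumptions(paper_record, levels=("high", "medium")):
    """Return conditions marked as hidden assumptions at the given importance levels."""
    return [
        c for c in paper_record["conditions_of_validity"]
        if c["status"].strip().lower().startswith("hidden")
        and c["importance"].strip().lower() in levels
    ]
```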

---

## Stage 5: Cross-Paper Blind Spot Detection

**Goal:** Find hidden assumptions shared across multiple papers within a subfield — the systemic blind spots. Focus on the most important assumptions.

### Prompt Template

```
You are analyzing hidden assumptions identified across multiple papers in the
subfield of {subfield}.

Here are the hidden assumptions, grouped by paper (only HIGH and MEDIUM importance):
---
{list of hidden assumptions with paper_ids}
---

STEP 1: SEMANTIC CLUSTERING

Group assumptions that express the same underlying dependency, even if worded
differently. For example:
- "The APP/PS1 mouse model recapitulates human AD amyloid pathology" and
"Transgenic amyloid mice reflect human disease progression" = SAME dependency
- "Microglial activation is inherently neuroinflammatory" and "Activated microglia
contribute to neurodegeneration" = SIMILAR but check if core premise is the same

Be conservative: only cluster assumptions that genuinely share the same core premise.

STEP 2: SUMMARIZE AND RANK

For each cluster:
- Write a single clear statement capturing the shared assumption (specific enough
to be testable)
- List which papers share it
- Calculate frequency (count and percentage of papers in the subfield)

STEP 3: FLAG SYSTEMIC BLIND SPOTS

Mark any cluster appearing in more than 20% of papers as a systemic blind spot.
Prioritize clusters that are both high-frequency AND high-importance.

STEP 4: CROSS-SUBFIELD CHECK

Note if any blind spots appear across multiple subfields — these represent
field-wide vulnerabilities in biomedical research practice.
```

**Output:** Save as `data/shared_assumptions.json` — object keyed by subfield, each containing an array of assumption clusters with fields: `assumption_summary`, `category`, `paper_ids`, `frequency_count`, `frequency_percent`, `is_systemic_blind_spot`, `importance`, `also_appears_in_subfields`.
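The frequency and threshold arithmetic from Steps 2 and 3 is simple enough to compute deterministically rather than leave to the model. A minimal sketch with a hypothetical function name, assuming each cluster already carries its `paper_ids`:

```python
def flag_blind_spots(clusters, n_papers_in_subfield, threshold_percent=20.0):
    """Fill in frequency fields and flag clusters shared by more than the threshold."""
    flagged = []
    for cluster in clusters:
        count = len(set(cluster["paper_ids"]))  # ignore repeated paper IDs
        percent = 100.0 * count / n_papers_in_subfield
        flagged.append({
            **cluster,
            "frequency_count": count,
            "frequency_percent": round(percent, 1),
            # Strictly more than 20%: 4 of 20 papers does not qualify, 5 of 20 does.
            "is_systemic_blind_spot": percent > threshold_percent,
        })
    return flagged
```

With 20 papers in a subfield, a cluster therefore needs at least 5 distinct papers to count as a systemic blind spot.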

---

## Stage 6: Verification Status Assessment

**Goal:** For each systemic blind spot, do a surface-level search to determine whether it has been independently tested.

This is a preliminary assessment — not a comprehensive systematic review. The goal is to classify each assumption and identify the most critical research gaps.

### Prompt Template

```
You are assessing whether a hidden assumption commonly held across biomedical
research papers has been independently verified.

The assumption is:
---
"{assumption_summary}"
Subfield: {subfield}
Shared by {frequency} papers in the corpus.
---

STEP 1: FORMULATE SEARCH TERMS

What specific search terms would locate studies that DIRECTLY test or validate
this assumption? Target validation studies, systematic comparisons, or dedicated
investigations — not papers that merely rely on the same assumption.

STEP 2: SEARCH AND CLASSIFY

Based on available evidence from PubMed and related databases, classify:

- Verified: Direct, robust evidence supports this assumption (systematic reviews,
multiple validation studies).
- Contested: Evidence is mixed — some studies support, others challenge, or it
holds only under certain conditions.
- Untested: No study has directly examined this. Taken for granted by the field.

STEP 3: DOCUMENT

- Verification status: [verified / contested / untested]
- Evidence summary: [2–3 sentences on what evidence exists or does not exist]
- Key references: [list relevant studies if found]
- Research priority: [low / medium / high — based on how many papers depend on
this assumption and how consequential it would be if wrong]
```

For the search step, use web search tools if available. If not, use the agent's training knowledge to assess whether the assumption has been studied, and note that verification is based on existing knowledge rather than a live literature search.

**Output:** Save as `data/verification_report.json` — array with fields: `assumption_summary`, `subfield`, `verification_status`, `evidence_summary`, `key_references`, `research_priority`.
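The status breakdown reported in Stage 7 can be derived directly from this file. The sketch below uses a hypothetical function name and assumes every entry uses one of the three status labels from the prompt; it raises on anything else so that mislabeled entries surface early.

```python
from collections import Counter

STATUSES = ("verified", "contested", "untested")


def verification_breakdown(report):
    """Count and percentage of blind spots in each verification status."""
    counts = Counter(entry["verification_status"].strip().lower() for entry in report)
    unknown = set(counts) - set(STATUSES)
    if unknown:
        raise ValueError(f"unexpected verification status: {sorted(unknown)}")
    total = sum(counts.values())
    return {
        status: {
            "count": counts[status],
            "percent": round(100.0 * counts[status] / total, 1) if total else 0.0,
        }
        for status in STATUSES
    }
```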

---

## Stage 7: Paper and Data Generation

**Goal:** Produce the final research paper and summary data.

### Data Outputs

Save all JSON files from prior stages to the `data/` directory. Additionally, generate:

`data/summary_statistics.csv` — Aggregated counts:
- Total papers per subfield
- Total assumptions generated, total hidden assumptions
- Hidden assumptions per paper (mean, by subfield)
- Category distribution (translational, causal, generalization, temporal, mechanistic, measurement, contextual)
- Number of systemic blind spots per subfield
- Verification status breakdown (verified / contested / untested counts and percentages)
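Part of this aggregation is sketched below: papers per subfield, mean hidden assumptions per paper, and the overall category distribution. The long-format layout (`metric`, `subfield`, `value`) and the function names are assumptions, since the skill does not fix a CSV schema. The remaining rows (blind spots per subfield, verification breakdown) would be appended in the same way.

```python
import csv
from collections import Counter, defaultdict


def summarize(assumptions_per_paper):
    """Build (metric, subfield, value) rows from assumptions_per_paper.json records."""
    hidden_counts = defaultdict(list)
    categories = Counter()
    for paper in assumptions_per_paper:
        hidden = [c for c in paper["conditions_of_validity"]
                  if c["status"].strip().lower().startswith("hidden")]
        hidden_counts[paper["subfield"]].append(len(hidden))
        categories.update(c["category"] for c in hidden)
    rows = []
    for subfield, counts in sorted(hidden_counts.items()):
        rows.append(("papers", subfield, len(counts)))
        rows.append(("mean_hidden_per_paper", subfield,
                     round(sum(counts) / len(counts), 1)))
    total = sum(categories.values())
    for category, n in categories.most_common():
        rows.append((f"category_percent:{category}", "all",
                     round(100.0 * n / total, 1)))
    return rows


def write_summary_csv(rows, path="data/summary_statistics.csv"):
    """Write the rows in long format with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["metric", "subfield", "value"])
        writer.writerows(rows)
```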

### Paper Structure

Generate the paper as `paper.md`. Follow this structure exactly:

**Title:** "Mapping Hidden Assumptions in Biomedical Research: An AI-Driven Framework for Identifying Unstated Dependencies Between Evidence and Conclusions"

**Abstract (250–300 words):**
1. Problem: reproducibility crisis + underexplored role of implicit assumptions (2–3 sentences)
2. Objective: what this study does (1–2 sentences)
3. Methods: the pipeline, corpus size, subfields analyzed (2–3 sentences)
4. Key findings: total hidden assumptions, most common categories, systemic blind spots, verification status breakdown (3–4 sentences)
5. Significance: scalable framework, concrete research priorities (1–2 sentences)

**1. Introduction (800–1200 words):**

Build the argument in this order:

1.1 The reproducibility crisis is real and costly.
- >50% of preclinical research irreproducible (Freedman et al., 2015)
- ~$28B/year on irreproducible preclinical research in US alone
- 72% of biomedical researchers agree crisis exists (Cobey et al., 2024)
- >70% of scientists have tried and failed to replicate others' work (Baker, 2016)

1.2 Known causes have been studied but don't fully explain the problem.
- Statistical misuse, publication bias, reagent issues (>32,000 papers affected by cell line misidentification per Horbach & Halffman, 2017)
- These have received significant attention

1.3 Hidden assumptions are an underexplored mechanism.
- Conclusions depend on unstated premises shared across a community
- Peer review cannot catch them because reviewers share the same assumptions
- Landmark examples: mouse inflammation models (Seok et al., 2013), cell line misidentification, intrinsically disordered proteins
- Internal validity was fine — the problem was unstated premises

1.4 No systematic framework exists.
- Philosophers described the problem conceptually; no one has applied a scalable method
- AI advantage: trained across fields, doesn't share any single field's default assumptions

1.5 This study.
- Research question: To what extent do conclusions in biomedical research papers depend on implicit assumptions not directly supported by the reported evidence?
- Objective and brief preview

**2. Methods (1000–1500 words):**

Present each pipeline stage clearly: subfield selection criteria, corpus assembly via PMC APIs, enriched abstract compression (explain it's an analytical intermediate, not a summary), evidence-conclusion extraction, conditions-of-validity generation (explain the "outside scientist" frame, the subtlety filter, scaling by claim scope), cross-paper clustering, verification status assessment. Reference the data files.

**3. Results (1500–2500 words):**

3.1 Corpus overview: papers analyzed, breakdown by subfield, evidence/conclusion type distributions
3.2 Assumption generation: total generated, total hidden, average per paper by subfield, category distributions, correlation between claim scope and assumption count
3.3 Systemic blind spots by subfield: top 3–5 per subfield with concrete examples from specific papers
3.4 Cross-subfield patterns: dominant categories, assumptions shared across subfields
3.5 Verification status: percentage verified/contested/untested, highlight the most important untested and contested assumptions with 2–3 detailed examples

**4. Discussion (1000–1500 words):**

4.1 Summary of key findings
4.2 Why hidden assumptions persist (shared conventions, peer review selects for insiders, incentive structures)
4.3 Value of the framework (first systematic attempt, generalizable, AI as assumption detector)
4.4 Implications:
- Researchers: explicitly state assumptions conclusions depend on
- Reviewers: use conditions-of-validity checklist
- Funders: untested assumptions = concrete research priorities
- Journals: consider requiring an "assumptions" section
4.5 Limitations (be honest and specific):
- AI may generate false positive assumptions
- Enriched abstracts may lose nuance
- Clustering involves judgment calls
- Verification assessment is surface-level
- Corpus is a sample, not comprehensive
- Framework identifies assumptions, does not evaluate their correctness
4.6 Future directions: larger corpora, expert validation, longitudinal tracking, tool for researchers

**5. Conclusion (200–300 words):**
Restate problem, key finding, contribution, and frame as contribution to science's self-correcting nature — not a critique of individual researchers.

**References:**
Include at minimum:
- Baker, M. (2016). Nature, 533, 452–454.
- Cobey, K. D., et al. (2024). PLOS Biology, 22(11), e3002870.
- Freedman, L. P., et al. (2015). PLOS Biology, 13(6), e1002165.
- Horbach, S. P. J. M., & Halffman, W. (2017). PLOS ONE, 12(10), e0186281.
- Ioannidis, J. P. A. (2005). PLOS Medicine, 2(8), e124.
- Munafò, M. R., et al. (2017). Nature Human Behaviour, 1, 0021.
- Seok, J., et al. (2013). PNAS, 110(9), 3507–3512.
Add all additional references found during verification assessment.

**Formatting notes:**
- Use hedged language: "suggests," "may contribute," "our analysis indicates"
- Reference data files inline so readers can trace claims
- Include at minimum: Table 1 (corpus overview), Table 2 (assumption category distribution by subfield), Table 3 (top systemic blind spots with verification status)

---

## Quality Checks

After completing the pipeline, verify:

1. **Coverage:** Every paper has an enriched abstract, evidence-conclusion pair, and assumption list.
2. **Consistency:** Assumption categories use the defined taxonomy across all papers.
3. **Calibration:** Spot-check 3–5 papers per subfield. Do the conditions of validity make sense? Are classifications reasonable?
4. **Non-triviality:** Systemic blind spots should be genuine field-level dependencies, not obvious methodological limitations.
5. **Verification specificity:** Search queries target the actual assumption, not a vaguely related topic.
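Checks 1 and 2 are mechanical and can be automated; the others need judgment. A minimal sketch follows, with hypothetical function names, assuming string paper IDs and the field names from the stage outputs above.

```python
CATEGORIES = {"translational", "causal", "generalization", "temporal",
              "mechanistic", "measurement", "contextual"}


def coverage_gaps(papers, enriched_abstracts, pairs, assumptions):
    """Map each stage name to the paper IDs that lack an output from that stage."""
    expected = {p["id"] for p in papers}
    stages = {
        "enriched_abstracts": enriched_abstracts,
        "evidence_conclusion_pairs": pairs,
        "assumptions_per_paper": assumptions,
    }
    return {
        name: sorted(expected - {r["paper_id"] for r in records})
        for name, records in stages.items()
    }


def off_taxonomy(assumptions):
    """List (paper_id, category) pairs whose category is outside the Stage 4 taxonomy."""
    return [
        (record["paper_id"], condition["category"])
        for record in assumptions
        for condition in record["conditions_of_validity"]
        if condition["category"].strip().lower() not in CATEGORIES
    ]
```

The pipeline passes these two checks when every list returned by `coverage_gaps` is empty and `off_taxonomy` returns no pairs.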

---

## Troubleshooting

**Full text not available for a paper:** Use the abstract plus any accessible information, and note this in the enriched abstract. This is acceptable for a small portion of the corpus; flag it in the paper's limitations.

**Too few papers in a subfield:** Broaden search terms or extend time window. If fewer than 10 usable papers, consider replacing the subfield (alternatives: microbiome-gut-brain axis, CAR-T cell therapy, epigenetic therapies).

**Assumptions feel too generic:** Re-run Stage 4 with more emphasis on the specific evidence and model system. Generic assumptions like "the model may not reflect reality" are not useful — push for specifics about which aspect of reality and why it matters.

**Clustering is noisy in Stage 5:** Use a two-pass approach — generate candidate clusters, then refine by merging or splitting with explicit rationale. Focus on high-importance assumptions only.

**API rate limits:** Add a 0.5 s delay between PMC API requests to stay under 3 requests per second. If a request fails, retry after 2 seconds.
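The delay-and-retry rule can be wrapped around any fetch function, as in the sketch below. The name `fetch_with_retry`, the cap of three attempts, and the injectable `sleep` parameter (there so the logic can be exercised without waiting) are assumptions and not part of the original instructions.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=0.5, retry_wait=2.0, sleep=time.sleep):
    """Call fetch(url) after a `delay` pause; on failure, wait `retry_wait` and retry."""
    last_error = None
    for attempt in range(retries):
        # First attempt uses the normal pacing delay; later attempts back off longer.
        sleep(delay if attempt == 0 else retry_wait)
        try:
            return fetch(url)
        except OSError as error:  # urllib.error.URLError and timeouts subclass OSError
            last_error = error
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

Here `fetch` can be any callable that takes a URL and returns the parsed response, for example a small wrapper around `urllib.request.urlopen`.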