ORVS: A Post-Generation Verification Loop with Corpus-Curated Retrieval for Rheumatology AI — Methods, Internal Evaluation, and Limitations
What This System Does
- Generate a candidate clinical response (GPT-4o with rheumatology system prompt)
- Score it on 4 dimensions via a separate evaluator prompt (also GPT-4o)
- If below threshold: provide specific feedback ("missing renal dose adjustment", "no TB screening before biologic") and regenerate
- Maximum 3 regeneration cycles
- Supply retrieved passages from a curated rheumatology corpus to augment generation context
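The loop above can be sketched in a few lines. Everything here is an illustrative stand-in: `generate`, `evaluate`, the 8.0 threshold, and the stubbed scoring logic are hypothetical placeholders for the actual GPT-4o prompts and cutoff, which the paper does not specify at this level of detail.

```python
MAX_CYCLES = 3   # maximum regeneration cycles, per the pipeline description
THRESHOLD = 8.0  # illustrative composite-score cutoff, not the actual ORVS value

def generate(prompt, feedback=None):
    # Stand-in for the GPT-4o generation call; evaluator feedback is
    # folded into the prompt on regeneration.
    suffix = f" [revised per: {feedback}]" if feedback else ""
    return f"response to: {prompt}{suffix}"

def evaluate(response):
    # Stand-in for the separate GPT-4o evaluator prompt.
    # Returns (composite score, specific feedback string or None).
    score = 8.5 if "[revised" in response else 7.5
    feedback = None if score >= THRESHOLD else "missing renal dose adjustment"
    return score, feedback

def verified_response(prompt):
    feedback = None
    for cycle in range(MAX_CYCLES):
        response = generate(prompt, feedback)
        score, feedback = evaluate(response)
        if score >= THRESHOLD:
            return response, score, cycle + 1
    return response, score, MAX_CYCLES  # best effort after the final cycle

resp, score, cycles = verified_response("RA patient starting methotrexate")
```

The key design point is that the evaluator returns *specific* feedback, not just a score, so the regeneration prompt can target the identified gap.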
Verification Dimensions
- Clinical Accuracy (weight 0.30): diagnosis, classification criteria, cited evidence
- Safety (weight 0.30): contraindications, drug interactions, urgent escalation
- Therapeutic Management (weight 0.20): dosing, monitoring, temporal milestones
- Resource Stewardship (weight 0.20): proportionate investigation, full therapeutic options
Composite = 0.30·CLA + 0.30·SAF + 0.20·TMP + 0.20·RSC
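As a minimal sketch, the composite is just the weighted sum above (the 0-10 scale for each dimension is an assumption inferred from the results table, not stated here):

```python
# Weights from the verification-dimension list above.
WEIGHTS = {"CLA": 0.30, "SAF": 0.30, "TMP": 0.20, "RSC": 0.20}

def composite(scores):
    """Weighted composite of the four dimension scores (0-10 scale assumed)."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

c = composite({"CLA": 9.0, "SAF": 8.0, "TMP": 8.5, "RSC": 9.0})
print(c)  # 0.3*9.0 + 0.3*8.0 + 0.2*8.5 + 0.2*9.0 = 8.6
```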
Retrieval: Why PCA Rotation Matters
The claim is NOT that reducing bit precision improves retrieval. The mechanism:
- General-purpose embeddings distribute representational capacity across all language
- Rheumatological distinctions (e.g., GPA vs EGPA, class III vs IV lupus nephritis) occupy a small region of this space
- PCA on the rheumatology corpus identifies which dimensions capture the most domain-specific variance
- Allocating higher bit precision to high-variance dimensions preserves fine clinical distinctions during compression
- Random rotation (the generic approach) distributes variance uniformly, destroying this concentrated signal
Result: 95% recall@10 with corpus-curated PCA vs 87% with random rotation on the same corpus. The improvement comes from better preservation of domain-relevant information, not from compression itself.
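The mechanism can be sketched as follows. This is a toy illustration under stated assumptions: the synthetic embeddings, the 64-dim size, and the variance-proportional bit budget are all hypothetical; the paper states only that high-variance (domain-specific) dimensions receive higher precision, not this exact allocation rule.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))  # stand-in for corpus embeddings
X[:, :4] *= 5.0                  # a few dimensions carry most domain-specific variance

# PCA rotation: eigenvectors of the corpus covariance, sorted by variance.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
R = eigvecs[:, order]            # orthogonal rotation; columns are components
var = eigvals[order]             # per-component variance, descending

Z = Xc @ R                       # rotated embeddings: domain variance now concentrated

# Illustrative variance-proportional bit allocation within a fixed budget:
# high-variance components get more bits, so fine clinical distinctions
# survive compression. A random rotation would spread variance uniformly,
# forcing a flat allocation and losing the concentrated signal.
budget = 64 * 4                  # e.g. 4 bits/dim on average (assumed)
bits = np.maximum(1, np.round(budget * var / var.sum()).astype(int))
```

The design choice being argued for is where the rotation comes from: fitting it on the rheumatology corpus (rather than using a random rotation) is what concentrates domain-relevant variance into the components that get the most precision.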
Evaluation Design and Limitations
Scenarios: 125 clinical vignettes across 7 protocols, designed by a board-certified rheumatologist (EAZ-T). These are constructed scenarios, not real patient encounters.
Scoring: All scoring performed by GPT-4o evaluator prompts, NOT by blinded human rheumatologists. This is a significant limitation — LLM-as-judge has known biases including preference for verbose responses and difficulty detecting subtle clinical errors.
Hallucination measurement: Manual review of factual claims against PubMed/UpToDate for 40 responses (subset). Hallucination defined as: fabricated citation, incorrect drug name/dose, wrong classification criteria, or invented clinical trial. Found in 12-15% of unaugmented responses, under 2% of verified responses. This is a small sample with no inter-rater reliability assessment.
Statistical reporting: In a previous version we reported "posterior probability 0.89 with 95% CI 0.82-0.94" which incorrectly conflates Bayesian credible intervals with frequentist confidence intervals. The correct statement: Bayesian estimation with weakly informative priors yields a posterior mean difference of +0.72 points (95% credible interval +0.02 to +1.38) favoring the combined system. The probability that the combined system exceeds unaugmented by at least 0.25 points is 0.89.
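A normal approximation makes the corrected statement easy to check. Assuming the posterior over the mean difference is roughly Gaussian with the reported 95% credible interval (+0.02, +1.38), the implied tail probability past +0.25 comes out near the reported 0.89 (the actual posterior need not be Gaussian, so small discrepancies are expected):

```python
from math import erf, sqrt

# Gaussian approximation to the posterior over the mean difference,
# reconstructed from the reported 95% credible interval.
lo, hi = 0.02, 1.38
mean = (lo + hi) / 2        # 0.70, close to the reported posterior mean +0.72
sd = (hi - lo) / (2 * 1.96) # half-width divided by z_0.975

def p_exceeds(threshold):
    """P(difference > threshold) under the Gaussian approximation."""
    z = (mean - threshold) / sd
    return 0.5 * (1 + erf(z / sqrt(2)))

p = p_exceeds(0.25)  # comes out near the reported 0.89
```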
Self-correction limitation: The verification loop uses GPT-4o to evaluate GPT-4o. Huang et al. (2024) demonstrated that LLMs often cannot correct their own factual errors without external grounding. Our system partially mitigates this by providing external retrieved passages as grounding, but the evaluator itself may miss errors it would also generate. Independent human evaluation is needed.
Sample size: 125 scenarios across 7 protocols provides limited statistical power. Effect sizes are modest (mean difference +0.72 on a 10-point scale). We cannot exclude the possibility that observed improvements are partially attributable to evaluation noise.
Benchmarks are internal. All evaluation protocols were designed and executed by the authors. No external validation or replication exists, and we claim no results on any public benchmark.
Results (with caveats)
| Mode | Mean Score | SD | n |
|---|---|---|---|
| Unaugmented GPT-4o | 8.18 | 0.45 | 125 |
| Verification only | 8.33 | 0.37 | 125 |
| Retrieval only (naive) | 7.92 | 0.95 | 25 |
| Combined (curated retrieval + verification) | 8.90 | 0.10 | 25 |
The naive retrieval result (7.92, worse than baseline) demonstrates that adding retrieval with general-purpose embeddings can harm specialist performance. The curated retrieval result (8.90) suggests but does not prove that domain-adapted retrieval resolves this. The sample sizes differ across conditions, limiting direct comparison.
What We Do Not Claim
- This is not independently validated
- Scoring by LLM evaluator is not equivalent to expert clinical judgment
- Self-correction by the same model has inherent limitations
- The sample size is insufficient for definitive conclusions
- No formal comparison against alternative approaches (human-in-the-loop, ensemble models, different LLMs)
Authors
Zamora-Tehozol EA, DNAI, Meléndez-Córdoba A, Hernández-Gutiérrez RA, Arzápalo-Metri JI
References
[1] Lewis P et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS 2020;33:9459-9474.
[2] Madaan A et al. Self-refine: iterative refinement with self-feedback. NeurIPS 2023;36:46534-46594.
[3] Huang J et al. Large language models cannot self-correct reasoning yet. ICLR 2024.
[4] Barnett S et al. Seven failure points when engineering a retrieval augmented generation system. arXiv:2401.05856, 2024.
[5] Chillotti I et al. TFHE: Fast fully homomorphic encryption over the torus. J Cryptol 2020;33:34-91.
[6] Singhal K et al. Large language models encode clinical knowledge. Nature 2023;620:172-180.