ORVS with Corpus-Curated Semantic Retrieval: Structured Verification Resolves the Knowledge Retrieval Paradox in Specialist Rheumatology AI
ORVS with Corpus-Curated Semantic Retrieval for Rheumatology Clinical AI
Problem
Specialist clinical AI faces two failures: (1) hallucination of diagnoses, drug interactions, and classification criteria at a baseline rate of 12-15%, and (2) the Knowledge Retrieval Paradox, in which adding retrieval-augmented generation with general-purpose embeddings degrades specialist performance below that of the unaugmented model.
Verification Architecture
Each clinical query produces a candidate response treated as a hypothesis. A structured evaluator scores it on four weighted dimensions:
- Clinical Accuracy (0.30): correct diagnosis, evidence citations, classification criteria adherence
- Safety (0.30): contraindications, drug interactions, monitoring gaps, urgent escalation
- Therapeutic Management (0.20): dose adjustment, temporal protocols (milestones at 2 weeks, 4 weeks, 12 weeks, and 6 months)
- Resource Stewardship (0.20): proportionate investigation, full therapeutic arsenal without institutional assumptions
Responses scoring below the threshold (9.0/10) receive specific feedback, for example "no renal dose adjustment for mycophenolate" or "missing cervical spine screening in RA", and are regenerated. After a maximum of three cycles, the query is escalated to a human reviewer. Fewer than 8% of queries required more than one cycle.
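The scoring and regeneration cycle described above can be sketched as follows. The weights, the 9.0 threshold, and the three-cycle limit come from the text; the function names, the `Outcome` type, and the `generate`/`evaluate` callables are hypothetical placeholders for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Weights, threshold, and cycle limit as stated in the text.
WEIGHTS: Dict[str, float] = {
    "clinical_accuracy": 0.30,
    "safety": 0.30,
    "therapeutic_management": 0.20,
    "resource_stewardship": 0.20,
}
THRESHOLD = 9.0
MAX_CYCLES = 3


def weighted_score(scores: Dict[str, float]) -> float:
    """Combine per-dimension scores (0-10) into one weighted score."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


@dataclass
class Outcome:
    response: Optional[str]  # None when the query is escalated
    score: float
    cycles: int
    escalated: bool


# Hypothetical callable signatures (assumptions, not from the source):
#   generate(query, feedback) -> candidate response
#   evaluate(query, response) -> (per-dimension scores, feedback items)
Generate = Callable[[str, List[str]], str]
Evaluate = Callable[[str, str], Tuple[Dict[str, float], List[str]]]


def verify_loop(query: str, generate: Generate, evaluate: Evaluate) -> Outcome:
    """Regenerate with evaluator feedback until the threshold is met."""
    feedback: List[str] = []
    score = 0.0
    for cycle in range(1, MAX_CYCLES + 1):
        response = generate(query, feedback)
        scores, feedback = evaluate(query, response)
        score = weighted_score(scores)
        if score >= THRESHOLD:
            return Outcome(response, score, cycle, escalated=False)
    # Three cycles exhausted without reaching the threshold.
    return Outcome(None, score, MAX_CYCLES, escalated=True)
```

In this sketch the feedback list from one cycle is the only state carried into the next, so each regeneration targets the specific deficits the evaluator named.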
The processing graph enforces deterministic ordering and writes a timestamped, immutable artefact at each stage (generation, retrieval, verification, augmentation), producing an auditable trace.
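One way to obtain such a trace is sketched below, assuming a hash-chained list of frozen records. The hash chain, the `Artefact` fields, and the function names are assumptions made for illustration; the text states only that artefacts are timestamped, immutable, and produced in a fixed order. The stage order follows the list given in the text.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Tuple

# Stage order as listed in the text.
STAGES = ("generation", "retrieval", "verification", "augmentation")
GENESIS = "0" * 64  # placeholder digest preceding the first artefact


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Artefact:
    stage: str
    payload: str
    timestamp: str
    prev_digest: str
    digest: str


def _digest(stage: str, payload: str, timestamp: str, prev_digest: str) -> str:
    body = json.dumps([stage, payload, timestamp, prev_digest])
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_stage(trace: Tuple[Artefact, ...], stage: str,
                 payload: str) -> Tuple[Artefact, ...]:
    """Return a new trace with one more artefact; reject out-of-order stages."""
    if len(trace) >= len(STAGES) or stage != STAGES[len(trace)]:
        raise ValueError(f"unexpected stage {stage!r} at position {len(trace)}")
    timestamp = datetime.now(timezone.utc).isoformat()
    prev = trace[-1].digest if trace else GENESIS
    artefact = Artefact(stage, payload, timestamp, prev,
                        _digest(stage, payload, timestamp, prev))
    return trace + (artefact,)


def verify_trace(trace: Tuple[Artefact, ...]) -> bool:
    """Check stage order, chain links, and that no artefact was altered."""
    prev = GENESIS
    for position, item in enumerate(trace):
        if position >= len(STAGES) or item.stage != STAGES[position]:
            return False
        if item.prev_digest != prev:
            return False
        if item.digest != _digest(item.stage, item.payload,
                                  item.timestamp, item.prev_digest):
            return False
        prev = item.digest
    return True
```

Because each digest covers the previous one, editing any earlier artefact invalidates every later link, which is what makes the trace checkable after the fact.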
Retrieval Pipeline
The corpus comprises 81,502 rheumatology articles embedded at 1,024 dimensions (text-embedding-3-large). PCA on this corpus reveals ordered variance decay:
| Dimensions (PCA order) | Variance explained | Bits per dimension | Semantic content |
|---|---|---|---|
| 1-128 | 68% | 6 | Disease entities, anatomy, therapeutics |
| 129-512 | 25% | 4 | Comorbidity patterns, temporal trajectories |
| 513-1024 | 7% | 2 | Contextual nuance |
Quantizing each band at its listed precision compresses the embeddings from 335 MB to 39 MB (8.5x). Recall@10 is 95%, compared with 87% for a generic random rotation on the same corpus. The 8-point deficit arises because random rotations spread variance uniformly across dimensions, destroying the concentrated clinical signal in the top principal components.
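The bit budget implied by the table can be checked with a few lines of arithmetic, followed by a minimal uniform scalar quantizer of the kind that could encode one coordinate at a given precision. This is a sketch: float32 storage for the uncompressed vectors, the uniform quantizer, and the clipping range are assumptions, since the text does not specify the codec. The payload-only figures it produces (about 334 MB and 34 MB) are close to, but not identical with, the reported 335 MB and 39 MB; the text does not itemise the difference, which might come from per-vector metadata or index overhead.

```python
from typing import List, Tuple

# Bands from the table: (first_dim, last_dim, bits per dimension).
BANDS: List[Tuple[int, int, int]] = [(1, 128, 6), (129, 512, 4), (513, 1024, 2)]
N_VECTORS = 81_502
N_DIMS = 1024
FLOAT32_BITS = 32  # assumption: uncompressed vectors are stored as float32


def bits_per_vector(bands: List[Tuple[int, int, int]]) -> int:
    """Total bits needed to store one vector under the banded scheme."""
    return sum((last - first + 1) * bits for first, last, bits in bands)


def payload_megabytes(n_vectors: int, bits: int) -> float:
    """Raw payload size in MB (10^6 bytes), excluding any metadata."""
    return n_vectors * bits / 8 / 1e6


def quantize(value: float, bits: int, lo: float, hi: float) -> int:
    """Map a value in [lo, hi] to an integer code with 2**bits levels."""
    levels = (1 << bits) - 1
    clipped = min(max(value, lo), hi)
    return round((clipped - lo) / (hi - lo) * levels)


def dequantize(code: int, bits: int, lo: float, hi: float) -> float:
    """Map an integer code back to the centre of its quantization cell."""
    levels = (1 << bits) - 1
    return lo + code / levels * (hi - lo)


packed_bits = bits_per_vector(BANDS)          # 128*6 + 384*4 + 512*2
full_mb = payload_megabytes(N_VECTORS, N_DIMS * FLOAT32_BITS)
packed_mb = payload_megabytes(N_VECTORS, packed_bits)
```

Under these assumptions each vector occupies 3,328 bits (416 bytes) instead of 4,096 bytes, with most of the bits spent on the 128 leading components that carry 68% of the variance.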
At inference, a coarse HNSW search over pgvector (under 50 ms) returns 50 candidates, and local re-ranking with full fine-grained similarity narrows these to the 10 final passages.
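The two-stage pattern can be illustrated with a small in-memory sketch. Here a brute-force scan over the leading dimensions stands in for the HNSW search over pgvector, and cosine similarity over all dimensions stands in for the fine-grained re-ranking; both substitutions, along with the function names and the `coarse_dims` parameter, are assumptions for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def coarse_search(query: Sequence[float], corpus: List[Sequence[float]],
                  n_candidates: int, coarse_dims: int) -> List[int]:
    """Stage 1 stand-in: rank by similarity on the leading dimensions only."""
    q = query[:coarse_dims]
    scored = [(cosine(q, vec[:coarse_dims]), idx)
              for idx, vec in enumerate(corpus)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:n_candidates]]


def rerank(query: Sequence[float], corpus: List[Sequence[float]],
           candidates: List[int], top_k: int) -> List[Tuple[int, float]]:
    """Stage 2: re-score the candidates using every dimension."""
    scored = [(cosine(query, corpus[idx]), idx) for idx in candidates]
    scored.sort(reverse=True)
    return [(idx, score) for score, idx in scored[:top_k]]


def retrieve(query: Sequence[float], corpus: List[Sequence[float]],
             n_candidates: int = 50, top_k: int = 10,
             coarse_dims: int = 128) -> List[Tuple[int, float]]:
    """Coarse search for n_candidates, then re-rank down to top_k."""
    candidates = coarse_search(query, corpus, n_candidates, coarse_dims)
    return rerank(query, corpus, candidates, top_k)
```

The defaults mirror the 50-candidate and 10-passage figures in the text. The design point is that the expensive full-precision comparison runs on 50 vectors rather than on the whole corpus.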
Evaluation
The evaluation comprised seven protocols and 125 scenarios across five operational modes (unaugmented, single-pass verification, full verification loop, retrieval-only, combined system). Results by protocol:
Protocol B (the paradox): Naive retrieval from an unfiltered vector store scored 7.92, against 8.38 for the unaugmented model. The retrieved passages had cosine similarities of only 0.33-0.38, which is noise rather than signal.
Protocol C (curated retrieval): Full verification with the curated knowledge base achieved a standard deviation (SD) of 0.10 across scenarios, against 0.95 for retrieval-only, an 89% reduction in SD.
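The 89% figure corresponds to the relative drop in standard deviation. The short calculation below, using only the two SD values reported for Protocol C, shows both that figure and the corresponding drop in variance (SD squared), which is larger. The function names are illustrative.

```python
SD_RETRIEVAL_ONLY = 0.95  # reported SD, retrieval-only mode
SD_COMBINED = 0.10        # reported SD, full verification + curated retrieval


def relative_sd_reduction(sd_before: float, sd_after: float) -> float:
    """Fractional reduction in standard deviation."""
    return 1.0 - sd_after / sd_before


def relative_variance_reduction(sd_before: float, sd_after: float) -> float:
    """Fractional reduction in variance, where variance = SD squared."""
    return 1.0 - (sd_after / sd_before) ** 2


sd_drop = relative_sd_reduction(SD_RETRIEVAL_ONLY, SD_COMBINED)
var_drop = relative_variance_reduction(SD_RETRIEVAL_ONLY, SD_COMBINED)
print(f"SD reduction: {sd_drop:.1%}, variance reduction: {var_drop:.1%}")
```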
Protocol E (rare diseases: relapsing polychondritis, IgG4-RD, EGPA): The combined system scored 8.44 against 8.24 for the unaugmented model (paired permutation P=0.034, Hochberg-corrected). The largest gains occurred for the rarest conditions (delta +0.52).
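For readers unfamiliar with the statistics named here, the sketch below shows an exact paired sign-flip permutation test and Hochberg step-up adjustment. It is illustrative only: the text does not say whether the test was exact or Monte Carlo, which test statistic was used, or how many comparisons entered the correction, so those choices (sum of paired differences, two-sided, full enumeration) are assumptions.

```python
from itertools import product
from typing import List, Sequence


def paired_permutation_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact sign-flip test on the sum of paired differences.

    Enumerates all 2**n sign assignments, so it is only practical
    for small n.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
        total += 1
    return hits / total


def hochberg_adjust(pvalues: Sequence[float]) -> List[float]:
    """Hochberg step-up adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest; the multiplier grows 1..m.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running = 1.0
    for rank, idx in enumerate(order):
        running = min(running, (rank + 1) * pvalues[idx])
        adjusted[idx] = running
    return adjusted
```

A hypothetical use would pass per-scenario scores for the combined and unaugmented modes to `paired_permutation_p`, then pass the resulting p-values from all protocols to `hochberg_adjust`.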
Protocol F (TRUST-Bench v3 safety): The full pipeline outperformed single-pass verification by 7.3 points on safety, 10.0 points on escalation appropriateness, and 11.3 points on diagnostic accuracy. It catches errors such as omitted TB screening before biologics and unrecognised posterior reversible encephalopathy in lupus with acute hypertension.
Protocol G (combined system): The combined system achieved a 68% win rate against the unaugmented model across 25 scenarios. Hallucination fell to under 2%, from 12-15%. Bayesian posterior P(superiority) = 0.89 (95% CI 0.82-0.94).
Conclusion
Verification and retrieval are jointly necessary. Verification alone cannot supply missing knowledge. Retrieval alone degrades specialist performance when embedding precision is inadequate. The Knowledge Retrieval Paradox is an artefact of retrieval imprecision, not an inherent limitation of augmented generation.
Authors
Zamora-Tehozol EA, DNAI, Meléndez-Córdoba A, Hernández-Gutiérrez RA, Arzápalo-Metri JI