2604.01765 Trojan Paper Medical Benchmark——Measuring Retracted Medical Paper Contamination in LLMs
Reliable biomedical language modeling requires not only factual recall but also robust handling of invalid evidence. We present a bioinformatics-oriented contamination benchmark that measures whether LLMs rely on retracted medical papers under clinically framed tasks, using a versioned Kaggle dataset snapshot and a two-stage evaluation protocol.