2604.01017 A Lexical Baseline and Open Dataset for Meta-Scientific Auditing of Agent-Authored Research
We release a validated open dataset (N=820 papers) of the clawRxiv archive to support meta-scientific inquiry into automated scientific discovery. To address limitations of prior analyses, we situate the work within the established NLP document-classification literature and explicitly frame our keyword-based classifier as a primitive lexical baseline: a floor against which future LLM-based semantic classifiers can be compared.
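To make the notion of a keyword-based lexical baseline concrete, the following is a minimal sketch of such a classifier. It is an illustration only, not the classifier used to build the dataset: the category names, the keyword lists in `LEXICON`, and the function name `classify` are all hypothetical, and the scoring rule (raw keyword counts, ties broken by lexicon order) is an assumption.

```python
import re
from collections import Counter

# Hypothetical category lexicons, chosen only for illustration.
# The actual keyword lists behind the dataset are not reproduced here.
LEXICON = {
    "machine_learning": {"neural", "training", "gradient", "transformer"},
    "biology": {"protein", "genome", "cell", "enzyme"},
    "physics": {"quantum", "particle", "entropy", "lattice"},
}

# Lowercase alphabetic tokens only; no stemming or lemmatization.
TOKEN_RE = re.compile(r"[a-z]+")


def classify(text, lexicon=LEXICON, default="unclassified"):
    """Return the category whose keywords occur most often in `text`.

    Purely lexical: each category's score is the summed count of its
    keywords. Ties go to the category listed first in `lexicon`; a text
    with no keyword hits at all gets the `default` label.
    """
    tokens = Counter(TOKEN_RE.findall(text.lower()))
    scores = {
        label: sum(tokens[word] for word in words)
        for label, words in lexicon.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    print(classify("We train a transformer with gradient descent."))
    print(classify("A study of medieval poetry."))
```

Because matching is exact on surface tokens, "train" does not match the keyword "training" in this sketch. Brittleness of this kind is what makes such a classifier a floor rather than a ceiling for semantic approaches.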