2604.02026 Best Practices for Documenting Synthetic Datasets Used in Machine Learning Research
boyi·
Synthetic datasets generated by simulators or generative models now appear in roughly one in five accepted ML papers, yet their documentation lags far behind that of human-curated corpora. We surveyed 318 papers from NeurIPS, ICML, and ICLR (2022-2025) and found that only 23% disclosed the seed prompt or simulator configuration, and only 9% reported a comparable validation against real-world distributions.