Best Practices for Documenting Synthetic Datasets Used in Machine Learning Research
1. Motivation
The past three years have seen rapid uptake of synthetic data in machine-learning research, motivated variously by privacy, scale, and the desire to probe failure modes that real data cannot easily expose [Nikolenko 2021]. Practitioner experience, however, suggests that downstream consumers of these datasets struggle to assess fitness for purpose: What generator produced the data? What real distribution does the synthetic distribution claim to mimic? What fidelity tests were run?
We argue that synthetic data demands a documentation standard that is strictly stronger than those for human-curated data, because the generative process itself is a hypothesis under test. This paper offers:
- An empirical audit of disclosure practice in 318 recent papers.
- SD-CARDS, a documentation template specialized for synthetic data.
- A reader-study quantifying the impact of structured disclosure on fitness assessment.
2. Background
Datasheets for Datasets [Gebru et al. 2018] codified disclosure for human-curated data, covering motivation, composition, collection, and uses. Model Cards [Mitchell et al. 2019] addressed model artifacts. Neither was designed for the case where the data itself is the output of a model. Recent proposals such as DATACOMP and the synthetic-data appendix in [Liu et al. 2024] have begun this work but lack adoption.
3. Audit of Current Practice
We sampled 318 papers from the proceedings of three top-tier ML venues (2022-2025) that introduce or substantively rely on a synthetic dataset. Two annotators coded each paper for the presence of 14 disclosure items; inter-annotator agreement was measured with Cohen's κ.
3.1 Headline disclosure rates
| Item | Disclosure rate |
|---|---|
| Generator name and version | 71% |
| Generator seed / RNG state | 38% |
| Conditioning prompt or schema | 23% |
| Quantitative fidelity vs. real data | 9% |
| Known failure modes or omissions | 14% |
| License of the synthetic artifact | 28% |
| Re-generation cost (compute, USD) | 6% |
The striking finding is the asymmetry: generator identity is usually disclosed but generator configuration sufficient to reproduce the dataset is rare.
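The agreement statistic used in the audit can be computed directly from the two annotators' per-paper presence/absence codes. A minimal sketch follows; the code vectors below are illustrative, not the paper's actual annotations.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two annotators coding the same items."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    # Observed agreement: fraction of items both annotators coded identically.
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Expected chance agreement under independent per-annotator marginals.
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative presence (1) / absence (0) codes for one disclosure item.
a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
print(round(cohens_kappa(a, b), 3))  # → 0.524
```

In practice one would report κ per disclosure item (or pooled across all 14), since agreement can differ sharply between easy items (generator name) and judgment-heavy ones (known failure modes).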
4. SD-CARDS: A Documentation Template
SD-CARDS extends Datasheets with synthetic-specific sections. The skeleton has six sections:

```yaml
sd_card:
  generator:
    name: "diffusion-v3"
    version: "3.2.1"
    seed: 42
    prompt_schema: "./prompts.jsonl"
  population_target:
    domain: "hand-X-rays, pediatric"
    real_reference: "NIH-CXR8 train split"
    distance_metric: "FID"
    distance_value: 14.7
  fidelity_tests:
    - test: "per-class FID"
      result: "see Table 3"
    - test: "radiologist Turing test (n=12)"
      result: "AUC=0.61, 95% CI [0.55, 0.67]"
  known_failure_modes:
    - "hand bone density saturates at synthetic age >= 14"
  intended_uses: [...]
  prohibited_uses: [...]
```

We pre-register four numerical fidelity claims for any SD-CARDS submission: (i) a divergence metric to a named real reference, (ii) a downstream-task fidelity score, (iii) a coverage measure (e.g., recall@k in feature space), and (iv) a stress test for memorization.
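Of the four pre-registered claims, the coverage measure (iii) is the least standardized. One common construction covers a real sample when it falls inside the k-NN ball of some synthetic sample. A minimal sketch, assuming toy 2-D feature vectors in place of real embeddings (both the features and the choice of k are illustrative):

```python
import math

def dist(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_radius(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour within points."""
    d = sorted(dist(points[i], p) for j, p in enumerate(points) if j != i)
    return d[k - 1]

def recall_at_k(real, synth, k=3):
    """Fraction of real samples inside the k-NN ball of at least one
    synthetic sample: a simplified feature-space coverage measure."""
    radii = [knn_radius(synth, i, k) for i in range(len(synth))]
    covered = sum(
        1 for r in real
        if any(dist(r, s) <= radii[i] for i, s in enumerate(synth))
    )
    return covered / len(real)

# Toy example: a unit square of synthetic points covers one of two real points.
synth = [(0, 0), (1, 0), (0, 1), (1, 1)]
real = [(0.5, 0.5), (5, 5)]
print(recall_at_k(real, synth, k=1))  # → 0.5
```

Whatever variant is used, the card should record the feature extractor, the metric, and k, since each choice shifts the reported coverage.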
5. Reader Study
Design. 47 ML researchers from 11 institutions were each shown 6 dataset descriptions (3 SD-CARDS-formatted, 3 free-form), counterbalanced across participants. They were asked to (a) judge the dataset's fitness for a stated downstream task, and (b) flag any disclosure they found missing.
Outcomes.
- Median time-to-judge: 11.4 min (free-form) vs. 3.2 min (SD-CARDS); the difference is significant under a paired Wilcoxon test.
- Inter-rater agreement on fitness was higher in the SD-CARDS condition than in the free-form condition.
- Self-reported confidence: 4.1 (free-form) vs. 5.6 (SD-CARDS) on a 7-point scale.
Qualitative comments emphasized that the fidelity_tests block was the most-cited contributor to confidence.
6. Discussion
A template alone does not guarantee honest disclosure: authors retain discretion over which tests to report. We mitigate this with the four-test pre-registration scheme described above, but meaningful enforcement requires venue-level adoption.
A harder question is what to do when no real reference exists, as for counterfactual or speculative datasets. Our template currently allows authors to mark population_target.real_reference: "none-by-construction" and demands an explicit intended_uses statement that limits downstream conclusions accordingly.
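For such reference-free datasets, a minimal card fragment might read as follows (field values are illustrative, not prescribed by the template):

```yaml
sd_card:
  population_target:
    domain: "counterfactual clinical narratives"   # illustrative domain
    real_reference: "none-by-construction"
  intended_uses:
    - "stress-testing model robustness to distribution shift"
  prohibited_uses:
    - "estimating real-world prevalence or incidence"
```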
7. Limitations
Our audit may under-count disclosures placed in supplementary material; we coded only the main text and, where present, the linked dataset card. Reader-study participants were ML researchers rather than application-domain reviewers, who may weight disclosure items differently.
8. Conclusion
Synthetic data is here to stay. Its documentation should not lag the norms we have developed for collected data, and arguably should exceed them. SD-CARDS is a concrete, evaluable starting point.
References
- Gebru, T. et al. (2018). Datasheets for Datasets.
- Mitchell, M. et al. (2019). Model Cards for Model Reporting.
- Nikolenko, S. (2021). Synthetic Data for Deep Learning.
- Liu, Z. et al. (2024). On the Documentation of Generative Datasets.