Curriculum-Aware Synthetic Data Generation for Mathematical Reasoning
1. Introduction
The last eighteen months have seen synthetic problem generation become the workhorse of math reasoning model training. Pipelines such as those described by [Yu et al. 2024] and [Toshniwal et al. 2024] produce millions of problems by prompting a strong teacher model with seed problems and accepting outputs that pass a verifier. These pipelines, however, generally sample uniformly from whatever the teacher emits, producing distributions skewed toward problems the teacher finds easy.
We argue that the shape of the difficulty distribution matters at least as much as the size of the dataset, and that aligning it to a curriculum yields measurable gains.
2. Background
Classical curriculum learning [Bengio et al. 2009] presents easier examples first, gradually increasing difficulty. In LLM fine-tuning, however, the dynamics are more nuanced: gradients on too-easy examples produce vanishing learning signal, while problems above a model's current capability often produce confidently wrong rationales that the model then memorizes [Zelikman et al. 2022].
We define problem difficulty operationally. Let $T$ be a teacher model and let $q$ be a generated problem. Define

$$d(q) = 1 - \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[a_i = a^{*}\right],$$

where $a_1, \dots, a_N$ are stochastic samples from the teacher and $a^{*}$ is the verified ground-truth answer. Difficulty is bounded in $[0, 1]$ and discretized into ten bins.
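As a concrete sketch of this estimator (the `teacher.sample` and `verify` interfaces are hypothetical stand-ins, and the default rollout count is an assumption, not a value fixed by the paper):

```python
def estimate_difficulty(problem, teacher, verify, n_rollouts=16):
    """Monte Carlo estimate of d(q): the fraction of teacher rollouts
    whose final answer fails verification. `teacher.sample` and `verify`
    are hypothetical interfaces; n_rollouts=16 is an assumed default."""
    correct = sum(
        verify(teacher.sample(problem), problem.answer)
        for _ in range(n_rollouts)
    )
    return 1.0 - correct / n_rollouts

def difficulty_bin(d, n_bins=10):
    # Discretize d in [0, 1] into ten bins, per the definition above.
    return min(int(d * n_bins), n_bins - 1)
```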
3. Method
Our generator proceeds in three stages.
Stage A (Generation). A teacher emits raw candidate problems by mutating a seed bank of 38,000 olympiad and textbook problems. We accept a candidate only when its verifier-checked answer matches the proposed solution, as in the sketch below.
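A minimal sketch of this accept/reject loop, under assumed `mutate`, `solve`, and `check` interfaces that the paper does not specify:

```python
import random

def generate_accepted(seed_bank, teacher, verifier, n_target):
    """Verifier-gated generation (Stage A). `teacher.mutate`,
    `teacher.solve`, and `verifier.check` are assumed interfaces."""
    accepted = []
    while len(accepted) < n_target:
        seed = random.choice(seed_bank)
        candidate = teacher.mutate(seed)         # mutate a seed problem
        solution = teacher.solve(candidate)      # teacher proposes a solution
        if verifier.check(candidate, solution):  # keep only verified pairs
            accepted.append((candidate, solution))
    return accepted
```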
Stage B (Difficulty estimation). We sample $N$ teacher rollouts per accepted problem, compute $d(q)$, and assign the problem to a difficulty bin.
Stage C (Curriculum resampling). We resample with replacement to match a target distribution $p_t(d)$ over difficulty at training step $t$. The target is a Gaussian band around a center that advances linearly,

$$p_t(d) \propto \exp\left(-\frac{(d - \mu(t))^2}{2\sigma^2}\right), \qquad \mu(t) = \mu_0 + (\mu_T - \mu_0)\,\frac{t}{T},$$

with $\mu_0 = 0.2$, $\mu_T = 0.7$, and $\sigma = 0.18$, advancing the curriculum from easy to hard over training, as implemented below.
```python
import numpy as np

def sample_batch(problems, step, total_steps, batch_size, sigma=0.18):
    # Curriculum center mu(t) ramps linearly from 0.2 to 0.7.
    mu = 0.2 + 0.5 * (step / total_steps)
    # Gaussian weight on each problem's estimated difficulty d(q).
    weights = np.exp(-((problems.difficulty - mu) ** 2) / (2 * sigma ** 2))
    weights /= weights.sum()
    # Sample row indices with replacement; `problems` is assumed to be a
    # pandas DataFrame with a `difficulty` column.
    idx = np.random.choice(len(problems), size=batch_size, p=weights)
    return problems.iloc[idx]
```
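Inside a hypothetical training loop the sampler is called once per step; the batch size of 512 and the `train_step` helper are illustrative, not reported hyperparameters:

```python
for step in range(total_steps):
    batch = sample_batch(problems, step, total_steps, batch_size=512)
    loss = train_step(model, batch)  # assumed single-step fine-tuning helper
```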
4. Experimental Setup
We fine-tune a 7B base model on 1.2M problems for three epochs with AdamW and cosine learning-rate decay. The compute budget is fixed at 14,400 H100-hours; the curriculum-aware variant uses additional teacher samples for difficulty estimation, which are accounted for within this budget.
We compare four conditions: (i) Uniform — sample uniformly from accepted problems; (ii) Easy-first — train only on the easiest difficulty bins; (iii) Hard-first — train only on the hardest bins; (iv) Curriculum — our method.
5. Results
| Method | MATH | GSM8K | Olympiad-bench |
|---|---|---|---|
| Uniform | 47.3 | 88.1 | 19.4 |
| Easy-first | 44.0 | 88.7 | 14.2 |
| Hard-first | 39.1 | 81.6 | 21.0 |
| Curriculum | 52.8 | 89.0 | 24.1 |
The curriculum gain on Olympiad-bench is 4.7 absolute points over Uniform (paired bootstrap). Confidence intervals are tight on GSM8K because the benchmark is near-saturated.
We also observed that the Hard-first condition exhibited training instability: roughly 11% of training steps produced gradient norms exceeding a fixed multiple of the running mean, and final loss curves had higher variance across seeds (coefficient of variation 0.18, vs. 0.07 for Curriculum).
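One way to measure this, sketched with an exponential running mean; the spike threshold and decay rate here are illustrative assumptions, not the paper's exact values:

```python
def spike_fraction(grad_norms, threshold=5.0, decay=0.99):
    # Fraction of steps whose gradient norm exceeds `threshold` times an
    # exponential running mean; both parameters are assumed values.
    running, spikes = grad_norms[0], 0
    for g in grad_norms[1:]:
        if g > threshold * running:
            spikes += 1
        running = decay * running + (1 - decay) * g
    return spikes / (len(grad_norms) - 1)
```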
6. Failure Mode: Distributional Collapse
We noticed that in the hardest difficulty bins the teacher's accepted problems exhibit reduced lexical diversity (token 2-gram entropy drops 22%). This is consistent with the teacher reusing a small repertoire of hard templates. Models trained on this slice memorize template structure rather than learning generalizable reasoning. Practitioners should cap the hardest bin's sampling weight or supplement it with human-curated problems.
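The diversity diagnostic is easy to reproduce; a minimal sketch of token 2-gram entropy, assuming whitespace tokenization for illustration (the paper's tokenizer is unspecified):

```python
import math
from collections import Counter

def bigram_entropy(texts):
    # Shannon entropy (bits) of the token-bigram distribution over a slice
    # of problem texts; whitespace tokenization is an assumption.
    counts = Counter()
    for text in texts:
        tokens = text.split()
        counts.update(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())
```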
7. Discussion and Limitations
Our difficulty signal depends on the teacher's competence. If the teacher cannot solve a class of problems at all, $d(q)$ saturates at 1.0 and the curriculum loses resolution at the top. A multi-teacher ensemble may mitigate this.
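One plausible form of such an ensemble, reusing the `estimate_difficulty` sketch from Section 2 and averaging across teachers (a problem then saturates at 1.0 only if every teacher always fails on it):

```python
def ensemble_difficulty(problem, teachers, verify, n_rollouts=16):
    # Average per-teacher difficulty estimates; any teacher that sometimes
    # solves the problem pulls the estimate below 1.0, restoring resolution.
    estimates = [
        estimate_difficulty(problem, t, verify, n_rollouts) for t in teachers
    ]
    return sum(estimates) / len(estimates)
```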
We have not tested transfer to non-mathematical reasoning domains; the curriculum hyperparameters were tuned on math and may need re-tuning for code or proof generation.
8. Conclusion
Difficulty-aware resampling is a low-cost addition to existing synthetic data pipelines and produces consistent gains on harder benchmarks. The mechanism — matching gradient signal to current model capability — is general, and we expect related schedules to help in adjacent domains.
References
- Bengio, Y. et al. (2009). Curriculum learning.
- Yu, L. et al. (2024). MetaMath: Bootstrap your own mathematical questions.
- Toshniwal, S. et al. (2024). OpenMathInstruct.
- Zelikman, E. et al. (2022). STaR: Self-taught reasoner.
- Hendrycks, D. et al. (2021). Measuring mathematical problem solving with the MATH dataset.