Curriculum Distillation from Multi-Teacher Ensembles for Compact Language Models
1. Introduction
Knowledge distillation [Hinton et al. 2015] has become a workhorse technique for compressing capable but expensive LLMs into deployable student models. Practitioners increasingly distill from ensembles of teachers (e.g., Llama-3-70B for general text, CodeLlama-34B for code, and a math specialist) rather than from a single oracle. However, naive multi-teacher KD treats all teachers equally, which is suboptimal when teachers disagree, when one teacher is simply wrong, or when the student's capability frontier shifts during training.
We propose CurDist, a curriculum-distillation algorithm that (a) adaptively reweights teachers per example and (b) schedules training examples in order of increasing teacher disagreement.
2. Background
Given $K$ teachers producing distributions $p_1(\cdot\mid x),\dots,p_K(\cdot\mid x)$ and a student $q_\theta$, standard multi-teacher KD minimizes

$$\mathcal{L}_{\mathrm{KD}}(x) = \sum_{k=1}^{K} w_k\,\mathrm{KL}\!\left(p_k(\cdot\mid x)\,\big\|\,q_\theta(\cdot\mid x)\right)$$

with fixed weights $w_k \ge 0$, $\sum_k w_k = 1$. This formulation cannot express which teacher is the right oracle for a given $x$, nor does it order training examples.
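As a point of reference, a minimal PyTorch-style sketch of this fixed-weight objective (the variable names and the temperature argument are illustrative, not taken from any released code):

```python
import torch
import torch.nn.functional as F

def fixed_weight_kd_loss(teacher_logits, student_logits, weights, temperature=1.0):
    """Standard multi-teacher KD: sum_k w_k * KL(p_k || q_theta) with fixed w_k.

    teacher_logits: list of K [batch, vocab] tensors
    student_logits: [batch, vocab] tensor from the student q_theta
    weights:        K fixed scalars summing to 1
    """
    log_q = F.log_softmax(student_logits / temperature, dim=-1)
    loss = student_logits.new_zeros(())
    for w_k, t_logits in zip(weights, teacher_logits):
        p_k = F.softmax(t_logits / temperature, dim=-1)
        # KL(p_k || q) = sum_y p_k(y) * (log p_k(y) - log q(y)), averaged over the batch
        kl = (p_k * (p_k.clamp_min(1e-9).log() - log_q)).sum(dim=-1).mean()
        loss = loss + w_k * kl
    return loss
```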
3. Method
3.1 Per-example teacher reweighting
We set

$$w_k(x) \;\propto\; c_k\,\exp\!\Big(-\mathrm{KL}\!\left(p_k(\cdot\mid x)\,\big\|\,\bar{p}(\cdot\mid x)\right)\Big), \qquad \textstyle\sum_k w_k(x) = 1,$$

where $\bar{p}(\cdot\mid x) \propto \prod_k p_k(\cdot\mid x)^{1/K}$ is the geometric-mean ensemble and $c_k$ is a learnable competence prior.
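A minimal sketch of this reweighting, assuming the exponential-agreement form above (the `adaptive_weights` name mirrors the training loop in Section 3.3; tensor shapes are illustrative):

```python
import torch

def adaptive_weights(teacher_probs, comp_priors, eps=1e-9):
    """Per-example weights: w_k(x) ∝ c_k * exp(-KL(p_k || p_bar)), normalized over k.

    teacher_probs: [K, batch, vocab] teacher distributions
    comp_priors:   [K] positive learnable competence priors c_k
    Returns:       [K, batch] weights summing to 1 over the teacher axis.
    """
    log_p = teacher_probs.clamp_min(eps).log()
    # Geometric-mean ensemble: renormalize the average log-probability.
    p_bar = torch.softmax(log_p.mean(dim=0), dim=-1)                      # [batch, vocab]
    kl = (teacher_probs * (log_p - p_bar.clamp_min(eps).log())).sum(-1)   # KL(p_k || p_bar)
    scores = comp_priors.unsqueeze(1) * torch.exp(-kl)                    # [K, batch]
    return scores / scores.sum(dim=0, keepdim=True)
```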
3.2 Curriculum scheduling
We define example difficulty as the Jensen-Shannon divergence among teachers:

$$d(x) = \mathrm{JSD}\!\left(p_1(\cdot\mid x),\dots,p_K(\cdot\mid x)\right) = H\!\Big(\tfrac{1}{K}\textstyle\sum_k p_k(\cdot\mid x)\Big) - \tfrac{1}{K}\textstyle\sum_k H\!\left(p_k(\cdot\mid x)\right).$$

We sort the corpus by $d(x)$ and feed examples in increasing-$d$ order with a sliding window of width 2,048 (re-shuffled each epoch). The intuition: easy examples (teacher consensus) provide a stable gradient signal early, while hard examples (teacher disagreement) are introduced once the student can handle nuance.
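A sketch of the difficulty score and the ordering it induces; the block-wise shuffle below is one way to realize the windowed re-shuffle, with the window width and helper names as assumptions:

```python
import torch

def js_divergence(teacher_probs, eps=1e-9):
    """Generalized JSD over K teacher distributions: H(mean_k p_k) - mean_k H(p_k).

    teacher_probs: [K, vocab] (per-token distributions would be averaged upstream).
    """
    mean_p = teacher_probs.mean(dim=0)
    h_mean = -(mean_p * mean_p.clamp_min(eps).log()).sum()
    h_each = -(teacher_probs * teacher_probs.clamp_min(eps).log()).sum(dim=-1)
    return h_mean - h_each.mean()

def curriculum_order(difficulties, window=2048, generator=None):
    """Sort example indices by difficulty, then shuffle within consecutive windows."""
    order = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    shuffled = []
    for start in range(0, len(order), window):
        chunk = order[start:start + window]
        perm = torch.randperm(len(chunk), generator=generator).tolist()
        shuffled.extend(chunk[j] for j in perm)
    return shuffled
```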
3.3 Training loop
```python
# One pass of CurDist: teachers are frozen; the student and competence priors train.
for x in sorted_by_difficulty(corpus):                     # increasing-d order (Sec. 3.2)
    teacher_dists = [t.predict(x) for t in teachers]       # K teacher distributions
    weights = adaptive_weights(teacher_dists, comp_priors) # per-example weights (Sec. 3.1)
    target = mix(teacher_dists, weights)                   # weighted mixture target
    loss = kl(target, student.predict(x))                  # KL(target || student)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

4. Experimental Setup
Teachers: Llama-3-8B, Mistral-7B, CodeLlama-7B, Qwen-1.5-14B, Llama-3-70B. Student: a 1.3B decoder-only model with the same tokenizer as Llama-3. Corpus: 22B tokens from a mixture of OpenWebMath, RedPajama-V2, and CodeParrot. Hardware: 64 H100s, 6 days of training for the main run.
Baselines: (a) uniform-weight multi-teacher KD; (b) best-single-teacher KD (Llama-3-70B); (c) static-weight oracle (weights tuned on val).
5. Results
| Method | MMLU | HumanEval | GSM8K | Train tokens (B) |
|---|---|---|---|---|
| Best-single (70B) | 39.6 | 21.3 | 12.4 | 22.0 |
| Uniform MT-KD | 41.2 | 24.0 | 13.7 | 22.0 |
| Static-weight oracle | 42.0 | 24.8 | 14.1 | 22.0 |
| CurDist | 43.4 | 26.5 | 15.2 | 13.6 |
CurDist achieves the strongest results while consuming 38% fewer training tokens: early stopping triggers when the student's validation loss plateaus, which happens sooner under the curriculum.
5.1 Ablations
- Removing curriculum (random shuffle, adaptive weights only): MMLU drops to 42.1.
- Removing adaptive weights (curriculum only): MMLU drops to 41.7.
- Both contribute; their effects are roughly additive.
6. Analysis
We observe that early in training CurDist heavily weights the smaller teachers (Mistral-7B, Llama-3-8B), which agree on easy examples. Mid-training, the weight on CodeLlama-7B spikes on code tokens (as expected). Late training is dominated by Llama-3-70B on hard reasoning tasks. The algorithm thus re-discovers a sensible curriculum without explicit domain labels.
7. Limitations
The difficulty estimator requires running all teachers on every example, which is expensive at corpus scale. We mitigate this by computing $d(x)$ once and caching it, but the up-front cost is real. The method also assumes teachers share a tokenizer; cross-tokenizer distillation requires alignment heuristics we did not explore here.
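For reference, a minimal caching sketch under the same assumptions as the pseudocode in Section 3.3 (the `t.predict` interface, the `(id, example)` corpus layout, and the cache file name are illustrative; `js_divergence` is the helper sketched in Section 3.2):

```python
import json
import os

import torch

def cached_difficulties(corpus, teachers, cache_path="difficulty_cache.json"):
    """Compute the JSD difficulty d(x) once per example and persist it to disk."""
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    scores = {}
    for example_id, x in corpus:                                 # corpus yields (id, example)
        probs = torch.stack([t.predict(x) for t in teachers])    # [K, vocab]
        scores[example_id] = js_divergence(probs).item()
    with open(cache_path, "w") as f:
        json.dump(scores, f)
    return scores
```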
8. Conclusion
Curriculum scheduling based on teacher disagreement, combined with per-example adaptive weighting, materially improves multi-teacher distillation. CurDist offers a practical recipe for compressing heterogeneous LLM ensembles into compact deployable students.
References
- Hinton, G., Vinyals, O., Dean, J. (2015). Distilling the Knowledge in a Neural Network.
- Wu, Q. et al. (2021). One Teacher is Enough? Pre-trained Language Model Distillation.
- Bengio, Y. et al. (2009). Curriculum Learning.
- Gou, J. et al. (2021). Knowledge Distillation: A Survey.
- Touvron, H. et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models.