Universal Scaling of Pretraining Generalization Gaps via Thermodynamic Analogies
1. Introduction
Neural scaling laws, the regular power-law dependence of pretraining loss on parameter count and data tokens, are now an empirical foundation of large-model research [Kaplan et al. 2020, Hoffmann et al. 2022]. Existing work focuses primarily on the training loss (or, interchangeably, the validation loss when overfitting is negligible). Less attention has been paid to the gap between training and validation loss, even though this gap is the proper quantity for characterizing generalization.
We document a clean and apparently universal scaling form for this gap and offer a thermodynamic interpretation that explains both its functional form and the observed crossover behavior at the boundary between data-bound and parameter-bound regimes.
2. Empirical Setup
We trained or collected 174 independent runs spanning a range of parameter counts $N$ and training-token counts $D$. Architectures span standard pre-LN transformers, parallel-block variants, and four MoE configurations. Data domains include common-crawl text, code, scientific papers, and a mixed pretraining corpus.
For each run we recorded the time series of train and validation loss and extracted the converged generalization gap
$$G \;\equiv\; \bar{\mathcal{L}}_{\mathrm{val}} - \bar{\mathcal{L}}_{\mathrm{train}},$$
where $\bar{\mathcal{L}}$ denotes a smoothed final-100-step average.
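For concreteness, a minimal sketch of this extraction, assuming per-step losses are available as 1-D arrays (the function and argument names are illustrative, not from the paper):

```python
import numpy as np

def converged_gap(train_loss, val_loss, window=100):
    """Converged generalization gap: smoothed final-`window`-step
    validation loss minus smoothed final-`window`-step training loss.
    `train_loss` and `val_loss` are 1-D arrays of per-step losses."""
    train_bar = np.mean(train_loss[-window:])  # \bar{L}_train
    val_bar = np.mean(val_loss[-window:])      # \bar{L}_val
    return val_bar - train_bar
```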
3. Empirical Scaling Form
We find that $G$ is well described by
$$G(N, D) \;\approx\; A\, N^{-\alpha}\, f\!\left(D / N^{z}\right),$$
with fitted exponents $\alpha = 0.31$ and $z = 0.74$ and prefactor $A = 0.86$.
The scaling function $f(u)$ is approximately constant for $u \lesssim u^{*}$ (parameter-rich regime) and decays as $(u/u^{*})^{-\beta}$ with $\beta = 0.42$ for $u \gtrsim u^{*}$ (data-rich regime); the fitted crossover is $u^{*} \approx 1$.
The data collapse, plotting $G\,N^{\alpha}$ versus $u = D/N^{z}$, yields a single curve across all 174 runs with residual scatter under 8%.
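A minimal sketch of the collapse mapping, assuming per-run arrays of parameter counts, token counts, and converged gaps (names are illustrative; the exponents are the fitted values above):

```python
import numpy as np

ALPHA, Z = 0.31, 0.74  # fitted exponents from this section

def collapse_coordinates(N, D, G):
    """Map each run to the collapse plane: x = D / N^z, y = G * N^alpha.
    Under the scaling form, all runs should fall on the single curve A*f."""
    N, D, G = map(np.asarray, (N, D, G))
    x = D / N**Z
    y = G * N**ALPHA  # divides out the N^{-alpha} envelope
    return x, y
```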
4. Thermodynamic Analogy
The collapse exponents and the crossover behavior have a natural interpretation if we treat pretraining as a thermodynamic process.
- Parameters play the role of a heat-bath degrees-of-freedom count.
- Data plays the role of work performed on the system.
- The gap is the entropic part of the free energy: the difference between the model's capacity to fit any training set and its actual fit to the population distribution.
Under a Maxwell-Boltzmann analogy with effective temperature $T$ controlled by stochastic gradient noise [Mandt et al. 2017], the equilibrium gap follows
$$G \;\approx\; T\, S(N),$$
where $S(N)$ is the model's entropy. This rearranges to the empirical scaling form, with a predicted exponent that falls within error bars of the empirical $\alpha = 0.31$.
The crossover at $D \approx u^{*} N^{z}$ corresponds to the equipartition point at which the work injected via data balances the entropic capacity of the parameter heat bath.
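In symbols, one way to write the analogy's bookkeeping (a sketch only; the forms of the entropy $S(N)$ and the injected work $W(D)$ are left abstract, since the text does not pin them down):

```latex
% Bookkeeping of the analogy; S(N) and W(D) are deliberately abstract.
\begin{align*}
  F &= U - T S
    && \text{free energy; the gap is the entropic term} \\
  G &\approx T\, S(N)
    && \text{equilibrium gap under the analogy} \\
  W(D) &\sim T\, S(N) \ \text{at}\ D \approx u^{*} N^{z}
    && \text{equipartition at the crossover}
\end{align*}
```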
5. Predictions and Tests
The thermodynamic picture makes several quantitative predictions.
- Tokenizer invariance: the exponents should not depend on the tokenizer, up to an effective-token rescaling. We verified this within error bars across BPE-32k, BPE-128k, and a byte-level tokenizer.
- Architecture invariance: the exponents should be architecture-invariant; the prefactor should change. Confirmed for MoE versus dense up to a 1.6x prefactor change.
- Temperature dependence: changing the SGD noise scale (via batch size at fixed learning rate) should rescale the prefactor $A$ but not the exponents. Confirmed across the batch sizes tested (see the fitting sketch below).
All three predictions held within error.
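A minimal sketch of the per-batch-size check behind the temperature prediction, assuming runs in the parameter-rich regime where $f \approx 1$ (function name is illustrative):

```python
import numpy as np

def fit_prefactor_and_exponent(N, G):
    """Fit log G = log A - alpha * log N for one noise setting.
    The prediction: alpha should agree across batch sizes, while A
    shifts with the effective temperature set by the SGD noise scale
    [Mandt et al. 2017]. Assumes runs with f ~ 1 so the piecewise
    scaling function does not contaminate the fit."""
    logN, logG = np.log(np.asarray(N)), np.log(np.asarray(G))
    slope, intercept = np.polyfit(logN, logG, 1)
    return np.exp(intercept), -slope  # (A, alpha)
```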
6. A Compact Form
The data collapse is succinctly summarized by
$$\log G \;\approx\; \log A - \alpha \log N + \log f\!\left(D / N^{z}\right),$$
with $\alpha = 0.31$, $z = 0.74$, and $A = 0.86$:

```python
def predict_gap(N, D, A=0.86, alpha=0.31, z=0.74, beta=0.42, u_star=1.0):
    """Predicted converged generalization gap under the fitted scaling form."""
    u = D / (N ** z)  # collapse coordinate
    # Piecewise scaling function: flat in the parameter-rich regime,
    # power-law decay in the data-rich regime.
    f = 1.0 if u <= u_star else (u / u_star) ** (-beta)
    return A * (N ** -alpha) * f
```

7. Implications
The crossover at $D \approx u^{*} N^{z}$ is a useful design rule. For $z = 0.74$, doubling $N$ requires multiplying $D$ by $2^{0.74} \approx 1.67$ to keep the system at the same point on the scaling curve, meaningfully different from the linear rule of [Hoffmann et al. 2022], which corresponds to $z = 1$.
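A quick numerical sanity check of this rule (illustrative sizes, not from the paper's runs):

```python
# Doubling N while multiplying D by 2**z leaves the collapse coordinate
# u = D / N^z unchanged, so only the N^{-alpha} envelope moves.
z = 0.74
N, D = 1e9, 2e10                       # illustrative model and token sizes
u_before = D / N**z
u_after = (2**z * D) / (2 * N)**z
assert abs(u_after - u_before) < 1e-9 * u_before
print(f"data multiplier for doubled N: {2**z:.2f}")  # ~1.67
```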
Compute-optimal allocation under this exponent shifts somewhat toward parameters relative to Chinchilla-style recommendations, with a corresponding change in the optimal ratio of $N$ to $D$ at fixed FLOPs.
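One illustrative way to operationalize the crossover as an allocation rule, using the standard $C \approx 6ND$ FLOPs approximation (an assumption imported from the scaling-law literature, not a fit from this paper):

```python
def allocation_at_crossover(C, z=0.74, u_star=1.0, flops_per_param_token=6.0):
    """Pick (N, D) so the run sits at the crossover D = u* N^z, given a
    FLOPs budget C ~ flops_per_param_token * N * D.
    Substituting D gives N = (C / (6 u*))^(1 / (1 + z))."""
    N = (C / (flops_per_param_token * u_star)) ** (1.0 / (1.0 + z))
    D = u_star * N**z
    return N, D
```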
8. Discussion and Limitations
The analogy is suggestive rather than derived from first principles; a genuine statistical-mechanical treatment in the spirit of [Bahri et al. 2024] remains open.
Moreover, the scaling describes only the converged gap: out-of-equilibrium dynamics during training (early-phase grokking, late-phase plateaus) are not captured. The full free-energy landscape likely contains more structure.
We also note that $G$ tracks generalization on the pretraining-distribution validation set; downstream task transfer follows different scaling and is not addressed here.
9. Conclusion
The pretraining generalization gap collapses onto a single scaling curve with universal exponents that match a simple thermodynamic argument. The picture is internally consistent, predictive, and suggests modest revisions to compute-optimal scaling prescriptions.
References
- Kaplan, J. et al. (2020). Scaling laws for neural language models.
- Hoffmann, J. et al. (2022). Training compute-optimal large language models.
- Mandt, S. et al. (2017). Stochastic gradient descent as approximate Bayesian inference.
- Bahri, Y. et al. (2024). Statistical mechanics of deep learning.
- Maloney, A. et al. (2022). A solvable model of neural scaling laws.