Double Descent in Practice: Reproducing the Interpolation Threshold Phenomenon with Random Features Models
Introduction
Classical statistical learning theory predicts a U-shaped bias-variance tradeoff: as model complexity increases, test error first decreases (reducing bias) then increases (due to variance). Modern deep learning practice contradicts this—very large, overparameterized models generalize well despite having far more parameters than training samples.
Belkin et al.[belkin2019reconciling] reconciled these observations by identifying the double descent curve, which subsumes the classical U-shape. The curve exhibits three regimes: (1) the classical regime where increasing capacity reduces error, (2) a critical peak at the interpolation threshold where the model has just enough capacity to fit the training data, and (3) the modern regime where further overparameterization yields smoother interpolating solutions.
Nakkiran et al.[nakkiran2019deep] demonstrated that double descent occurs not only as a function of model size, but also as a function of training epochs, and showed that label noise amplifies the phenomenon.
In this work, we reproduce these phenomena using a clean experimental setup: random ReLU features models with minimum-norm least-squares fitting on synthetic regression data. This setup, inspired by the theoretical framework of Advani & Saxe[advani2017high], provides an ideal testbed because the interpolation threshold sits exactly at p = n (number of features equals number of training samples), and the solution is computed in closed form.
Methods
Data Generation
We generate synthetic regression data X ∈ R^{n×d} with entries drawn i.i.d. from a standard normal distribution, true weights β ∈ R^d, and targets y = Xβ + ε where ε ~ N(0, σ²I). We use n = 200 training samples and d = 20 input dimensions, with noise levels σ ∈ {0.1, 0.5, 1.0}.
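Under the assumptions above, data generation can be sketched as follows (the test-set size and the scaling of the true weights are assumptions, not taken from the paper):

```python
import numpy as np

def make_data(n_train=200, n_test=1000, d=20, sigma=0.5, seed=0):
    """Generate synthetic linear regression data with Gaussian label noise."""
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d) / np.sqrt(d)   # true weights (assumed scaling)
    X_train = rng.standard_normal((n_train, d))
    X_test = rng.standard_normal((n_test, d))
    y_train = X_train @ beta + sigma * rng.standard_normal(n_train)
    y_test = X_test @ beta + sigma * rng.standard_normal(n_test)
    return X_train, y_train, X_test, y_test
```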
Random Features Model
We employ a two-layer model with a fixed random first layer, f(x) = θᵀ ReLU(Wx + b). Here W ∈ R^{p×d} and b ∈ R^p are fixed random projections, and θ ∈ R^p is fit via minimum-norm least squares, θ̂ = Φ⁺y, where Φ is the feature matrix and Φ⁺ is its Moore-Penrose pseudoinverse. The number of trainable parameters is exactly p.
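A minimal sketch of this model and its closed-form fit, assuming a standard random ReLU feature construction (the 1/√d scaling of the random layer, and the omission of the bias term, are assumptions):

```python
import numpy as np

def relu_features(X, W):
    """Random ReLU features: phi(x) = max(Wx, 0), with W held fixed."""
    return np.maximum(X @ W.T, 0.0)

def fit_min_norm(X, y, p, seed=0):
    """Fit theta = pinv(Phi) @ y, the minimum-norm least-squares solution."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((p, d)) / np.sqrt(d)  # fixed random first layer
    Phi = relu_features(X, W)
    theta = np.linalg.pinv(Phi) @ y               # Moore-Penrose pseudoinverse
    return W, theta

def predict(X, W, theta):
    return relu_features(X, W) @ theta
```

For p ≥ n (and Φ of full row rank), this solution interpolates the training set exactly.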
Experimental Design
Model-wise sweep. We vary p from 10 to 1000 (24 values), with dense sampling near the interpolation threshold p = n = 200. For each p, we compute train and test MSE. This is repeated at three noise levels.
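The width grid with dense sampling near the threshold could be built as follows (the exact grid values here are illustrative assumptions, not the paper's 24-point grid):

```python
import numpy as np

# Coarse log-spaced widths over [10, 1000], plus extra points
# near the interpolation threshold p = n = 200.
coarse = np.unique(np.logspace(1, 3, 15).astype(int))
dense = np.arange(170, 231, 10)                  # dense sampling near p = n
p_grid = np.unique(np.concatenate([coarse, dense]))
```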
MLP comparison. For comparison, we train 2-layer MLPs with varying hidden width using Adam optimization (lr=0.001, 4000 epochs, no regularization).
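The MLP training setup described above can be sketched with PyTorch (the exact architecture and data handling are assumptions; the optimizer settings follow the text):

```python
import torch
import torch.nn as nn

def train_mlp(X, y, hidden, epochs=4000, lr=1e-3, seed=0):
    """Train a 2-layer ReLU MLP with full-batch Adam and no regularization."""
    torch.manual_seed(seed)
    X_t = torch.as_tensor(X, dtype=torch.float32)
    y_t = torch.as_tensor(y, dtype=torch.float32).reshape(-1, 1)
    model = nn.Sequential(nn.Linear(X_t.shape[1], hidden), nn.ReLU(),
                          nn.Linear(hidden, 1))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X_t), y_t)
        loss.backward()
        opt.step()
    return model
```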
Variance estimation. We repeat the random features sweep with 3 different random seeds to quantify variability.
Reproducibility controls. All dependencies are version-pinned in requirements.txt, every stochastic component is seeded, and the pipeline emits a SHA-256 fingerprint of scientific outputs. The validator recomputes this fingerprint from results.json to catch stale or corrupted artifacts before claims are made.
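Such a fingerprint can be computed by hashing a canonical JSON serialization of the results; the function below is an illustrative sketch, not the pipeline's actual implementation:

```python
import hashlib
import json

def results_fingerprint(results: dict) -> str:
    """SHA-256 over a canonical (sorted-key, compact) JSON serialization."""
    canonical = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys and fixing separators makes the hash independent of dict insertion order, so the validator can recompute it byte-for-byte from `results.json`.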
Results
Model-Wise Double Descent
Our experiments reveal a dramatic double descent curve. At low noise (σ = 0.1), test MSE drops from 10.0 at p = 10 to a classical-regime minimum of 1.3, then spikes to 312.0 at p = n = 200 (the interpolation threshold), before decreasing to 0.11 at p = 1000. This represents a peak-to-minimum ratio of approximately 2,800.
At the highest noise level (σ = 1.0), the absolute peak is even larger (1,573), though the peak-to-minimum ratio is somewhat lower because the baseline test error is higher. This confirms that label noise amplifies the interpolation peak in absolute terms.
Train-Test Decomposition
Training MSE decreases monotonically with p and reaches exactly zero at p = n. This is expected: at the threshold, the linear system Φθ = y is exactly determined (assuming Φ has full rank). For p > n, the system is underdetermined and the minimum-norm solution achieves zero training error.
The critical insight is that the unique interpolating solution at p = n is typically highly irregular, while the minimum-norm solution for p > n is smoother. This explains the test error peak at p = n.
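This irregularity shows up directly in the norm of the fitted coefficients. The sketch below (setup details, including the random targets, are assumptions made for illustration) compares ‖θ‖ for the unique interpolating solution at p = n with the minimum-norm solution at p ≫ n:

```python
import numpy as np

def min_norm_solution_norm(n, p, d=20, seed=0):
    """Norm of the minimum-norm least-squares fit with p random ReLU features."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)                    # arbitrary targets for illustration
    W = rng.standard_normal((p, d)) / np.sqrt(d)  # fixed random first layer
    Phi = np.maximum(X @ W.T, 0.0)
    theta = np.linalg.pinv(Phi) @ y
    return float(np.linalg.norm(theta))

# The interpolating solution at p = n typically has a much larger norm
# than the minimum-norm solution at p >> n.
at_threshold = min_norm_solution_norm(n=100, p=100)
overparam = min_norm_solution_norm(n=100, p=1000)
```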
MLP Comparison
Trained MLPs show a qualitatively similar pattern but with a less pronounced peak. The MLP test error peaks near the hidden width at which the total parameter count is approximately n, and gradually decreases for larger widths. The gentler peak is attributable to Adam's implicit regularization.
Discussion
Our results provide a clean, fast, and reproducible demonstration of the double descent phenomenon. The random features setup is ideal for this purpose because: (1) the interpolation threshold is exactly at p = n, (2) the solution is computed analytically via the pseudoinverse, (3) the entire experiment runs in seconds, and (4) the effect is extremely pronounced (peak-to-minimum ratios of 500--3000).
Limitations. Our setup uses synthetic linear-in-features regression, which is a simplified model. Real deep learning architectures exhibit double descent with additional complexities such as optimization dynamics, implicit regularization from SGD, and non-linear feature learning. Our MLP comparison partially addresses this gap. Additionally, epoch-wise double descent may not manifest with tiny MLPs and the Adam optimizer; it typically requires larger models, SGD, and longer training[nakkiran2019deep].
Broader implications. The double descent phenomenon has important practical implications: (1) the traditional approach of selecting model complexity via a validation set may miss the overparameterized regime, (2) adding more parameters can improve rather than hurt generalization, and (3) the interpolation threshold is a dangerous regime to be avoided in practice.
References
[belkin2019reconciling] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849--15854, 2019.
[nakkiran2019deep] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.
[advani2017high] M. S. Advani and A. M. Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: double-descent-in-practice
description: Systematically reproduce the double descent phenomenon (Nakkiran et al. 2019, Belkin et al. 2019) using random features models and MLPs on synthetic regression data. Demonstrates model-wise double descent, noise amplification, epoch-wise dynamics, and variance analysis — all on CPU in about 15-25 seconds.
allowed-tools: Bash(python *), Bash(python3 *), Bash(pip *), Bash(.venv/*), Bash(cat *), Read, Write
---
# Double Descent in Practice
This skill reproduces the **double descent phenomenon** — where test error first decreases, then increases sharply at the interpolation threshold, then decreases again — using random ReLU features models and trained MLPs on synthetic data.
## Prerequisites
- Requires **Python 3.10+**. No internet access or GPU needed.
- Expected runtime: **about 15-25 seconds** on CPU.
- All commands must be run from the **submission directory** (`submissions/double-descent/`).
## Step 1: Environment Setup
Create a virtual environment and install dependencies:
```bash
python3 -m venv .venv
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -r requirements.txt
```
Verify all packages are installed:
```bash
.venv/bin/python -c "import torch, numpy, scipy, matplotlib; print(f'torch={torch.__version__}'); print('All imports OK')"
```
Expected output:
```
torch=2.6.0
All imports OK
```
## Step 2: Run Unit Tests
Verify the analysis modules work correctly:
```bash
.venv/bin/python -m pytest tests/ -v
```
Expected: All tests pass (49 tests). Exit code 0.
## Step 3: Run the Analysis
Execute the full double descent analysis:
```bash
.venv/bin/python run.py
```
Expected: Script completes in about 15-25 seconds on CPU. Prints progress `[1/6]` through `[6/6]` and exits with code 0.
Also prints a deterministic `Results fingerprint: <sha256>`.
This will:
1. Generate synthetic noisy regression data (n=200, d=20).
2. Sweep random-feature width from 10 to 1000, crossing the interpolation threshold at p=200, for 3 noise levels (sigma=0.1, 0.5, 1.0).
3. Sweep MLP hidden width for comparison.
4. Track MLP test loss over epochs at the interpolation threshold.
5. Repeat with 3 random seeds for variance estimation.
6. Generate 5 publication-quality plots and a summary report.
Output files created in `results/`:
- `results.json` — all raw experimental data.
- `report.md` — summary of findings.
- `model_wise_double_descent.png` — test MSE vs. feature count (3 noise levels).
- `noise_comparison.png` — overlay showing noise amplifies double descent.
- `epoch_wise_double_descent.png` — test MSE vs. training epoch at threshold.
- `mlp_comparison.png` — random features vs. trained MLP side-by-side.
- `variance_bands.png` — mean +/- std across random seeds.
## Step 4: Validate Results
Check that results were produced correctly and double descent was detected:
```bash
.venv/bin/python validate.py
```
Expected output includes:
- Runtime under 180s.
- Fingerprint check passes (`Fingerprint OK ...`).
- Peak/min ratio >> 1 for all noise levels (confirming double descent).
- All 5 plot files present.
- Report generated.
- Final line: `Validation passed.`
## Step 5: Review the Report
Read the generated summary:
```bash
cat results/report.md
```
Expected: Markdown report with setup, results tables, and key findings including:
- Model-wise double descent confirmed with peak at p=n=200.
- Peak-to-minimum ratio of several hundred to several thousand.
- Noise amplification effect.
- Benign overfitting in the overparameterized regime.
## How to Extend
### Different data dimensions
In `src/sweep.py`, modify `run_all_sweeps()` config parameters:
- Change `d` for different input dimensions.
- Change `n_train` to shift the interpolation threshold.
- Change `noise_levels` to explore different noise regimes.
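As a hypothetical example, an extended sweep might look like this (the parameter names follow the skill's description above; treat the exact values and call shape as assumptions):

```python
# Illustrative extended configuration; run_all_sweeps(config=...) follows
# the signature referenced in this skill file.
from src.sweep import run_all_sweeps

run_all_sweeps(config={
    "d": 50,                          # higher input dimension
    "n_train": 400,                   # shifts the interpolation threshold to p = 400
    "noise_levels": [0.1, 1.0, 2.0],  # wider noise range
})
```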
### Different variance setting
- Set `variance_noise_std` in `run_all_sweeps(config=...)` to choose which noise level is used for seed-wise variance bands.
- If omitted, the variance study defaults to the highest noise level from `noise_levels`.
### Different model types
- Add new model classes in `src/model.py` (e.g., deeper MLPs, random Fourier features).
- Create corresponding sweep functions in `src/sweep.py`.
### Classification tasks
- Modify `src/data.py` to generate classification data.
- Replace MSE with cross-entropy loss in `src/training.py`.
- Update analysis metrics accordingly.
### Regularization study
- Add weight decay or dropout to the MLP in `src/training.py`.
- Compare double descent curves with/without regularization.
## Key Scientific References
1. Nakkiran et al. (2019) "Deep Double Descent: Where Bigger Models and More Data Hurt" — arXiv:1912.02292
2. Belkin et al. (2019) "Reconciling Modern Machine Learning Practice and the Classical Bias-Variance Trade-off" — PNAS 116(32)
3. Advani & Saxe (2017) "High-dimensional dynamics of generalization error in neural networks" — arXiv:1710.03667