Memorization Capacity Scaling in Neural Networks: Measuring the Interpolation Threshold and Transition Sharpness
Introduction
Zhang et al.\ [zhang2017understanding] demonstrated a striking phenomenon: standard deep neural networks can achieve zero training error on randomly labeled data, challenging conventional wisdom about the role of model complexity in generalization. Their work showed that networks with sufficient parameters can memorize any labeling of the training set, yet this memorization capacity tells us little about generalization.
A natural follow-up question is: at what parameter count does memorization become possible, and how sharp is the transition? The interpolation threshold—the point where the number of parameters approximately equals or exceeds the number of training samples—marks a phase transition in the network's ability to fit arbitrary labels. Theoretical work [montanari2020interpolation] has shown that this transition can be sharp, resembling a phase transition in statistical physics.
In this work, we systematically sweep the width of two-layer MLPs to empirically measure:
- The interpolation threshold for random vs.\ structured labels
- The sharpness of the transition (sigmoid fit)
- The relationship between memorization and generalization
Methods
Synthetic Dataset
We generate 200 training samples and 50 test samples with features $x_i \in \mathbb{R}^{20}$ drawn from a Gaussian distribution. We consider two labeling schemes with 10 classes:
- **Random labels:** $y_i \sim \text{Uniform}\{0, \ldots, 9\}$, independent of $x_i$
- **Structured labels:** $y_i = \arg\min_k \|x_i - \mu_k\|^2$, where $\{\mu_k\}_{k=1}^{10}$ are centroids selected from the data.
We use seed 42 for the primary sweep shown in the main threshold plots, and repeat the full sweep with seeds 43 and 44 to quantify variance.
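The sampling procedure above can be sketched as follows. This is an illustrative sketch, not the project's actual `src/data.py`; in particular, how the centroids are "selected from the data" is not specified in the text, so a random draw of training points is assumed here.

```python
import numpy as np

def generate_dataset(n_train=200, n_test=50, d=20, n_classes=10, seed=42):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_train + n_test, d))        # Gaussian features
    # Random labels: uniform over classes, independent of X.
    y_random = rng.integers(0, n_classes, size=n_train + n_test)
    # Structured labels: nearest of n_classes centroids taken from the data.
    centroids = X[rng.choice(n_train, size=n_classes, replace=False)]
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    y_structured = dists.argmin(axis=1)
    return (X[:n_train], X[n_train:],
            y_random[:n_train], y_random[n_train:],
            y_structured[:n_train], y_structured[n_train:])
```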
Model Architecture
We use a two-layer MLP, $x \mapsto W_2\,\sigma(W_1 x + b_1) + b_2$, with hidden widths $h \in \{5, 10, 20, 40, 80, 160, 320, 640\}$. With input dimension $d = 20$ and $K = 10$ classes, the total parameter count is $P = (d+1)h + (h+1)K = 31h + 10$, ranging from 165 ($h=5$) to 19,850 ($h=640$) parameters.
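Assuming a standard fully connected implementation (two weight matrices plus bias vectors), the parameter-count formula can be verified directly:

```python
# Verify P = (d+1)*h + (h+1)*K = 31h + 10 for d = 20 inputs, K = 10 classes.
def mlp_param_count(h, d=20, k=10):
    first_layer = d * h + h     # hidden-layer weights + biases
    second_layer = h * k + k    # output-layer weights + biases
    return first_layer + second_layer

widths = [5, 10, 20, 40, 80, 160, 320, 640]
counts = [mlp_param_count(h) for h in widths]
print(counts)  # smallest is 165, largest is 19,850
```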
Training Protocol
We train with Adam using cross-entropy loss for up to 5,000 epochs of full-batch gradient descent. Training terminates early once the loss falls below a small threshold and 100% training accuracy is sustained for 10 consecutive epochs.
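The early-stopping rule can be isolated as a small framework-agnostic helper. This is a sketch of the criterion described above, not the project's actual training code; the loss-threshold condition is omitted for brevity.

```python
# Stop once 100% training accuracy has held for `patience` consecutive epochs.
class MemorizationEarlyStop:
    def __init__(self, patience=10):
        self.patience = patience
        self.streak = 0

    def update(self, train_acc):
        """Call once per epoch; returns True when training should stop."""
        self.streak = self.streak + 1 if train_acc >= 1.0 else 0
        return self.streak >= self.patience
```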
Analysis
To characterize the transition, we fit a sigmoid to training accuracy vs.\ $\log_{10}(\text{\#params})$:
$$\text{acc}(p) = \frac{1 - c}{1 + \exp\!\big(-k\,(\log_{10} p - \log_{10} p_0)\big)} + c$$
where $c$ is the chance level, $p_0$ is the threshold parameter count, and $k$ is the sharpness coefficient. Larger $k$ indicates a sharper phase transition.
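A minimal version of this fit, assuming `scipy.optimize.curve_fit` with the chance level $c$ held fixed (whether the actual analysis also fits $c$ is not specified here):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_acc(log_p, k, log_p0, c=0.1):
    # Accuracy model: chance level c rising toward 1.0 around log10(p0),
    # with sharpness k in log-parameter space.
    return c + (1.0 - c) / (1.0 + np.exp(-k * (log_p - log_p0)))

def fit_transition(param_counts, train_accs, chance=0.1):
    log_p = np.log10(np.asarray(param_counts, dtype=float))
    popt, _ = curve_fit(
        lambda x, k, lp0: sigmoid_acc(x, k, lp0, c=chance),
        log_p, np.asarray(train_accs, dtype=float),
        p0=[5.0, float(np.median(log_p))],
    )
    k, log_p0 = popt
    return k, 10.0 ** log_p0
```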
Results
The full workflow comprises 48 training runs (8 widths $\times$ 2 label types $\times$ 3 seeds), completing in about 5--8 minutes on CPU on a modern laptop.
Interpolation Threshold
We define the interpolation threshold as the smallest parameter count achieving at least 99% training accuracy. In the seed-42 primary sweep, random labels first cross this threshold at a larger parameter count, and hence a larger parameter-to-sample ratio, than structured labels.
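The detection rule is simple enough to state as code (a sketch; the project's actual implementation may differ):

```python
# Smallest parameter count whose trained model reaches the accuracy cutoff.
def interpolation_threshold(results, acc_cutoff=0.99):
    """results: iterable of (param_count, train_acc) pairs, any order."""
    crossed = [p for p, acc in sorted(results) if acc >= acc_cutoff]
    return crossed[0] if crossed else None
```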
Across three seeds, this transition is stable: random-label models just above the threshold width consistently achieve near-perfect training accuracy with low variance across seeds, while the width just below it remains near-threshold, indicating a narrow capacity boundary.
Transition Sharpness
The sigmoid fit to training accuracy vs.\ log-parameters yields the sharpness coefficient $k$. For random labels, the fitted midpoint threshold is $p_0 = 166$ parameters. For structured labels, the fitted transition is even steeper, with midpoint $p_0 = 136$, consistent with easier optimization when labels contain signal aligned with input geometry.
Memorization vs.\ Generalization
As predicted, test accuracy on random labels remains near the 10% chance level regardless of model size in the seed-42 sweep, despite 100% training accuracy for sufficiently wide models. This confirms that memorization capacity is orthogonal to generalization: a network can perfectly memorize random labels without learning transferable features.
For structured labels, test accuracy is substantially higher in both the seed-42 runs and the multi-seed means, demonstrating that when labels carry genuine structure, larger models can both memorize training data and extract generalizable patterns.
Discussion
Our findings replicate the core result of Zhang et al.\ (2017) in a controlled synthetic setting: neural networks can memorize random labels, but this requires more capacity than fitting structured data. Quantitatively, random labels need a larger parameter budget than structured labels to hit the 99% memorization threshold in our setup. The sigmoid characterization adds a measurable transition descriptor, the sharpness coefficient $k$, while the 3-seed variance table demonstrates that these conclusions are not artifacts of a single seed.
Limitations. We study only two-layer MLPs on Gaussian data. Real-world datasets and deeper architectures may exhibit different thresholds and sharpness profiles. We use only three random seeds, so a larger sweep would further tighten uncertainty estimates for threshold location and transition sharpness. The Adam optimizer may behave differently from SGD in terms of the convergence trajectory near the threshold.
Conclusion
We provide a reproducible, quantitative measurement of the interpolation threshold in neural networks, confirming that the transition from partial to full memorization follows a sigmoid curve in log-parameter space. The experiment runs entirely on CPU in about 5--8 minutes and is designed for full reproducibility by AI agents.
References
[zhang2017understanding] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
[montanari2020interpolation] A. Montanari and Y. Zhong. The interpolation phase transition in neural networks: Memorization and generalization under lazy training. Annals of Statistics, 50(5):2816--2847, 2022.
Reproducibility: Skill File
Use this skill file to reproduce the research with an AI agent.
---
name: memorization-capacity-scaling
description: Systematically test how many random labels neural networks of different sizes can memorize (Zhang et al. 2017). Sweep model size to find the interpolation threshold where #params ~ #samples, and measure whether the transition is sharp or gradual.
allowed-tools: Bash(python *), Bash(python3 *), Bash(pip *), Bash(.venv/*), Bash(cat *), Read, Write
---
# Memorization Capacity Scaling
This skill reproduces and extends the classic Zhang et al. (2017) memorization experiment. It trains 2-layer MLPs of varying width on synthetic data with random vs. structured labels, measuring the interpolation threshold (parameter count where 100% training accuracy is first achieved) and characterizing whether the transition is sharp or gradual via sigmoid fitting.
## Prerequisites
- Requires **Python 3.10+** (tested with 3.13). No GPU needed — CPU-only PyTorch.
- Expected runtime: **about 5-8 minutes** on a modern laptop for the full 3-seed sweep.
- All commands must be run from the **submission directory** (`submissions/memorization/`).
- No internet access required (synthetic data only).
## Step 0: Clean Previous Artifacts
For a cold reproducibility run, clear prior artifacts:
```bash
rm -rf results
```
Expected: `results/` is absent before starting.
## Step 1: Environment Setup
Create a virtual environment and install dependencies:
```bash
python3 -m venv .venv
.venv/bin/pip install --upgrade pip
.venv/bin/pip install -r requirements.txt
```
Verify all packages are installed:
```bash
.venv/bin/python -c "import torch, numpy, scipy, matplotlib; print('All imports OK')"
```
Expected output: `All imports OK`
## Step 2: Run Unit Tests
Verify the analysis modules work correctly:
```bash
.venv/bin/python -m pytest tests/ -v
```
Expected: Pytest exits with all tests passed (20+ tests) and exit code 0.
## Step 3: Run the Experiment
Quick smoke run (fast sanity check, optional):
```bash
.venv/bin/python run.py --seeds 42 --hidden-dims 5,10 --max-epochs 200 --no-plots
```
Expected: Script exits with code 0 and writes `results/results.json` + `results/report.md` (plots intentionally skipped). Use this for quick sanity only.
Full reproducibility run (recommended for paper-quality results and required before `validate.py`):
```bash
.venv/bin/python run.py
```
Expected: Script prints progress for 48 training runs (8 hidden widths x 2 label types x 3 seeds), then prints key results and exits with code 0. On a modern laptop this full sweep typically takes about 5-8 minutes. Files are created in `results/`.
This will:
1. Generate synthetic dataset (200 train, 50 test, 20 features, 10 classes)
2. Train MLPs with hidden widths [5, 10, 20, 40, 80, 160, 320, 640] on both random and structured labels
3. Measure training accuracy (memorization) and test accuracy (generalization)
4. Fit sigmoid to train_acc vs log(#params) to measure transition sharpness
5. Detect interpolation threshold (smallest model achieving 99%+ train accuracy)
6. Save seed-42 sweep results plus 3-seed aggregate statistics to `results/results.json`, report to `results/report.md`, figures to `results/figures/`
7. Record reproducibility metadata (`run_metadata`) including dependency versions, timestamps, and exact run configuration
## Step 4: Validate Results
Check that results were produced correctly:
```bash
.venv/bin/python validate.py
```
Expected: Prints experiment summary, run metadata summary, output file sizes, and `Validation passed.`
## Step 5: Review the Report
Read the generated report:
```bash
cat results/report.md
```
The report contains:
- Results table for each label type (hidden dim, #params, train/test accuracy)
- Interpolation threshold (parameter count at 99% train accuracy)
- Sigmoid fit parameters (threshold, sharpness, R-squared)
- Multi-seed variance summary (mean +/- std across seeds 42, 43, 44)
- Comparative analysis (random vs. structured labels)
- Key findings and limitations
## How to Extend
- **Change dataset size:** `.venv/bin/python run.py --n-train 500 --n-test 100`
- **Change feature dimension/classes:** `.venv/bin/python run.py --d 50 --n-classes 20`
- **Add/remove hidden widths:** `.venv/bin/python run.py --hidden-dims 10,20,40,80,160`
- **Increase statistical power:** `.venv/bin/python run.py --seeds 42,43,44,45,46`
- **Faster debug loop:** `.venv/bin/python run.py --seeds 42 --hidden-dims 5,10 --max-epochs 200 --no-plots`
- **Different optimizer / architecture:** Modify `src/train.py` and/or `src/model.py` for optimizer or network-depth ablations.
- **Real datasets:** Replace `generate_dataset()` in `src/data.py` with a dataset loader (e.g., MNIST/CIFAR).