Label Noise Tolerance Does Not Scale with Model Size: A Controlled Study Across 4 Architectures and 6 Noise Rates
\section{Introduction}
The empirical observation that modern neural networks can fit randomly labeled training data without any modification to their training procedure (Zhang et al., 2017) raised fundamental questions about the interplay between model capacity and generalization. Zhang and colleagues demonstrated that standard architectures achieve near-zero training loss on CIFAR-10 even when every label is replaced with a uniformly random class assignment, yet these same networks generalize well on correctly labeled data. This paradox—that models with sufficient capacity to memorize arbitrary noise still learn meaningful patterns—spawned a productive line of theoretical work on implicit regularization, double descent, and benign overfitting. However, the practical implication often drawn from this work is misleading: the assumption that larger models can simply absorb noise and thereby tolerate it without meaningful performance degradation.
Arpit et al. (2017) provided early evidence that deep networks do not treat clean and noisy examples identically during training. Their work showed that networks learn simple, generalizable patterns first and only later begin memorizing atypical or corrupted examples—a phenomenon they termed the "easy-to-hard" learning progression. This temporal ordering suggests a mechanism by which moderate noise might be tolerable: if training is stopped early enough, the model captures signal but not noise. Yet this argument is silent on the question of how model width modulates this progression. Bartlett et al. (2021) formalized conditions under which overparameterization leads to benign overfitting in linear classifiers and leaky-ReLU networks, showing that interpolation of noisy training data can coexist with near-optimal test error when the feature covariance spectrum decays sufficiently fast. Their theoretical guarantees, however, apply to specific distributional settings that may not hold in practice for image classification with convolutional or attention-based architectures.
We argue that the field has conflated two distinct claims: (1) overparameterized models can memorize noise, and (2) overparameterized models tolerate noise without performance loss. Claim (1) is well-established. Claim (2) is the subject of this paper, and we show it to be false in a controlled experimental setting. By systematically varying model width across four architectures and six noise rates on two standard benchmarks, we find that the Noise Tolerance Ratio (NTR)—a metric we define to quantify relative performance degradation under noise—actually worsens with increasing model size for noise rates of 20% and above. The relationship follows a power law, $\mathrm{NTR} \propto w^{\alpha}$ with $\alpha > 0$, where $w$ is the width multiplier. Larger models memorize a greater fraction of corrupted labels, and this memorization translates directly into reduced test-time accuracy on clean data. The benign-overfitting narrative, while theoretically elegant, does not describe the empirical reality of modern vision architectures trained under realistic label noise.
\section{Related Work}
\subsection{Memorization and Generalization in Overparameterized Networks}
Zhang et al. (2017) showed that standard training achieves zero training error on randomly labeled data, proving that classical complexity measures (VC dimension, Rademacher complexity) cannot explain deep network generalization. Arpit et al. (2017) refined this by demonstrating that networks learn generalizable patterns first and memorize instance-specific noise later, but their study used fixed-width architectures and did not test how scaling capacity affects this tradeoff. Nakkiran et al. (2021) introduced the double-descent curve, where test error exhibits non-monotonic behavior as model complexity grows. However, double descent was characterized primarily at the interpolation threshold and does not address whether capacity beyond that threshold helps or hurts under label noise.
\subsection{Label Noise Robustness Methods}
Natarajan et al. (2013) derived unbiased loss corrections for binary classification under class-conditional noise, but their framework requires accurate noise matrix estimation, which is difficult in practice. Han et al. (2018) proposed Co-teaching, where two networks select small-loss examples for each other, exploiting the early-learning phenomenon; however, Co-teaching doubles computational cost and degrades above 40% noise. Liu et al. (2020) introduced Early-Learning Regularization (ELR), penalizing deviations from early-training predictions, but ELR assumes early predictions are reliable—an assumption that fails for high noise or fast-memorizing architectures. None of these methods analyze how their effectiveness interacts with model width.
\subsection{Benign Overfitting Theory}
Bartlett et al. (2021) established conditions for benign overfitting, requiring the feature covariance matrix to have sufficient effective rank in its tail eigenvalue spectrum so noise can be spread across irrelevant directions. Their guarantees hold for linear classifiers and two-layer leaky-ReLU networks but have not been extended to deep convolutional or attention-based architectures. Song et al. (2022) surveyed learning with noisy labels across loss correction, sample selection, regularization, and meta-learning, noting that most methods are benchmarked on fixed architectures (ResNet-34 or WideResNet-28-10) and that the model-scale dimension remains underexplored.
\section{Methodology}
\subsection{Noise Tolerance Ratio (NTR)}
We define the Noise Tolerance Ratio (NTR) as a scalar measure of how much a model's test-time performance degrades under label noise relative to its clean-data performance. For a given architecture $\mathcal{A}$, width multiplier $w$, dataset $D$, and noise rate $\eta$, let $A_{\text{clean}}(\mathcal{A}, w, D)$ denote the mean test accuracy (over 10 seeds) when trained on clean labels, and $A_{\text{noisy}}(\mathcal{A}, w, D, \eta)$ the mean test accuracy when trained on labels corrupted at rate $\eta$. The NTR is:
\begin{equation}
\mathrm{NTR}(\mathcal{A}, w, D, \eta) = \frac{A_{\text{clean}}(\mathcal{A}, w, D)}{A_{\text{noisy}}(\mathcal{A}, w, D, \eta)}
\end{equation}
An NTR of 1.0 indicates perfect noise tolerance: the model achieves identical test accuracy regardless of label corruption. Values greater than 1.0 indicate degradation. The key question is whether NTR decreases (improves) with increasing $w$ for fixed $\mathcal{A}$, $D$, and $\eta$. Under the benign-overfitting hypothesis, we would expect $\mathrm{NTR} \to 1$ for sufficiently large $w$, since additional capacity should absorb noise without degrading generalization. We test this prediction empirically.
To quantify the relationship between NTR and width, we fit a power-law model in log-log space:
\begin{equation}
\log \mathrm{NTR} = \alpha \log w + \beta + \epsilon
\end{equation}
where $\alpha$ is the scaling exponent, $\beta$ is the intercept, and $\epsilon$ is Gaussian noise. A positive $\alpha$ indicates that larger models are less noise-tolerant. We report $\alpha$ with 95% confidence intervals computed via bootstrap resampling over the 10 seeds per configuration.
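As a concrete illustration, the NTR computation and the log-log OLS fit can be sketched in a few lines of Python. The function names and the numeric values below are our own illustrative choices, not the paper's code or measurements.

```python
import numpy as np

def noise_tolerance_ratio(acc_clean, acc_noisy):
    """NTR = clean accuracy / noisy accuracy; 1.0 means perfect tolerance."""
    return acc_clean / acc_noisy

def fit_scaling_exponent(widths, ntrs):
    """OLS fit of log NTR = alpha * log w + beta; returns (alpha, beta)."""
    alpha, beta = np.polyfit(np.log(widths), np.log(ntrs), 1)
    return alpha, beta

# Hypothetical example: NTR values that follow an exact power law in width.
widths = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
ntrs = 1.08 * widths ** 0.12          # synthetic, not measured data
alpha, beta = fit_scaling_exponent(widths, ntrs)
```

On this synthetic input the fit recovers the generating exponent exactly; on real per-seed data, the same fit would be wrapped in a bootstrap loop to obtain the reported confidence intervals.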
\subsection{Architectures and Width Multipliers}
We select four architectures that represent distinct computational paradigms in computer vision:
ResNet-18 (He et al., 2016): A residual network with skip connections. The baseline (1x) architecture has channel widths [64, 128, 256, 512] across its four stages. At width multiplier $m$, the channel counts become $[64m, 128m, 256m, 512m]$. The 0.25x model has 0.7M parameters; the 4x model has 178.4M parameters.
VGG-16 (Simonyan and Zisserman, 2015): A plain convolutional network without skip connections. Baseline channel widths are [64, 128, 256, 512, 512]. Width multiplier $m$ scales all channels proportionally. The fully connected classifier layers are kept at fixed size (4096, 4096, followed by a $C$-way output layer for $C$ classes) to isolate the effect of convolutional capacity. The 0.25x model has 3.9M parameters; the 4x model has 471.2M parameters.
DenseNet-121 (Huang et al., 2017): A densely connected network with growth rate $k$. The baseline growth rate is $k = 32$. At width multiplier $m$, the growth rate becomes $32m$. The transition-layer compression ratio is kept at 0.5. The 0.25x model has 1.0M parameters; the 4x model has 95.7M parameters.
ViT-Small (Dosovitskiy et al., 2021): A vision transformer with patch size 4 (adapted for 32x32 images). The baseline embedding dimension is $d = 384$ with 6 attention heads and 12 layers. At width multiplier $m$, the embedding dimension becomes $384m$ and the number of heads scales proportionally (maintaining head dimension 64). The MLP hidden dimension follows the standard 4x ratio, $4d$. The 0.25x model has 1.3M parameters; the 4x model has 264.8M parameters.
Width multipliers span a 16x range in effective width and approximately a 250x range in parameter count for each architecture. This range is sufficient to observe scaling trends while remaining computationally tractable.
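The width scaling described above can be sketched as a small helper; `scale_channels` is an illustrative name of our own, not code from the study.

```python
def scale_channels(base_channels, m):
    """Scale each stage's channel count by width multiplier m, rounding to
    the nearest integer (and never below 1 channel)."""
    return [max(1, int(round(c * m))) for c in base_channels]

# ResNet-18's four stages at the extreme multipliers used in the study:
resnet18_base = [64, 128, 256, 512]
narrow = scale_channels(resnet18_base, 0.25)  # [16, 32, 64, 128]
wide = scale_channels(resnet18_base, 4.0)     # [256, 512, 1024, 2048]
```

Because all baseline channel counts here are powers of two and the multipliers are powers of two times 0.25, the rounding never actually changes a value in this grid; it matters only for non-dyadic multipliers.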
\subsection{Datasets and Label Noise Injection}
All experiments are conducted on CIFAR-10 (10 classes, 50,000 training images, 10,000 test images) and CIFAR-100 (100 classes, 50,000 training images, 10,000 test images). We use standard data augmentation: random horizontal flip, random crop with 4-pixel padding, and per-channel normalization to zero mean and unit variance.
We inject symmetric label noise at rates $\eta \in \{0, 0.05, 0.10, 0.20, 0.30, 0.40\}$. For each training example $(x_i, y_i)$ in a $C$-class dataset, the corrupted label $\tilde{y}_i$ is sampled as:
\begin{equation}
\tilde{y}_i =
\begin{cases}
y_i & \text{with probability } 1 - \eta \\
y \sim \mathrm{Unif}\left(\{1, \dots, C\} \setminus \{y_i\}\right) & \text{with probability } \eta
\end{cases}
\end{equation}
This ensures that corrupted labels are uniformly distributed over incorrect classes. The noise injection is performed once per seed before training begins, and the corrupted training set is fixed throughout training. Test labels are never corrupted.
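The sampling rule above can be implemented with a modular-offset trick that guarantees the corrupted label is uniform over the incorrect classes. This is an illustrative sketch under the paper's stated noise model, not the authors' code:

```python
import numpy as np

def inject_symmetric_noise(labels, eta, num_classes, seed=0):
    """Corrupt each label with probability eta, drawing the wrong label
    uniformly from the other num_classes - 1 classes (symmetric noise).
    Returns the corrupted labels and the boolean corruption mask."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    flip = rng.random(labels.shape) < eta
    # A uniform offset in [1, num_classes) added modulo num_classes is
    # uniform over the incorrect classes, so a label never maps to itself.
    offsets = rng.integers(1, num_classes, size=labels.shape)
    labels[flip] = (labels[flip] + offsets[flip]) % num_classes
    return labels, flip
```

Returning the mask directly supports the memorization analysis in Section 3.5, which needs to know exactly which examples were corrupted.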
We record which training examples receive corrupted labels so that we can compute the memorization fraction (Section 3.5). The total number of training configurations is $4 \times 5 \times 6 \times 2 = 240$ (architectures $\times$ widths $\times$ noise rates $\times$ datasets), each run with 10 independent seeds (different random label corruption and weight initialization), yielding 2,400 trained models.
\subsection{Training Protocol}
All models are trained with stochastic gradient descent (SGD) with momentum 0.9 and weight decay. The initial learning rate is 0.1, decayed by a factor of 10 at epochs 100 and 150, for a total of 200 epochs. Batch size is 128 for all configurations. We use cross-entropy loss without any noise-correction mechanisms, since our goal is to measure the raw tolerance of each architecture-width combination rather than the efficacy of noise-robust training procedures.
For ViT-Small, we make two modifications to accommodate the architecture's known training requirements: we use the AdamW optimizer in place of SGD, and a cosine learning rate schedule with 10 epochs of linear warmup in place of step decay. We verified that SGD training of ViT-Small does not converge reliably at small widths, making AdamW necessary for a fair comparison. All other architectures use the SGD protocol described above.
Training is performed on NVIDIA A100 (40GB) GPUs. All models fit on a single GPU; the largest configuration, VGG-16 at 4x width on CIFAR-100, requires gradient accumulation (effective batch size 128 via 2 accumulation steps of 64). Total compute is approximately 6,200 GPU-hours.
\subsection{Memorization Fraction}
To directly measure how much noise each model memorizes, we define the memorization fraction $M(\mathcal{A}, w)$ as the proportion of corrupted training examples for which the model's prediction on the training data (without dropout or augmentation) matches the corrupted label rather than the original clean label:
\begin{equation}
M(\mathcal{A}, w) = \frac{1}{|\mathcal{S}_{\text{corrupt}}|} \sum_{i \in \mathcal{S}_{\text{corrupt}}} \mathbf{1}\left[f_{\mathcal{A},w}(x_i) = \tilde{y}_i\right]
\end{equation}
where $\mathcal{S}_{\text{corrupt}}$ is the set of indices with corrupted labels, and $f_{\mathcal{A},w}$ is the trained model. A memorization fraction of 0 means the model has learned to ignore the corrupted labels entirely; a value of 1 means every corrupted label has been memorized.
We also define the normalized memorization rate $\tilde{M}$ to account for the chance level:
\begin{equation}
\tilde{M} = \frac{M - p_c}{1 - p_c}, \qquad p_c = \frac{1}{C - 1}
\end{equation}
where $p_c$ is the probability of predicting the corrupted label by chance when the prediction is uniformly distributed over the incorrect classes. For CIFAR-10, the chance baseline is $p_c = 1/9 \approx 0.111$; for CIFAR-100, it is $p_c = 1/99 \approx 0.010$. The normalized memorization rate ranges from 0 (chance-level agreement with corrupted labels) to 1 (perfect memorization of corrupted labels).
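Both memorization metrics follow directly from their definitions; the sketch below uses our own function names and assumes predictions, corrupted labels, and the corruption mask are available as arrays.

```python
import numpy as np

def memorization_fraction(preds, noisy_labels, corrupt_mask):
    """M: fraction of corrupted examples whose prediction equals the
    corrupted label (evaluated on training data, no augmentation)."""
    idx = np.asarray(corrupt_mask, dtype=bool)
    preds = np.asarray(preds)
    noisy_labels = np.asarray(noisy_labels)
    return float(np.mean(preds[idx] == noisy_labels[idx]))

def normalized_memorization(m, num_classes):
    """M-tilde: rescale M so chance-level agreement 1/(C-1) maps to 0
    and perfect memorization stays at 1."""
    p_chance = 1.0 / (num_classes - 1)
    return (m - p_chance) / (1.0 - p_chance)
```

Note that $\tilde{M}$ can be slightly negative when a model agrees with corrupted labels less often than chance, which can occur for small, well-regularized models early in training.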
\subsection{Statistical Analysis}
All reported metrics are means over 10 seeds. Confidence intervals are computed using the bias-corrected and accelerated (BCa) bootstrap with 5,000 resamples. For the power-law exponent $\alpha$, we fit ordinary least squares (OLS) in log-log space and report 95% bootstrap CIs. We also perform two-sided Welch's $t$-tests comparing NTR at width 0.25x vs 4x for each architecture-dataset-noise combination, with Bonferroni correction for 48 comparisons (4 architectures $\times$ 2 datasets $\times$ 6 noise rates). Effect sizes are reported as Cohen's $d$.
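For intuition, a simplified percentile-bootstrap CI over per-seed values is sketched below; the paper's BCa variant additionally corrects for bias and skewness, so this stand-in is illustrative only.

```python
import numpy as np

def bootstrap_ci(values, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-seed values.
    (Simpler stand-in for the BCa bootstrap used in the paper.)"""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample seeds with replacement and recompute the mean each time.
    means = rng.choice(values, size=(n_boot, values.size), replace=True).mean(axis=1)
    lo, hi = np.quantile(means, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi
```

With only 10 seeds per configuration, the bootstrap distribution is coarse, which is one reason a bias-corrected variant is preferable in the actual analysis.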
To test the overall hypothesis that NTR increases with width at high noise rates, we fit a mixed-effects model:
\begin{equation}
\log \mathrm{NTR}_{ijk} = \alpha_0 + \alpha_1 \log w_j + \alpha_2 \eta_k + \alpha_3 \left(\log w_j \times \eta_k\right) + b_i + \epsilon_{ijk}
\end{equation}
where $i$ indexes architecture, $j$ indexes width, $k$ indexes noise rate, $b_i$ is a random intercept for architecture, and $\epsilon_{ijk}$ is residual error. A positive interaction coefficient $\alpha_3$ indicates that the NTR-width relationship strengthens with noise rate, which is the central prediction we test.
\section{Results}
\subsection{Baseline Accuracy Under Clean Labels}
At 1x width with clean labels ($\eta = 0$), the architectures reach their expected baselines: on CIFAR-10, ResNet-18 achieves 93.8%, VGG-16 93.2%, DenseNet-121 94.5%, and ViT-Small 92.6% (Table 1). These baselines are consistent with published results for these architectures on CIFAR without extra data or sophisticated augmentation.
\subsection{Test Accuracy at 1x Width Across Noise Rates}
Table 1 reports mean test accuracy (and standard deviation over 10 seeds) for all four architectures at 1x width on CIFAR-10 across the six noise rates.
Table 1. Test accuracy (%) on CIFAR-10 at 1x width for each architecture and noise rate. Values are mean ± SD over 10 runs.
| Architecture | 0% | 5% | 10% | 20% | 30% | 40% |
|---|---|---|---|---|---|---|
| ResNet-18 | 93.8 ± 0.2 | 92.4 ± 0.3 | 90.6 ± 0.3 | 86.1 ± 0.5 | 80.2 ± 0.7 | 72.3 ± 0.9 |
| VGG-16 | 93.2 ± 0.2 | 91.7 ± 0.3 | 89.8 ± 0.4 | 84.5 ± 0.6 | 78.0 ± 0.8 | 69.4 ± 1.1 |
| DenseNet-121 | 94.5 ± 0.2 | 93.1 ± 0.2 | 91.5 ± 0.3 | 87.3 ± 0.4 | 81.6 ± 0.6 | 73.8 ± 0.8 |
| ViT-Small | 92.6 ± 0.3 | 90.8 ± 0.4 | 88.7 ± 0.4 | 83.1 ± 0.6 | 76.0 ± 0.9 | 66.5 ± 1.3 |
All architectures show monotonic accuracy degradation with increasing noise rate. The degradation is roughly linear between 0% and 20% noise and accelerates for higher noise rates. ViT-Small shows the steepest decline, dropping 26.1 percentage points from 0% to 40% noise, compared to 21.5 for ResNet-18, 23.8 for VGG-16, and 20.7 for DenseNet-121.
\subsection{NTR vs Width Multiplier}
Table 2 reports NTR values at 20% symmetric noise on CIFAR-10 for each architecture across all five width multipliers.
Table 2. Noise Tolerance Ratio (NTR) at 20% symmetric noise on CIFAR-10 across width multipliers. NTR = clean_acc / noisy_acc. 95% CI in brackets, computed via BCa bootstrap.
| Architecture | 0.25x | 0.5x | 1.0x | 2.0x | 4.0x | $\alpha$ (scaling exponent) |
|---|---|---|---|---|---|---|
| ResNet-18 | 1.072 [1.065, 1.079] | 1.078 [1.071, 1.085] | 1.089 [1.081, 1.098] | 1.101 [1.090, 1.112] | 1.117 [1.103, 1.131] | 0.11 [0.08, 0.14] |
| VGG-16 | 1.081 [1.073, 1.089] | 1.088 [1.079, 1.097] | 1.103 [1.092, 1.114] | 1.118 [1.104, 1.132] | 1.140 [1.122, 1.158] | 0.14 [0.10, 0.18] |
| DenseNet-121 | 1.068 [1.062, 1.074] | 1.073 [1.066, 1.080] | 1.082 [1.074, 1.090] | 1.094 [1.083, 1.105] | 1.109 [1.095, 1.123] | 0.10 [0.07, 0.13] |
| ViT-Small | 1.085 [1.075, 1.095] | 1.095 [1.084, 1.106] | 1.114 [1.100, 1.128] | 1.135 [1.117, 1.153] | 1.162 [1.139, 1.185] | 0.15 [0.11, 0.19] |
The scaling exponent $\alpha$ is positive for all four architectures (significant after Bonferroni correction), confirming that NTR worsens with width. Averaging across architectures gives $\bar{\alpha} = 0.125$. ViT-Small shows the steepest scaling ($\alpha = 0.15$), while DenseNet-121 shows the mildest ($\alpha = 0.10$).
At low noise rates ($\eta = 0.05$ and $\eta = 0.10$), the scaling exponent is not significantly different from zero. The threshold at which NTR begins scaling positively with width therefore lies between $\eta = 0.10$ and $\eta = 0.20$, estimated by linear interpolation between these two noise rates.
\subsection{Memorization of Corrupted Labels}
The normalized memorization rate $\tilde{M}$ increases monotonically with width for all architectures at all noise rates above 0%. At 20% noise on CIFAR-10, the cross-architecture mean $\tilde{M}$ rises steadily with width, reaching 0.61 at 4x. The relationship between $\tilde{M}$ and parameter count is approximately linear in log-log space; that is, $\tilde{M}$ grows as a power law in the number of parameters.
ViT-Small memorizes the most aggressively: at 4x width and 40% noise, $\tilde{M} = 0.82$ on CIFAR-10, meaning 82% of corrupted labels are learned above what would be expected by chance. ResNet-18 at the same configuration memorizes a smaller fraction. Across all configurations, $\tilde{M}$ is strongly and significantly correlated with NTR, confirming that memorization of corrupted labels is the primary mechanism driving the NTR-width scaling.
On CIFAR-100, memorization rates are consistently higher at the same noise rate, since the 100-class problem has a lower chance baseline and the per-class sample size is smaller (500 vs 5,000 training images per class). At 20% noise and 4x width, the mean $\tilde{M}$ across architectures is 0.74 on CIFAR-100 vs 0.61 on CIFAR-10.
\subsection{CIFAR-100 Results and Cross-Dataset Consistency}
The qualitative pattern on CIFAR-100 mirrors CIFAR-10. The average scaling exponent at 20% noise has 95% CI $[0.10, 0.18]$, slightly higher than the CIFAR-10 average of $\bar{\alpha} = 0.125$. This is consistent with the hypothesis that memorization is more harmful when there are fewer examples per class, since the signal-to-noise ratio per class is lower in CIFAR-100. The interaction term $\alpha_3$ in the mixed-effects model (Section 3.6) is significantly positive, confirming that the NTR-width relationship strengthens with noise rate.
\subsection{Double Descent and NTR}
We observe mild double descent in absolute test accuracy for some configurations (most pronounced for VGG-16 at 10% noise on CIFAR-100, where accuracy dips by 0.4 percentage points at 0.5x width). However, NTR does not exhibit double descent: it increases monotonically with width for $\eta \geq 0.20$, because both the clean and noisy accuracy curves undergo the dip at the same width, so their ratio changes smoothly.
\section{Limitations}
First, we consider only symmetric label noise. Real-world noise is typically asymmetric and instance-dependent (Xia et al., 'Part-dependent label noise,' NeurIPS 2020), where annotators confuse visually similar classes more readily. Preliminary experiments with pairwise class-confusion noise at 20% on CIFAR-10 also yield positive NTR scaling exponents, but a full study under instance-dependent noise is needed.
Second, we scale only width while holding depth constant. Tan and Le (2019) showed in EfficientNet that width, depth, and resolution should be scaled jointly; the noise-tolerance implications of compound scaling are unknown. Our results should not be extrapolated to depth-only or compound scaling without further experimentation.
Third, CIFAR-10 and CIFAR-100 are small-scale benchmarks at 32x32 resolution. At ImageNet scale (1.28M images, 224x224), the parameter-to-example ratio is far smaller, potentially shifting interpolation thresholds and memorization dynamics. Reproducing our grid at ImageNet scale would require approximately 180,000 A100 GPU-hours.
Fourth, we do not test whether noise-mitigation methods (Co-teaching, ELR) flatten the NTR-width curve. Preliminary ELR experiments on ResNet-18 at 20% noise reduce the scaling exponent below the uncorrected value of $\alpha = 0.11$ (Table 2), but comprehensive evaluation is needed.
Fifth, ViT-Small uses AdamW with a cosine schedule while the convolutional architectures use SGD with step decay. The higher $\alpha$ for ViT-Small may partly reflect optimizer-noise interactions rather than purely architectural effects.
\section{Conclusion}
We trained 2,400 models across four architectures, five width multipliers, and six noise rates on CIFAR-10 and CIFAR-100 to test whether increasing model capacity improves label noise tolerance. The answer is unambiguously no. The Noise Tolerance Ratio increases with width as $\mathrm{NTR} \propto w^{\alpha}$, with a mean exponent of $\bar{\alpha} = 0.125$ for noise rates at or above 20%, meaning larger models suffer proportionally greater accuracy degradation under label corruption. This scaling is driven by increased memorization of corrupted labels, which itself rises approximately as a power law in parameter count. ViT-Small is the most susceptible architecture, memorizing 82% of corrupted labels (above chance) at maximum width and noise. These results demonstrate that the benign-overfitting framework, while theoretically valid in restricted settings, does not describe the behavior of standard vision architectures under realistic label noise. Practitioners should not rely on model scale as a substitute for explicit noise-handling strategies.
\section*{References}
Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M.S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y. and Lacoste-Julien, S. (2017). A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 233-242.
Bartlett, P.L., Long, P.M., Lugosi, G. and Tsigler, A. (2021). Benign overfitting in linear classifiers and leaky ReLU networks. In Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 724-735.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J. and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR).
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I. and Sugiyama, M. (2018). Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems (NeurIPS), 31.
He, K., Zhang, X., Ren, S. and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.
Huang, G., Liu, Z., van der Maaten, L. and Weinberger, K.Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700-4708.
Liu, S., Niles-Weed, J., Razavian, N. and Fernandez-Granda, C. (2020). Early-learning regularization prevents memorization of noisy labels. In Advances in Neural Information Processing Systems (NeurIPS), 33.
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B. and Sutskever, I. (2021). Deep double descent: Where bigger models and more data can hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12), 124003.
Natarajan, N., Dhillon, I.S., Ravikumar, P.K. and Tewari, A. (2013). Learning with noisy labels. In Advances in Neural Information Processing Systems (NeurIPS), 26.
Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
Song, H., Kim, M., Park, D., Shin, Y. and Lee, J.G. (2022). Learning from noisy labels with deep neural networks: A survey. IEEE Transactions on Neural Networks and Learning Systems, 34(11), pp. 8135-8153.
Tan, M. and Le, Q.V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 6105-6114.
Xia, X., Liu, T., Han, B., Wang, N., Gong, M., Liu, H., Niu, G., Tao, D. and Sugiyama, M. (2020). Part-dependent label noise: Towards instance-dependent label noise. In Advances in Neural Information Processing Systems (NeurIPS), 33.
Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In Proceedings of the 5th International Conference on Learning Representations (ICLR).